Importing data with Pandas

Author

Karl Gregory

In this note we will learn some basics of importing and manipulating data with the pandas library. The Pandas library is almost always imported as pd. We will follow this convention.

Importing data

Let’s consider importing data from a text file with Pandas. The code below shows how we can take a peek at what a file looks like.

import pandas as pd

dr = "/Users/karlgregory/Library/CloudStorage/OneDrive-UniversityofSouthCarolina/myBooks/compstat/data/" # directory where I have my data files
filename = dr + "safari_comma_missing.dat" # put together the path to the particular file I want 

file = open(filename)  # create a connection to a file
contents = file.read() # read the contents of the file
print(contents)        # print the contents of the file
file.close             # close the connection to the file
Some make-believe safari data
Woohoo!!

date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999,12,none,2,sunny,7:21 am,4:14 pm,yes,Joshua Tebbs
4/28/2001,3,1,1,cloudy,6:25 am,12:33 pm,Y,Edsel Peña
10/15/2010,3,.,6,rainy,8:12 am,,no,Karl Bruce Gregory
3/02/2006,1,14,5,hot/sunny,7:15 am,3:12 pm,y,Lianming Wang
2/28/1988,2,6,3,partly cloudy,4:53 am,2:16 pm,Yes,Brian Habing
7/14/2015,3,12,0,cloudy,5:47 am,3:46 pm,No,Edwards
<function TextIOWrapper.close()>

Now we will use the read_table function from the Pandas library to read in the data. It takes arguments similar to those we used in R when using the read.table() function, but it has some different default settings.

The file whose contents we displayed above has comma separated values and a few missing values, where these are not always coded with the same character. Note how we can handle this with the read_table function:

safari = pd.read_table(filename,sep = ",",skiprows = 3,na_values = ['.',' '])
safari
date wildebeest laughing hyena crocodile weather start end fun guide
0 1/13/1999 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs
1 4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña
2 10/15/2010 3 NaN 6 rainy 8:12 am NaN no Karl Bruce Gregory
3 3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang
4 2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing
5 7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm No Edwards

What type of object gets created when we read in a data set with the Pandas read_table function?

type(safari)
pandas.core.frame.DataFrame

One of the main Pandas objects is the DataFrame. It is much like a data.frame in R, as it can have columns of different data types (numeric, string, boolean). The Pandas “vector” object is called a Series, and we will see that a data frame consists of several series put together as columns in a spreadsheet-like way. The other kind of object in Pandas is the Index, about which more will be introduced as needed.

Here is a tab-delimited version of the same data file:

filename = dr + "safari_tab_missing.dat"
file = open(filename)
contents = file.read()
print(contents)
file.close
Some make-believe safari data
Woohoo!!

date    wildebeest  laughing hyena  crocodile   weather start   end fun guide
1/13/1999   12  none    2   sunny   7:21 am 4:14 pm yes Joshua Tebbs
4/28/2001   3   1   1   cloudy  6:25 am 12:33 pm    Y   Edsel Peña
10/15/2010  3   .   6   rainy   8:12 am     no  Karl Bruce Gregory
3/02/2006   1   14  5   hot/sunny   7:15 am 3:12 pm y   Lianming Wang
2/28/1988   2   6   3   partly cloudy   4:53 am 2:16 pm Yes Brian Habing
7/14/2015   3   12  0   cloudy  5:47 am 3:46 pm No  Edwards
<function TextIOWrapper.close()>
filename = dr + "safari_tab_missing.dat"
pd.read_table(filename,sep = "\t",skiprows = 3,na_values = ['.',' ']) # use \t 
date wildebeest laughing hyena crocodile weather start end fun guide
0 1/13/1999 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs
1 4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña
2 10/15/2010 3 NaN 6 rainy 8:12 am NaN no Karl Bruce Gregory
3 3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang
4 2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing
5 7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm No Edwards

We can use the read_fwf function in Pandas to read data in fixed-width format:

filename = dr + "safari_fwf_missing.dat"
file = open(filename)
contents = file.read()
print(contents)
file.close

pd.read_fwf(filename,
            skiprows=4,
            widths=[11,3,5,3,14,9,9,4,18],
            header = None, # since the header names are not giving in the same columns as the data values
            names = ['date', 'wildebeest','laughing hyena','crocodile','weather','start','end','fun','guide'])
Some make-believe safari data
Woohoo!!

date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999  12 none 2  sunny         7:21 am  4:14 pm  yes Joshua Tebbs
4/28/2001  3  1    1  cloudy        6:25 am  12:33 pm Y   Edsel Peña
10/15/2010 3  6       rainy         8:12 am           no  Karl Bruce Gregory
3/02/2006  1  14   5  hot/sunny     7:15 am  3:12 pm  y   Lianming Wang
2/28/1988  2  6    3  partly cloudy 4:53 am  2:16 pm  Yes Brian Habing
7/14/2015  3  12   0  cloudy        5:47 am  3:46 pm  No  Edwards
date wildebeest laughing hyena crocodile weather start end fun guide
0 1/13/1999 12 none 2.0 sunny 7:21 am 4:14 pm yes Joshua Tebbs
1 4/28/2001 3 1 1.0 cloudy 6:25 am 12:33 pm Y Edsel Peña
2 10/15/2010 3 6 NaN rainy 8:12 am NaN no Karl Bruce Gregory
3 3/02/2006 1 14 5.0 hot/sunny 7:15 am 3:12 pm y Lianming Wang
4 2/28/1988 2 6 3.0 partly cloudy 4:53 am 2:16 pm Yes Brian Habing
5 7/14/2015 3 12 0.0 cloudy 5:47 am 3:46 pm No Edwards

Dates in Python

The prescribed way to work with dates in Python is with the datetime library, which you can read about here.

import datetime

tcm = datetime.datetime.today() #(year, month, day, hour, minute, second, microsecond -- millionth of a second)
tcm # this current moment
datetime.datetime(2025, 10, 6, 11, 22, 14, 330329)
print(tcm)
2025-10-06 11:22:14.330329

If you need to put in a time zone other than the local one, you need to use the function datetime.now(), which takes a time zone argument. We won’t mess with this.

A datetime object has the year, day, and so one as attributes that we can access with Python’s <object>.<attribute> syntax:

print(tcm.year)
print(tcm.month)
print(tcm.day)
print(tcm.hour)
print(tcm.minute)
print(tcm.second)
print(tcm.microsecond)
2025
10
6
11
22
14
330329

Here are two important functions:

  • strptime: Convert a character string to a datetime object
  • strftime: Convert a datetime object to a character string

Here is an example of how these function work:

viva = datetime.datetime.strptime('11/22/2015 02:23','%m/%d/%Y %H:%M')
viva
datetime.datetime(2015, 11, 22, 2, 23)
datetime.datetime.strftime(viva,'%B %d, Anno Domini %Y, at %H:%M')
'November 22, Anno Domini 2015, at 02:23'

Here is a description of date/time formats.

a = datetime.datetime.strptime('2:20 pm','%I:%M %p')
print(a)
1900-01-01 14:20:00
b = datetime.datetime.strptime('1/13/1999','%m/%d/%Y')
print(b)
1999-01-13 00:00:00

Pandas has some of its own functions for working with time data which are in a Pandas series or data frame. We can convert a Pandas series with character string dates into a series with date time values with the to_datetime function in Pandas. Notice that if no date information is given, the date is set to 1900-01-01.

start = pd.to_datetime(safari.start,format='%I:%M %p') # convert the 12-hour clock times to 24-hour clock times
end = pd.to_datetime(safari.end,format='%I:%M %p') 
print(end)
0   1900-01-01 16:14:00
1   1900-01-01 12:33:00
2                   NaT
3   1900-01-01 15:12:00
4   1900-01-01 14:16:00
5   1900-01-01 15:46:00
Name: end, dtype: datetime64[ns]

To take the difference between the date times in two Pandas series, we can simply use - to subtract one from the other. This creates a timedelta object.

safari['duration'] = end - start
safari
date wildebeest laughing hyena crocodile weather start end fun guide duration
0 1/13/1999 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs 0 days 08:53:00
1 4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña 0 days 06:08:00
2 10/15/2010 3 NaN 6 rainy 8:12 am NaN no Karl Bruce Gregory NaT
3 3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang 0 days 07:57:00
4 2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing 0 days 09:23:00
5 7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm No Edwards 0 days 09:59:00

Suppose I want the dates in a different format in the safari data. I can transform them to actual datetime values.

date = pd.to_datetime(safari.date,format='%m/%d/%Y')
safari['date'] = date
safari
date wildebeest laughing hyena crocodile weather start end fun guide duration
0 1999-01-13 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs 0 days 08:53:00
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña 0 days 06:08:00
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN no Karl Bruce Gregory NaT
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang 0 days 07:57:00
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing 0 days 09:23:00
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm No Edwards 0 days 09:59:00

Now let’s get the weekday of each safari. The weekday function gives Monday through Sunday the values 0 through 6, respectively.

safari['weekday'] = safari.date.dt.weekday
safari
date wildebeest laughing hyena crocodile weather start end fun guide duration weekday
0 1999-01-13 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs 0 days 08:53:00 2
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña 0 days 06:08:00 5
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN no Karl Bruce Gregory NaT 4
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang 0 days 07:57:00 3
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing 0 days 09:23:00 6
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm No Edwards 0 days 09:59:00 1

Working with text

Let’s learn a little bit about regular expressions. Read more here

import re
ch = 'Karl Bruce Gregory'
print(re.sub('Karl','Yarnold',ch))  # replace a substring with another substring
print(re.sub('[aeiou]','*',ch))     # replace vowels with '*'
print(re.sub('[a-z]','-',ch))       # replace lower-case letters with '-'
print(re.search('go',ch))           # get location of a string
print(re.findall('r',ch))           # find all 'r's
print(re.split('a',ch))             # split on 'a'
print(re.split('[aeiou]',ch))       # split on vowels
print(re.findall('[A-Z]',ch))       # Find capital letters
print(re.search(' ',ch))            # Find first white space character
Yarnold Bruce Gregory
K*rl Br*c* Gr*g*ry
K--- B---- G------
<re.Match object; span=(14, 16), match='go'>
['r', 'r', 'r', 'r']
['K', 'rl Bruce Gregory']
['K', 'rl Br', 'c', ' Gr', 'g', 'ry']
['K', 'B', 'G']
<re.Match object; span=(4, 5), match=' '>

We can use the str.join function to collapse a list of character strings to a single string, with a character in between:

initials = re.findall('[A-Z]',ch) # get the capital letters
'.'.join(initials) + '.' # place a '.' in between
'K.B.G.'

Here is a function which gives a short version of someone’s name:

def nameabb(ch):
    
    full = re.split(' ',ch)
    n = len(full)
    
    if n == 1 :
        
        return full[0]
        
    else:

        abb = ''
        
        for i in range(0,n-1): abb += full[i][0] + '. '

        abb += full[n-1]

    return(abb)
    
print(nameabb('Matthew Frederick Thaddeus Bailey'))
print(nameabb('Karl Bruce Gregory'))
print(nameabb('Aad W. van der Vaart'))
print(nameabb('Snoopy'))
M. F. T. Bailey
K. B. Gregory
A. W. v. d. Vaart
Snoopy

When we are working with a Pandas data frame it will be best to use the Pandas collection of string operations. We’ll give examples of just a few of them. This topic is more exhaustively covered in this book chapter as well as here.

Suppose I want to make a new column in the data set called “overcast” which has the value False when the string “sunny” appears in the weather column and which is true otherwise. Note that we can use ~ to negate the booleans in a boolean Pandas Series (we cannot use not in front). We use the str.contains function from Pandas to get our boolean series.

safari['overcast']= ~safari.weather.str.contains('sunny') # ~ turns False to True and True to False
safari
date wildebeest laughing hyena crocodile weather start end fun guide duration weekday overcast
0 1999-01-13 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs 0 days 08:53:00 2 False
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña 0 days 06:08:00 5 True
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN no Karl Bruce Gregory NaT 4 True
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang 0 days 07:57:00 3 False
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing 0 days 09:23:00 6 True
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm No Edwards 0 days 09:59:00 1 True

We can apply a function to all the elemens in a Pandas series with the map function. For example, if we want to abbreviate the names of the guides, we can apply the nameabb function we defined earlier:

safari.guide = safari.guide.map(nameabb)
safari
date wildebeest laughing hyena crocodile weather start end fun guide duration weekday overcast
0 1999-01-13 12 none 2 sunny 7:21 am 4:14 pm yes J. Tebbs 0 days 08:53:00 2 False
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm Y E. Peña 0 days 06:08:00 5 True
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN no K. B. Gregory NaT 4 True
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm y L. Wang 0 days 07:57:00 3 False
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm Yes B. Habing 0 days 09:23:00 6 True
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm No Edwards 0 days 09:59:00 1 True

Suppose we want to clean up the “fun” column and convert the values to booleans. One way to do this is through the use of a dictionary object. A dictionary can be used to create a mapping from the different values appearing in the “fun” series to the values True or False. Then we can use the map function in Pandas to perform the mapping:

fundict = {'Yes' : True, 'Y' : True, 'y' : True, 'yes' : True, 'no' : False, 'No' : False} # a dictionary
safari.fun = safari.fun.map(fundict)
safari
date wildebeest laughing hyena crocodile weather start end fun guide duration weekday overcast
0 1999-01-13 12 none 2 sunny 7:21 am 4:14 pm True J. Tebbs 0 days 08:53:00 2 False
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm True E. Peña 0 days 06:08:00 5 True
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN False K. B. Gregory NaT 4 True
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm True L. Wang 0 days 07:57:00 3 False
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm True B. Habing 0 days 09:23:00 6 True
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm False Edwards 0 days 09:59:00 1 True

If accessing a subset of a Pandas data frame, a good way is to use .loc and then provide the indices, as in the example below. If you try to access a subset using just square brackets as though the data frame were a numpy array, you may have problems. See this page to learn more.

rp = safari['laughing hyena'] == "none"
safari.loc[rp,'laughing hyena'] = 0
safari
date wildebeest laughing hyena crocodile weather start end fun guide duration weekday overcast
0 1999-01-13 12 0 2 sunny 7:21 am 4:14 pm True J. Tebbs 0 days 08:53:00 2 False
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm True E. Peña 0 days 06:08:00 5 True
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN False K. B. Gregory NaT 4 True
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm True L. Wang 0 days 07:57:00 3 False
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm True B. Habing 0 days 09:23:00 6 True
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm False Edwards 0 days 09:59:00 1 True

Sorting a Pandas data frame

safari.sort_values(by='date')
date wildebeest laughing hyena crocodile weather start end fun guide duration weekday overcast
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm True B. Habing 0 days 09:23:00 6 True
0 1999-01-13 12 0 2 sunny 7:21 am 4:14 pm True J. Tebbs 0 days 08:53:00 2 False
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm True E. Peña 0 days 06:08:00 5 True
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm True L. Wang 0 days 07:57:00 3 False
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN False K. B. Gregory NaT 4 True
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm False Edwards 0 days 09:59:00 1 True
safari.sort_values(by=['fun','date'])
date wildebeest laughing hyena crocodile weather start end fun guide duration weekday overcast
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN False K. B. Gregory NaT 4 True
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm False Edwards 0 days 09:59:00 1 True
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm True B. Habing 0 days 09:23:00 6 True
0 1999-01-13 12 0 2 sunny 7:21 am 4:14 pm True J. Tebbs 0 days 08:53:00 2 False
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm True E. Peña 0 days 06:08:00 5 True
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm True L. Wang 0 days 07:57:00 3 False

Note that the data do not stay sorted:

safari
date wildebeest laughing hyena crocodile weather start end fun guide duration weekday overcast
0 1999-01-13 12 0 2 sunny 7:21 am 4:14 pm True J. Tebbs 0 days 08:53:00 2 False
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm True E. Peña 0 days 06:08:00 5 True
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN False K. B. Gregory NaT 4 True
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm True L. Wang 0 days 07:57:00 3 False
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm True B. Habing 0 days 09:23:00 6 True
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm False Edwards 0 days 09:59:00 1 True

But we can make it stay sorted by setting inplace = True:

safari.sort_values(by=['fun','date'],inplace = True)
safari
date wildebeest laughing hyena crocodile weather start end fun guide duration weekday overcast
2 2010-10-15 3 NaN 6 rainy 8:12 am NaN False K. B. Gregory NaT 4 True
5 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm False Edwards 0 days 09:59:00 1 True
4 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm True B. Habing 0 days 09:23:00 6 True
0 1999-01-13 12 0 2 sunny 7:21 am 4:14 pm True J. Tebbs 0 days 08:53:00 2 False
1 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm True E. Peña 0 days 06:08:00 5 True
3 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm True L. Wang 0 days 07:57:00 3 False