In this note we will learn some basics of importing and manipulating data with the pandas library. The Pandas library is almost always imported as pd. We will follow this convention.
Importing data
Let’s consider importing data from a text file with Pandas. The code below shows how we can take a peek at what a file looks like.
import pandas as pddr ="/Users/karlgregory/Library/CloudStorage/OneDrive-UniversityofSouthCarolina/myBooks/compstat/data/"# directory where I have my data filesfilename = dr +"safari_comma_missing.dat"# put together the path to the particular file I want file=open(filename) # create a connection to a filecontents =file.read() # read the contents of the fileprint(contents) # print the contents of the filefile.close # close the connection to the file
Some make-believe safari data
Woohoo!!
date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999,12,none,2,sunny,7:21 am,4:14 pm,yes,Joshua Tebbs
4/28/2001,3,1,1,cloudy,6:25 am,12:33 pm,Y,Edsel Peña
10/15/2010,3,.,6,rainy,8:12 am,,no,Karl Bruce Gregory
3/02/2006,1,14,5,hot/sunny,7:15 am,3:12 pm,y,Lianming Wang
2/28/1988,2,6,3,partly cloudy,4:53 am,2:16 pm,Yes,Brian Habing
7/14/2015,3,12,0,cloudy,5:47 am,3:46 pm,No,Edwards
<function TextIOWrapper.close()>
Now we will use the read_table function from the Pandas library to read in the data. It takes arguments similar to those we used in R when using the read.table() function, but it has some different default settings.
The file whose contents we displayed above has comma separated values and a few missing values, where these are not always coded with the same character. Note how we can handle this with the read_table function:
What type of object gets created when we read in a data set with the Pandas read_table function?
type(safari)
pandas.core.frame.DataFrame
One of the main Pandas objects is the DataFrame. It is much like a data.frame in R, as it can have columns of different data types (numeric, string, boolean). The Pandas “vector” object is called a Series, and we will see that a data frame consists of several series put together as columns in a spreadsheet-like way. The other kind of object in Pandas is the Index, about which more will be introduced as needed.
Here is a tab-delimited version of the same data file:
filename = dr +"safari_tab_missing.dat"file=open(filename)contents =file.read()print(contents)file.close
Some make-believe safari data
Woohoo!!
date wildebeest laughing hyena crocodile weather start end fun guide
1/13/1999 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs
4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña
10/15/2010 3 . 6 rainy 8:12 am no Karl Bruce Gregory
3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang
2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing
7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm No Edwards
<function TextIOWrapper.close()>
filename = dr +"safari_tab_missing.dat"pd.read_table(filename,sep ="\t",skiprows =3,na_values = ['.',' ']) # use \t
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
0
1/13/1999
12
none
2
sunny
7:21 am
4:14 pm
yes
Joshua Tebbs
1
4/28/2001
3
1
1
cloudy
6:25 am
12:33 pm
Y
Edsel Peña
2
10/15/2010
3
NaN
6
rainy
8:12 am
NaN
no
Karl Bruce Gregory
3
3/02/2006
1
14
5
hot/sunny
7:15 am
3:12 pm
y
Lianming Wang
4
2/28/1988
2
6
3
partly cloudy
4:53 am
2:16 pm
Yes
Brian Habing
5
7/14/2015
3
12
0
cloudy
5:47 am
3:46 pm
No
Edwards
We can use the read_fwf function in Pandas to read data in fixed-width format:
filename = dr +"safari_fwf_missing.dat"file=open(filename)contents =file.read()print(contents)file.closepd.read_fwf(filename, skiprows=4, widths=[11,3,5,3,14,9,9,4,18], header =None, # since the header names are not giving in the same columns as the data values names = ['date', 'wildebeest','laughing hyena','crocodile','weather','start','end','fun','guide'])
Some make-believe safari data
Woohoo!!
date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs
4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña
10/15/2010 3 6 rainy 8:12 am no Karl Bruce Gregory
3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang
2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing
7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm No Edwards
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
0
1/13/1999
12
none
2.0
sunny
7:21 am
4:14 pm
yes
Joshua Tebbs
1
4/28/2001
3
1
1.0
cloudy
6:25 am
12:33 pm
Y
Edsel Peña
2
10/15/2010
3
6
NaN
rainy
8:12 am
NaN
no
Karl Bruce Gregory
3
3/02/2006
1
14
5.0
hot/sunny
7:15 am
3:12 pm
y
Lianming Wang
4
2/28/1988
2
6
3.0
partly cloudy
4:53 am
2:16 pm
Yes
Brian Habing
5
7/14/2015
3
12
0.0
cloudy
5:47 am
3:46 pm
No
Edwards
Dates in Python
The prescribed way to work with dates in Python is with the datetime library, which you can read about here.
import datetimetcm = datetime.datetime.today() #(year, month, day, hour, minute, second, microsecond -- millionth of a second)tcm # this current moment
If you need to put in a time zone other than the local one, you need to use the function datetime.now(), which takes a time zone argument. We won’t mess with this.
A datetime object has the year, day, and so one as attributes that we can access with Python’s <object>.<attribute> syntax:
a = datetime.datetime.strptime('2:20 pm','%I:%M %p')print(a)
1900-01-01 14:20:00
b = datetime.datetime.strptime('1/13/1999','%m/%d/%Y')print(b)
1999-01-13 00:00:00
Pandas has some of its own functions for working with time data which are in a Pandas series or data frame. We can convert a Pandas series with character string dates into a series with date time values with the to_datetime function in Pandas. Notice that if no date information is given, the date is set to 1900-01-01.
start = pd.to_datetime(safari.start,format='%I:%M %p') # convert the 12-hour clock times to 24-hour clock timesend = pd.to_datetime(safari.end,format='%I:%M %p') print(end)
To take the difference between the date times in two Pandas series, we can simply use - to subtract one from the other. This creates a timedelta object.
safari['duration'] = end - startsafari
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
duration
0
1/13/1999
12
none
2
sunny
7:21 am
4:14 pm
yes
Joshua Tebbs
0 days 08:53:00
1
4/28/2001
3
1
1
cloudy
6:25 am
12:33 pm
Y
Edsel Peña
0 days 06:08:00
2
10/15/2010
3
NaN
6
rainy
8:12 am
NaN
no
Karl Bruce Gregory
NaT
3
3/02/2006
1
14
5
hot/sunny
7:15 am
3:12 pm
y
Lianming Wang
0 days 07:57:00
4
2/28/1988
2
6
3
partly cloudy
4:53 am
2:16 pm
Yes
Brian Habing
0 days 09:23:00
5
7/14/2015
3
12
0
cloudy
5:47 am
3:46 pm
No
Edwards
0 days 09:59:00
Suppose I want the dates in a different format in the safari data. I can transform them to actual datetime values.
date = pd.to_datetime(safari.date,format='%m/%d/%Y')safari['date'] = datesafari
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
duration
0
1999-01-13
12
none
2
sunny
7:21 am
4:14 pm
yes
Joshua Tebbs
0 days 08:53:00
1
2001-04-28
3
1
1
cloudy
6:25 am
12:33 pm
Y
Edsel Peña
0 days 06:08:00
2
2010-10-15
3
NaN
6
rainy
8:12 am
NaN
no
Karl Bruce Gregory
NaT
3
2006-03-02
1
14
5
hot/sunny
7:15 am
3:12 pm
y
Lianming Wang
0 days 07:57:00
4
1988-02-28
2
6
3
partly cloudy
4:53 am
2:16 pm
Yes
Brian Habing
0 days 09:23:00
5
2015-07-14
3
12
0
cloudy
5:47 am
3:46 pm
No
Edwards
0 days 09:59:00
Now let’s get the weekday of each safari. The weekday function gives Monday through Sunday the values 0 through 6, respectively.
safari['weekday'] = safari.date.dt.weekdaysafari
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
duration
weekday
0
1999-01-13
12
none
2
sunny
7:21 am
4:14 pm
yes
Joshua Tebbs
0 days 08:53:00
2
1
2001-04-28
3
1
1
cloudy
6:25 am
12:33 pm
Y
Edsel Peña
0 days 06:08:00
5
2
2010-10-15
3
NaN
6
rainy
8:12 am
NaN
no
Karl Bruce Gregory
NaT
4
3
2006-03-02
1
14
5
hot/sunny
7:15 am
3:12 pm
y
Lianming Wang
0 days 07:57:00
3
4
1988-02-28
2
6
3
partly cloudy
4:53 am
2:16 pm
Yes
Brian Habing
0 days 09:23:00
6
5
2015-07-14
3
12
0
cloudy
5:47 am
3:46 pm
No
Edwards
0 days 09:59:00
1
Working with text
Let’s learn a little bit about regular expressions. Read more here
import rech ='Karl Bruce Gregory'print(re.sub('Karl','Yarnold',ch)) # replace a substring with another substringprint(re.sub('[aeiou]','*',ch)) # replace vowels with '*'print(re.sub('[a-z]','-',ch)) # replace lower-case letters with '-'print(re.search('go',ch)) # get location of a stringprint(re.findall('r',ch)) # find all 'r'sprint(re.split('a',ch)) # split on 'a'print(re.split('[aeiou]',ch)) # split on vowelsprint(re.findall('[A-Z]',ch)) # Find capital lettersprint(re.search(' ',ch)) # Find first white space character
We can use the str.join function to collapse a list of character strings to a single string, with a character in between:
initials = re.findall('[A-Z]',ch) # get the capital letters'.'.join(initials) +'.'# place a '.' in between
'K.B.G.'
Here is a function which gives a short version of someone’s name:
def nameabb(ch): full = re.split(' ',ch) n =len(full)if n ==1 :return full[0]else: abb =''for i inrange(0,n-1): abb += full[i][0] +'. ' abb += full[n-1]return(abb)print(nameabb('Matthew Frederick Thaddeus Bailey'))print(nameabb('Karl Bruce Gregory'))print(nameabb('Aad W. van der Vaart'))print(nameabb('Snoopy'))
M. F. T. Bailey
K. B. Gregory
A. W. v. d. Vaart
Snoopy
When we are working with a Pandas data frame it will be best to use the Pandas collection of string operations. We’ll give examples of just a few of them. This topic is more exhaustively covered in this book chapter as well as here.
Suppose I want to make a new column in the data set called “overcast” which has the value False when the string “sunny” appears in the weather column and which is true otherwise. Note that we can use ~ to negate the booleans in a boolean Pandas Series (we cannot use not in front). We use the str.contains function from Pandas to get our boolean series.
safari['overcast']=~safari.weather.str.contains('sunny') # ~ turns False to True and True to Falsesafari
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
duration
weekday
overcast
0
1999-01-13
12
none
2
sunny
7:21 am
4:14 pm
yes
Joshua Tebbs
0 days 08:53:00
2
False
1
2001-04-28
3
1
1
cloudy
6:25 am
12:33 pm
Y
Edsel Peña
0 days 06:08:00
5
True
2
2010-10-15
3
NaN
6
rainy
8:12 am
NaN
no
Karl Bruce Gregory
NaT
4
True
3
2006-03-02
1
14
5
hot/sunny
7:15 am
3:12 pm
y
Lianming Wang
0 days 07:57:00
3
False
4
1988-02-28
2
6
3
partly cloudy
4:53 am
2:16 pm
Yes
Brian Habing
0 days 09:23:00
6
True
5
2015-07-14
3
12
0
cloudy
5:47 am
3:46 pm
No
Edwards
0 days 09:59:00
1
True
We can apply a function to all the elemens in a Pandas series with the map function. For example, if we want to abbreviate the names of the guides, we can apply the nameabb function we defined earlier:
safari.guide = safari.guide.map(nameabb)safari
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
duration
weekday
overcast
0
1999-01-13
12
none
2
sunny
7:21 am
4:14 pm
yes
J. Tebbs
0 days 08:53:00
2
False
1
2001-04-28
3
1
1
cloudy
6:25 am
12:33 pm
Y
E. Peña
0 days 06:08:00
5
True
2
2010-10-15
3
NaN
6
rainy
8:12 am
NaN
no
K. B. Gregory
NaT
4
True
3
2006-03-02
1
14
5
hot/sunny
7:15 am
3:12 pm
y
L. Wang
0 days 07:57:00
3
False
4
1988-02-28
2
6
3
partly cloudy
4:53 am
2:16 pm
Yes
B. Habing
0 days 09:23:00
6
True
5
2015-07-14
3
12
0
cloudy
5:47 am
3:46 pm
No
Edwards
0 days 09:59:00
1
True
Suppose we want to clean up the “fun” column and convert the values to booleans. One way to do this is through the use of a dictionary object. A dictionary can be used to create a mapping from the different values appearing in the “fun” series to the values True or False. Then we can use the map function in Pandas to perform the mapping:
If accessing a subset of a Pandas data frame, a good way is to use .loc and then provide the indices, as in the example below. If you try to access a subset using just square brackets as though the data frame were a numpy array, you may have problems. See this page to learn more.
rp = safari['laughing hyena'] =="none"safari.loc[rp,'laughing hyena'] =0safari
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
duration
weekday
overcast
0
1999-01-13
12
0
2
sunny
7:21 am
4:14 pm
True
J. Tebbs
0 days 08:53:00
2
False
1
2001-04-28
3
1
1
cloudy
6:25 am
12:33 pm
True
E. Peña
0 days 06:08:00
5
True
2
2010-10-15
3
NaN
6
rainy
8:12 am
NaN
False
K. B. Gregory
NaT
4
True
3
2006-03-02
1
14
5
hot/sunny
7:15 am
3:12 pm
True
L. Wang
0 days 07:57:00
3
False
4
1988-02-28
2
6
3
partly cloudy
4:53 am
2:16 pm
True
B. Habing
0 days 09:23:00
6
True
5
2015-07-14
3
12
0
cloudy
5:47 am
3:46 pm
False
Edwards
0 days 09:59:00
1
True
Sorting a Pandas data frame
safari.sort_values(by='date')
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
duration
weekday
overcast
4
1988-02-28
2
6
3
partly cloudy
4:53 am
2:16 pm
True
B. Habing
0 days 09:23:00
6
True
0
1999-01-13
12
0
2
sunny
7:21 am
4:14 pm
True
J. Tebbs
0 days 08:53:00
2
False
1
2001-04-28
3
1
1
cloudy
6:25 am
12:33 pm
True
E. Peña
0 days 06:08:00
5
True
3
2006-03-02
1
14
5
hot/sunny
7:15 am
3:12 pm
True
L. Wang
0 days 07:57:00
3
False
2
2010-10-15
3
NaN
6
rainy
8:12 am
NaN
False
K. B. Gregory
NaT
4
True
5
2015-07-14
3
12
0
cloudy
5:47 am
3:46 pm
False
Edwards
0 days 09:59:00
1
True
safari.sort_values(by=['fun','date'])
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
duration
weekday
overcast
2
2010-10-15
3
NaN
6
rainy
8:12 am
NaN
False
K. B. Gregory
NaT
4
True
5
2015-07-14
3
12
0
cloudy
5:47 am
3:46 pm
False
Edwards
0 days 09:59:00
1
True
4
1988-02-28
2
6
3
partly cloudy
4:53 am
2:16 pm
True
B. Habing
0 days 09:23:00
6
True
0
1999-01-13
12
0
2
sunny
7:21 am
4:14 pm
True
J. Tebbs
0 days 08:53:00
2
False
1
2001-04-28
3
1
1
cloudy
6:25 am
12:33 pm
True
E. Peña
0 days 06:08:00
5
True
3
2006-03-02
1
14
5
hot/sunny
7:15 am
3:12 pm
True
L. Wang
0 days 07:57:00
3
False
Note that the data do not stay sorted:
safari
date
wildebeest
laughing hyena
crocodile
weather
start
end
fun
guide
duration
weekday
overcast
0
1999-01-13
12
0
2
sunny
7:21 am
4:14 pm
True
J. Tebbs
0 days 08:53:00
2
False
1
2001-04-28
3
1
1
cloudy
6:25 am
12:33 pm
True
E. Peña
0 days 06:08:00
5
True
2
2010-10-15
3
NaN
6
rainy
8:12 am
NaN
False
K. B. Gregory
NaT
4
True
3
2006-03-02
1
14
5
hot/sunny
7:15 am
3:12 pm
True
L. Wang
0 days 07:57:00
3
False
4
1988-02-28
2
6
3
partly cloudy
4:53 am
2:16 pm
True
B. Habing
0 days 09:23:00
6
True
5
2015-07-14
3
12
0
cloudy
5:47 am
3:46 pm
False
Edwards
0 days 09:59:00
1
True
But we can make it stay sorted by setting inplace = True: