Manipulating data sets

Author

Karl Gregory

We’ll continue using the Pandas package for everything here.

import pandas as pd

dr = '/Users/karlgregory/Library/CloudStorage/OneDrive-UniversityofSouthCarolina/myBooks/compstat/data/'

trcols = ['NA','Place','First','Last','City','State','Age','Division','DP','Time','Rank']
tr23 = pd.read_table(dr + 'tr50k_2023.txt',header = None,names = trcols,usecols = range(1,len(trcols)))
tr24 = pd.read_table(dr + 'tr50k_2024.txt',header = None,names = trcols,usecols = range(1,len(trcols)))
tr24
Place First Last City State Age Division DP Time Rank
0 1 Drew Marshall Charlotte NC 30 M 1 4:15:08 93.78
1 2 Brent Bookwalter Montreat NC 40 M 2 4:18:06 99.61
2 3 Devon Brodmyer Brevard NC 22 M 3 4:42:09 93.33
3 4 Alan Garvick Leicester NC 33 M 4 4:54:52 86.52
4 5 Jonathan Keller Woodruff SC 27 M 5 4:57:10 85.85
... ... ... ... ... ... ... ... ... ... ...
200 201 Jessica Jasinski LINCOLNTON NC 35 F 44 10:12:28 62.33
201 202 Matt Kirchner Atlanta GA 41 M 158 10:12:28 51.43
202 203 Rick Gray Johnson City TN 63 M 159 10:18:03 64.00
203 204 Kimber Jones Mars Hill NC 33 F 45 10:24:42 56.73
204 205 Brandon King Weaverville NC 36 M 160 10:24:44 68.37

205 rows × 10 columns

In R, to read names like O’something, we had to specify that ' should not be interpreted as a quoting character. The Pandas read_table function uses " by default, so our Irish names got read in okay this time.

ir = tr24['Last'].str.match(".*['].*") # are there any names with an apostrophe?
tr24[ir]
Place First Last City State Age Division DP Time Rank
32 33 Jack O'Hara Asheville NC 28 M 30 6:09:57 68.96
41 42 Collin O'Berry Asheville NC 39 M 39 6:29:01 73.81
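If a file ever did use a troublesome quote character, the quoting and quotechar arguments of read_table/read_csv control this. A minimal sketch with made-up rows (not the race file), where quoting=csv.QUOTE_NONE disables quote handling entirely, much like quote = "" in R's read.table:

```python
import csv
import io
import pandas as pd

# hypothetical tab-separated rows with apostrophes in the names
raw = "33\tJack\tO'Hara\n42\tCollin\tO'Berry\n"
names_df = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                       names=['Place', 'First', 'Last'],
                       quoting=csv.QUOTE_NONE)  # treat all quote characters as ordinary text
```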

Merging data sets

tr_merged0 = pd.merge(tr23, tr24, on = ['First', 'Last'], suffixes = [' 2023',' 2024'])
tr_merged = tr_merged0[['First', 'Last', 'Time 2023', 'Time 2024']]
tr_merged
First Last Time 2023 Time 2024
0 Adam Corey 5:29:05 5:31:55
1 Patrick Donoghue 5:50:24 5:55:23
2 Travis Pace 5:53:17 5:46:06
3 David Nielsen 5:55:03 6:01:44
4 Matthew Steed 5:57:55 5:41:34
5 Michelle Gray 6:17:49 6:48:08
6 Andrew Hall 6:29:50 6:25:33
7 Bryson Peter 6:31:10 6:44:51
8 Tyson Jeffus 6:32:04 6:07:53
9 Daniel Merlin 6:33:04 5:42:52
10 Hunter Steed 6:35:50 6:52:47
11 Ivan Ng 6:36:15 7:13:36
12 Jake Hinshaw 6:39:17 6:25:35
13 Ryan Ford 6:50:58 7:54:12
14 Timothy Newell 7:20:10 6:55:19
15 Jared Wingler 7:29:25 7:27:15
16 Liana Hastings 7:33:00 8:41:13
17 Kati Cushman 7:55:59 6:59:52
18 Daniel Wells 7:57:52 8:35:15
19 Spencer Wells 7:57:55 8:38:07
20 Rio Kira 8:01:30 8:13:46
21 Amy McDowell 9:15:15 9:44:09
22 Rick Gray 9:58:03 10:18:03

You can do left joins and right joins with the merge function in Pandas as well. Consult the help file!
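As a sketch with toy frames (made up here, not the race results), the how argument selects the join type; a left join keeps every row of the left frame and fills unmatched columns with NaN:

```python
import pandas as pd

left = pd.DataFrame({'Last': ['Gray', 'Steed', 'Ng'],
                     'Time 2023': ['9:58:03', '5:57:55', '6:36:15']})
right = pd.DataFrame({'Last': ['Gray', 'Steed'],
                      'Time 2024': ['10:18:03', '5:41:34']})

both = pd.merge(left, right, on='Last')                  # inner join: runners in both years
everyone = pd.merge(left, right, on='Last', how='left')  # left join: all left rows, NaN where unmatched
```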

Concatenating data sets

Let’s read the data in from all the years, 2011 through 2024, using a loop, combining all the data into one large data frame.

li = list() # make a list of data frames to concatenate into one big one
for i in range(11,25):
    
    file = dr + 'tr50k_20' + str(i) + '.txt'
    tryr = pd.read_table(file,header = None,names = trcols,usecols = range(1,len(trcols)))
    tryr['Year'] = '20' + str(i)
    li.append(tryr) # append data frame to the list of data frames

tr = pd.concat(li)
tr
Place First Last City State Age Division DP Time Rank Year
0 1 Wes Kessenich Waxhaw NC 49 M 1 4:23:08 93.77 2011
1 2 Matt Broughton Sherrills Ford NC 45 M 2 4:23:23 92.79 2011
2 3 Dwight Winters Morganton NC 35 M 3 4:47:02 95.84 2011
3 4 Chris Squires Raleigh NC 46 M 4 4:57:16 79.09 2011
4 5 Tammy Tribett Manassas VA 39 F 1 4:59:00 75.58 2011
... ... ... ... ... ... ... ... ... ... ... ...
200 201 Jessica Jasinski LINCOLNTON NC 35 F 44 10:12:28 62.33 2024
201 202 Matt Kirchner Atlanta GA 41 M 158 10:12:28 51.43 2024
202 203 Rick Gray Johnson City TN 63 M 159 10:18:03 64.00 2024
203 204 Kimber Jones Mars Hill NC 33 F 45 10:24:42 56.73 2024
204 205 Brandon King Weaverville NC 36 M 160 10:24:44 68.37 2024

2012 rows × 11 columns
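Note in the output above that the row labels 200 through 204 reappear, because concat keeps each frame's original index. If a fresh 0 to n-1 index is preferred, pass ignore_index=True; a small sketch with toy frames:

```python
import pandas as pd

a = pd.DataFrame({'Place': [1, 2]})
b = pd.DataFrame({'Place': [1, 2]})

stacked = pd.concat([a, b])                       # row labels repeat: 0, 1, 0, 1
relabeled = pd.concat([a, b], ignore_index=True)  # fresh labels: 0, 1, 2, 3
```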

Aggregating data

Let’s work towards getting the average time in each division for each year of the race. First we’ll need to convert the times to seconds, because right now they are just character strings. Below we define a function to make the conversion and then add a new column to the data set containing the times in seconds.

def secs(ch):
    ch2 = ch.replace("`","") # for a tie, the character "`" is added, which we need to remove
    h, m, s = ch2.split(":") # split a character into multiple strings based on a splitting character
    return int(h)*3600 + int(m) * 60 + int(s) # can just put return in front of an expression

tr['Seconds'] = tr['Time'].apply(secs)
tr
Place First Last City State Age Division DP Time Rank Year Seconds
0 1 Wes Kessenich Waxhaw NC 49 M 1 4:23:08 93.77 2011 15788
1 2 Matt Broughton Sherrills Ford NC 45 M 2 4:23:23 92.79 2011 15803
2 3 Dwight Winters Morganton NC 35 M 3 4:47:02 95.84 2011 17222
3 4 Chris Squires Raleigh NC 46 M 4 4:57:16 79.09 2011 17836
4 5 Tammy Tribett Manassas VA 39 F 1 4:59:00 75.58 2011 17940
... ... ... ... ... ... ... ... ... ... ... ... ...
200 201 Jessica Jasinski LINCOLNTON NC 35 F 44 10:12:28 62.33 2024 36748
201 202 Matt Kirchner Atlanta GA 41 M 158 10:12:28 51.43 2024 36748
202 203 Rick Gray Johnson City TN 63 M 159 10:18:03 64.00 2024 37083
203 204 Kimber Jones Mars Hill NC 33 F 45 10:24:42 56.73 2024 37482
204 205 Brandon King Weaverville NC 36 M 160 10:24:44 68.37 2024 37484

2012 rows × 12 columns

Here we define a function which will convert a number of seconds into a time in the form hh:mm:ss:

import time

def secs2time(x):
    return time.strftime("%H:%M:%S",time.gmtime(x)) # interpret x as seconds after 00:00:00, Jan 1, 1970 GMT, then format as hh:mm:ss
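As a quick sanity check, secs and secs2time should invert each other; this snippet restates both functions so that it runs on its own:

```python
import time

def secs(ch):
    h, m, s = ch.replace("`", "").split(":")
    return int(h) * 3600 + int(m) * 60 + int(s)

def secs2time(x):
    return time.strftime("%H:%M:%S", time.gmtime(x))

secs("4:23:08")             # 15788, the 2011 winning time in seconds
secs2time(secs("4:23:08"))  # '04:23:08'
```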

Now we build a pivot table giving the mean number of seconds in each division for each year, and apply the function defined above to the resulting Seconds series using map:

tm0 = tr.pivot_table(values = 'Seconds', index = ['Year','Division'], aggfunc = {'Seconds' : 'mean'})
tm0['Ave. Time'] = tm0.Seconds.map(secs2time)  # use .map to apply the function
tm = tm0.reset_index(None)                     # the pivot_table creates multi-indexes that you can get rid of with .reset_index(None)
tm.drop('Seconds', axis = 1)                   # get rid of the 'Seconds' series
Year Division Ave. Time
0 2011 F 06:42:20
1 2011 M 06:10:06
2 2012 F 06:24:39
3 2012 M 06:04:34
4 2013 F 06:38:18
5 2013 M 05:47:18
6 2014 F 07:44:31
7 2014 M 07:51:12
8 2015 F 08:12:44
9 2015 M 07:45:11
10 2016 F 07:48:02
11 2016 M 07:49:02
12 2017 F 08:07:43
13 2017 M 07:39:59
14 2018 F 08:39:39
15 2018 M 08:02:19
16 2019 F 07:57:04
17 2019 M 07:26:02
18 2020 F 08:08:10
19 2020 M 06:59:01
20 2021 F 08:05:19
21 2021 M 07:24:56
22 2022 F 07:53:16
23 2022 M 07:20:23
24 2023 F 08:01:09
25 2023 M 07:07:56
26 2024 F 08:09:41
27 2024 M 07:36:25
age = tr.pivot_table(values = 'Age', index = ['Year','Division'], aggfunc = {'Age' : ['min','median','max']}) # min, median, and max are recognized built-in functions
age
max median min
Year Division
2011 F 64 39.0 23
M 67 40.0 21
2012 F 65 34.0 25
M 68 43.0 25
2013 F 56 44.0 23
M 63 38.5 22
2014 F 44 35.5 29
M 63 36.0 4
2015 F 52 35.0 22
M 70 38.5 21
2016 F 54 36.0 23
M 71 38.0 21
2017 F 55 37.0 25
M 72 39.0 22
2018 F 59 38.0 25
M 73 40.0 19
2019 F 58 36.5 25
M 73 40.0 19
2020 F 45 40.0 22
M 74 41.0 20
2021 F 52 39.0 24
M 72 39.0 20
2022 F 55 39.0 25
M 72 38.0 20
2023 F 59 38.0 22
M 73 39.0 20
2024 F 65 36.0 20
M 78 40.0 19
import numpy as np

def q10(x):
    return np.quantile(x,.10) # 10th percentile of the times, i.e. the 90th-percentile performance
def q05(x):
    return np.quantile(x,.05) # 5th percentile of the times, i.e. the 95th-percentile performance
    
tr_pctl = tr.pivot_table(values = 'Seconds', 
                         index = ['Year','Division'],
                         aggfunc = {'Seconds' : [('95th percentile',q05), ('90th percentile',q10),('Best','min')]})
tr_pctl
90th percentile 95th percentile Best
Year Division
2011 F 18287.0 18231.00 17940
M 18184.0 17283.40 15788
2012 F 20267.9 19718.00 18821
M 17665.6 16306.20 14590
2013 F 17743.0 17102.50 16462
M 17297.8 15944.70 15679
2014 F 21416.9 20971.95 20527
M 20559.3 20009.55 18154
2015 F 22611.2 19162.80 18684
M 21807.1 21071.15 19154
2016 F 22070.6 20839.60 18785
M 21497.1 19780.10 16653
2017 F 24213.4 21517.30 18183
M 21000.0 18796.20 15750
2018 F 25463.9 23127.50 18178
M 23219.0 21220.00 18056
2019 F 22719.1 21845.65 18665
M 21513.4 20295.00 15489
2020 F 23341.6 22589.60 22548
M 19441.0 18340.80 16017
2021 F 23155.4 20972.40 19461
M 21565.4 19851.10 17192
2022 F 24208.1 23461.45 18537
M 21095.0 20096.75 16681
2023 F 23428.1 21265.85 19743
M 20264.0 19204.80 16938
2024 F 24923.8 21871.20 19419
M 20637.0 20138.25 15308

Note that with pivot tables, Pandas introduces what is called a multi-index, which is just a hierarchical index. In this example we have the division within the year, so to access a row of the above table, we need to give both the year and the division as a tuple:

tr_pctl.loc[('2011','M'),] # access rows of this with a tuple
90th percentile    18184.0
95th percentile    17283.4
Best               15788.0
Name: (2011, M), dtype: float64
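Besides indexing with a tuple, the xs method pulls out a cross-section at one level of a multi-index. A sketch with a small hand-built frame (the Best values below are copied from the table above):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['2011', '2012'], ['F', 'M']],
                                 names=['Year', 'Division'])
best = pd.DataFrame({'Best': [17940, 15788, 18821, 14590]}, index=idx)

best.loc[('2011', 'M')]         # one row, selected with a (Year, Division) tuple
best.xs('M', level='Division')  # all years for one division
```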

Let’s plot these quantiles over the years of the Table Rock 50k for each division.

tr_pctl_M = tr_pctl.loc[(slice(None),'M'),] # all years for Division 'M'
tr_pctl_F = tr_pctl.loc[(slice(None),'F'),] # all years for Division 'F'
tr_pctl_M.plot()
tr_pctl_F.plot() # would be nice to re-label these axes with hh:mm:ss times
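On the re-labeling aside: a matplotlib FuncFormatter built on the same strftime trick as secs2time can render the y-axis ticks as hh:mm:ss. A sketch with a few seconds values made up for illustration rather than the tr_pctl frames:

```python
import time
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

hhmmss = FuncFormatter(lambda x, pos: time.strftime("%H:%M:%S", time.gmtime(max(x, 0))))

fig, ax = plt.subplots()
ax.plot([2011, 2012, 2013], [18287, 20268, 17743])  # seconds values for illustration
ax.yaxis.set_major_formatter(hhmmss)                # ticks now display as hh:mm:ss
```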

Reshaping data sets

Long to wide

Let’s focus on the top 3 finishers each year.

pod = tr['Place'] <= 3
tr_pod = tr[pod]
tr_pod.pivot(index = 'Place', values = 'Time', columns='Year')
Year 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
Place
1 4:23:08 4:03:10 4:21:19 5:02:34 5:11:24 4:37:33 4:22:30 5:00:56 4:18:09 4:26:57 4:46:32 4:38:01 4:42:18 4:15:08
2 4:23:23 4:31:35 4:22:00 5:23:32 5:19:14 4:52:18 4:41:06 5:01:33 5:01:42 4:34:21 4:47:37 4:51:39 4:46:22 4:18:06
3 4:47:02 4:31:51 4:34:22 5:34:01 5:20:16 4:54:49 4:54:47 5:02:58 5:07:24 4:49:28 4:50:14 5:08:57 4:47:12 4:42:09

Now let’s look at the top 3 places in each division in each year:

dpod = tr['DP'] <= 3
tr_dpod = tr[dpod]
tr_dpod.pivot(index = ['DP','Division'], values = 'Time', columns='Year')
Year 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
DP Division
1 F 4:59:00 5:13:41 4:34:22 5:42:07 5:11:24 5:13:05 5:03:03 5:02:58 5:11:05 6:15:48 5:24:21 5:08:57 5:29:03 5:23:39
M 4:23:08 4:03:10 4:21:19 5:02:34 5:19:14 4:37:33 4:22:30 5:00:56 4:18:09 4:26:57 4:46:32 4:38:01 4:42:18 4:15:08
2 F 5:03:51 5:36:41 4:55:43 5:55:36 5:20:16 5:20:00 5:28:32 5:19:09 5:50:53 6:16:40 5:45:59 6:23:25 5:30:29 5:51:21
M 4:23:23 4:31:35 4:22:00 5:23:32 5:29:05 4:52:18 4:41:06 5:01:33 5:01:42 4:34:21 4:47:37 4:51:39 4:46:22 4:18:06
3 F 5:04:47 5:40:24 5:33:20 6:09:05 6:31:00 5:45:57 5:32:01 6:00:39 5:57:34 6:37:16 5:47:59 6:32:22 5:53:12 5:53:37
M 4:47:02 4:31:51 4:46:58 5:34:01 5:37:52 4:54:49 4:54:47 5:09:13 5:07:24 4:49:28 4:50:14 5:12:04 4:47:12 4:42:09

Wide to long

The wide_to_long function can be used to convert a data set in “wide” format to “long” format. To illustrate how to do this, let’s convert our wide dataset showing the times of the top three finishers in each division in each year back into long format. In order to do that, we will need to “flatten out” the multi-indices created in the pivot operation.

wide0 = pd.pivot(tr_dpod,index = ['DP','Division'], values = 'Time', columns='Year')
wide = wide0.rename_axis(columns=None).reset_index() # remove the multi-indices
wide
DP Division 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
0 1 F 4:59:00 5:13:41 4:34:22 5:42:07 5:11:24 5:13:05 5:03:03 5:02:58 5:11:05 6:15:48 5:24:21 5:08:57 5:29:03 5:23:39
1 1 M 4:23:08 4:03:10 4:21:19 5:02:34 5:19:14 4:37:33 4:22:30 5:00:56 4:18:09 4:26:57 4:46:32 4:38:01 4:42:18 4:15:08
2 2 F 5:03:51 5:36:41 4:55:43 5:55:36 5:20:16 5:20:00 5:28:32 5:19:09 5:50:53 6:16:40 5:45:59 6:23:25 5:30:29 5:51:21
3 2 M 4:23:23 4:31:35 4:22:00 5:23:32 5:29:05 4:52:18 4:41:06 5:01:33 5:01:42 4:34:21 4:47:37 4:51:39 4:46:22 4:18:06
4 3 F 5:04:47 5:40:24 5:33:20 6:09:05 6:31:00 5:45:57 5:32:01 6:00:39 5:57:34 6:37:16 5:47:59 6:32:22 5:53:12 5:53:37
5 3 M 4:47:02 4:31:51 4:46:58 5:34:01 5:37:52 4:54:49 4:54:47 5:09:13 5:07:24 4:49:28 4:50:14 5:12:04 4:47:12 4:42:09

Now the year columns are ordinary column names, which is what we will need in order to use the wide_to_long function.

long = pd.wide_to_long(wide,stubnames = '', sep = "", i = ['DP','Division'], j = 'Year') # an empty stub name lets the all-digit year columns supply the values of 'Year'
long
DP Division Year
1 F 2011 4:59:00
2012 5:13:41
2013 4:34:22
2014 5:42:07
2015 5:11:24
... ... ... ...
3 M 2020 4:49:28
2021 4:50:14
2022 5:12:04
2023 4:47:12
2024 4:42:09

84 rows × 1 columns
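An alternative is melt, which goes from wide to long without the stub-name machinery; a sketch with a small frame holding the division winners' times from 2023 and 2024:

```python
import pandas as pd

wide = pd.DataFrame({'DP': [1, 1], 'Division': ['F', 'M'],
                     '2023': ['5:29:03', '4:42:18'],
                     '2024': ['5:23:39', '4:15:08']})

long = wide.melt(id_vars=['DP', 'Division'], var_name='Year', value_name='Time')
```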