Manipulating data sets

Author

Karl Gregory

Merging data sets

I collected the results from the 2023 and 2024 Table Rock 50k by copying and pasting two tables like this into text files:

The text files tr50k_2023.txt and tr50k_2024.txt can be downloaded from here. They look like this:

readLines(con = "data/tr50k_2024.txt", n = 10)

 [1] "results\t1\tDrew\tMarshall\tCharlotte\tNC\t30\tM\t1\t4:15:08\t93.78"  
 [2] "results\t2\tBrent\tBookwalter\tMontreat\tNC\t40\tM\t2\t4:18:06\t99.61"
 [3] "results\t3\tDevon\tBrodmyer\tBrevard\tNC\t22\tM\t3\t4:42:09\t93.33"   
 [4] "results\t4\tAlan\tGarvick\tLeicester\tNC\t33\tM\t4\t4:54:52\t86.52"   
 [5] "results\t5\tJonathan\tKeller\tWoodruff\tSC\t27\tM\t5\t4:57:10\t85.85" 
 [6] "results\t6\tPeyton\tWhitt\tCandler\tNC\t23\tM\t6\t5:19:43\t80.91"     
 [7] "results\t7\tAli\tCooper\tHendersonville\tNC\t28\tF\t1\t5:23:39\t91.69"
 [8] "results\t8\tTimothy\tGehret\tSeattle\tWA\t41\tM\t7\t5:25:16\t82.25"   
 [9] "results\t9\tAdam\tCorey\tWinterville\tNC\t36\tM\t8\t5:31:55\t77.76"   
[10] "results\t10\tWill\tWick\tCharlotte\tNC\t50\tM\t9\t5:35:50\t75.45"

Below we read in each text file to a data frame with the brilliant and beautiful read.table() function. There is no “header” in this file giving the names of the columns, so if we want nice column names we have to provide them (otherwise they will be named V1, V2, and so on), which we can do with the col.names = option. We can specify the classes of the columns with the colClasses = option; with this option we can also specify which, if any, columns we don’t want read in:

columns <- c(NA,"Place","First","Last", "City", "State", "Age", "Division", "DP", "Time", "Rank")
# put "NULL" if you want to skip a column! We don't want the first one.
classes <- c("NULL", "numeric",rep("character",4),"numeric","character","numeric","character","numeric")

tr50k_2023 <- read.table(file = "data/tr50k_2023.txt", 
                         sep = "\t", # data are tab separated
                         quote = "", # some apparently Irish folk ran this race, with names like O'Something.  The "'" is used by default as a quote character, so this messed things up. Had to specify the option quote = "".
                         col.names = columns,
                         colClasses= classes)

# read the 2024 data in the same way
tr50k_2024 <- read.table(file = "data/tr50k_2024.txt", 
                         sep = "\t",
                         quote = "", 
                         col.names = columns,
                         colClasses= classes)

Since these are fairly large data frames, we may just want to see a small preview of their contents. We can use the head() function to print only the first several rows:

head(tr50k_2023)

  Place   First     Last      City State Age Division DP    Time  Rank
1     1   James Thompson Asheville    NC  39        M  1 4:42:18 93.55
2     2  Robbie    Harms  Fletcher    NC  30        M  2 4:46:22 87.05
3     3   Chris   Ingram Charlotte    NC  43        M  3 4:47:12 81.69
4     4    Alex    Black  Marshall    NC  33        M  4 4:56:19 77.94
5     5    Mark  Rebholz Charlotte    NC  33        M  5 4:56:44 88.10
6     6 William  Connell   Astoria    NY  40        M  6 4:57:45 86.15

head(tr50k_2024)

  Place    First       Last      City State Age Division DP    Time  Rank
1     1     Drew   Marshall Charlotte    NC  30        M  1 4:15:08 93.78
2     2    Brent Bookwalter  Montreat    NC  40        M  2 4:18:06 99.61
3     3    Devon   Brodmyer   Brevard    NC  22        M  3 4:42:09 93.33
4     4     Alan    Garvick Leicester    NC  33        M  4 4:54:52 86.52
5     5 Jonathan     Keller  Woodruff    SC  27        M  5 4:57:10 85.85
6     6   Peyton      Whitt   Candler    NC  23        M  6 5:19:43 80.91

Now, a handful of the long-distance runner types actually ran the Table Rock 50k in 2023 and 2024. Suppose we want to make a data set of just these runners, so we can compare their performances in the two years.

To achieve this we will use the merge() function. The merge function takes as arguments the two data sets as well as a by = argument which specifies which column or columns identify each unique record in the two data sets. The merge() function will return a new data frame containing the by = column (or columns) as well as all the other columns from each data frame; but the new data frame will include only records for identifiers appearing in both data sets. This is the default behavior of the merge() function: You get the intersection of the two data sets, which is the set of records with the identifier appearing in both. This is sometimes called an inner join. One can also specify the suffixes = option to change the way columns from the two data frame are renamed when they are brought into the new data frame.

tr50k_merged <- merge(tr50k_2023,tr50k_2024,
                      by = c("Last","First"), 
                      suffixes = c("_23","_24"))
head(tr50k_merged)

      Last    First Place_23      City_23 State_23 Age_23 Division_23 DP_23
1    Corey     Adam       12  Winterville       NC     35           M    11
2  Cushman     Kati      133    Morganton       NC     40           F    21
3 Donoghue  Patrick       21        Boone       NC     42           M    19
4     Ford     Ryan       67   Greenville       NC     34           M    60
5     Gray Michelle       34 Johnson City       TN     54           F     4
6     Gray     Rick      186 Johnson City       TN     62           M   145
  Time_23 Rank_23 Place_24      City_24 State_24 Age_24 Division_24 DP_24
1 5:29:05   77.76        9  Winterville       NC     36           M     8
2 7:55:59   72.56       64    Morganton       NC     41           F     6
3 5:50:24   74.72       26        Boone       NC     43           M    23
4 6:50:58   60.23      107   Greenville       NC     35           M    87
5 6:17:49   85.22       47 Johnson City       TN     55           F     4
6 9:58:03   64.00      203 Johnson City       TN     63           M   159
   Time_24 Rank_24
1  5:31:55   77.76
2  6:59:52   72.56
3  5:55:23   74.72
4  7:54:12   60.23
5  6:48:08   85.22
6 10:18:03   64.00

That’s a lot of columns. Let’s drop some of them by using the subset() function, as below:

subset(tr50k_merged, select = c("Last","First","Place_23","Place_24","Time_23","Time_24"))

       Last    First Place_23 Place_24 Time_23  Time_24
1     Corey     Adam       12        9 5:29:05  5:31:55
2   Cushman     Kati      133       64 7:55:59  6:59:52
3  Donoghue  Patrick       21       26 5:50:24  5:55:23
4      Ford     Ryan       67      107 6:50:58  7:54:12
5      Gray Michelle       34       47 6:17:49  6:48:08
6      Gray     Rick      186      203 9:58:03 10:18:03
7      Hall   Andrew       44       39 6:29:50  6:25:33
8  Hastings    Liana      108      157 7:33:00  8:41:13
9   Hinshaw     Jake       59       40 6:39:17  6:25:35
10   Jeffus    Tyson       48       32 6:32:04  6:07:53
11     Kira      Rio      141      127 8:01:30  8:13:46
12 McDowell      Amy      173      192 9:15:15  9:44:09
13   Merlin   Daniel       49       15 6:33:04  5:42:52
14   Newell  Timothy       94       55 7:20:10  6:55:19
15       Ng     Ivan       57       76 6:36:15  7:13:36
16  Nielsen    David       25       27 5:55:03  6:01:44
17     Pace   Travis       24       21 5:53:17  5:46:06
18    Peter   Bryson       47       45 6:31:10  6:44:51
19    Steed   Hunter       56       53 6:35:50  6:52:47
20    Steed  Matthew       27       12 5:57:55  5:41:34
21    Wells   Daniel      136      150 7:57:52  8:35:15
22    Wells  Spencer      137      154 7:57:55  8:38:07
23  Wingler    Jared       99       89 7:29:25  7:27:15

We see that 23 runners participated in both the 2023 and 2024 Table Rock 50k. Cool!

Another way to use the merge() function is, instead of keeping only the records in the intersection of the two data frames, to keep all the records from one or the other during the merge. Naming the arguments x = and y = such that the first data set is x and the second is y, we can set the option all.x = T so that all the records from the x data frame are kept. If we do this, we keep all the records in the x data set, even though there will be will be no corresponding identifier in the y data set for some of these. This results in the assignment of missing values to these records in the columns coming from the y data set. This is sometimes called a left outer join.

tr50k_merged_left <- merge(x = tr50k_2023,
                           y = tr50k_2024,
                           by = c("Last","First"),
                           all.x = T,
                           suffixes = c("_23","_24"))
head(tr50k_merged_left)

    Last    First Place_23      City_23 State_23 Age_23 Division_23 DP_23
1  Abreo   Justin      115 Fayetteville       NC     31           M   101
2 Adcock Jonathan      104       Stuart       VA     41           M    93
3 Adcock      Reb      153       Stuart       VA     43           F    27
4  Allen     Alex      160     Hoschton       GA     33           M   129
5  Annas     Matt      177    Morganton       NC     38           M   140
6  Annas   Torrey      176    Morganton       NC     37           F    37
  Time_23 Rank_23 Place_24 City_24 State_24 Age_24 Division_24 DP_24 Time_24
1 7:40:15   71.69       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
2 7:31:44   69.70       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
3 8:19:49   72.87       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
4 8:34:22   51.70       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
5 9:22:14   52.02       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
6 9:22:13   59.13       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
  Rank_24
1      NA
2      NA
3      NA
4      NA
5      NA
6      NA

We can likewise do a right outer join, which keeps all the records in the y data set.

tr50k_merged_right <- merge(x = tr50k_2023,
                            y = tr50k_2024,
                            by = c("Last","First"),
                            all.y = T,
                            suffixes = c("_23","_24"))

head(tr50k_merged_right)

      Last     First Place_23 City_23 State_23 Age_23 Division_23 DP_23 Time_23
1   Alford Nathaniel       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
2    Amore      Tony       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
3 Atkinson     James       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
4    Baker  Patricia       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
5  Barrett    Andrew       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
6    Beard  Jennifer       NA    <NA>     <NA>     NA        <NA>    NA    <NA>
  Rank_23 Place_24         City_24 State_24 Age_24 Division_24 DP_24 Time_24
1      NA       41    Johnson City       TN     24           M    38 6:26:30
2      NA       16       Lancaster       SC     44           M    15 5:43:16
3      NA      140 Fort Eisenhower       GA     44           M   112 8:29:43
4      NA      152         Raleigh       NC     33           F    33 8:36:20
5      NA      169     Chapel Hill       NC     57           M   132 8:56:42
6      NA      103         Sanford       NC     42           F    18 7:49:28
  Rank_24
1   71.60
2   80.91
3   51.63
4   62.68
5   73.90
6   66.96

Stacking data sets

We may want to combine two data sets by simply stacking one on top of the other. This is possible when the data sets have the same columns. In our case, we would like to keep track of which records came from the 2023 data and which came from the 2024 data. The code below adds a column to each data set with entries giving the year, either 2023 or 2024. We can then use the rbind() function (we used it before to stack matrices) to stack the two data sets:

tr50k_2023$Year <- 2023
tr50k_2024$Year <- 2024
tr50k <- rbind(tr50k_2023,tr50k_2024) # here is the stacking with rbind()
tr50k <- subset(tr50k,select = c("Place","Time","Last","First","Year")) # keep only certain columns
head(sort_by(tr50k,~Place)) # let's sort the data by place

    Place    Time       Last  First Year
1       1 4:42:18   Thompson  James 2023
188     1 4:15:08   Marshall   Drew 2024
2       2 4:46:22      Harms Robbie 2023
189     2 4:18:06 Bookwalter  Brent 2024
3       3 4:47:12     Ingram  Chris 2023
190     3 4:42:09   Brodmyer  Devon 2024

Reshaping a data set

Another way in which we might want to manipulate a data set is to “reshape” it. By “reshaping”, one typically means doing a long to wide or a wide to long transformation of the data set.

Long to wide

The tr50k data set contains all the Table Rock 50k data from the years 2023 and 2024, where the number of rows is equal to the total number of finishers of these two races. Suppose we would like construct a table for comparing the finishing times of the 1st-place finishers in the two years as well as the 2nd-place finishers in the two years and so on. That is, suppose we want a table where each row corresponds to a place and in each row we can see who finished at that place and with what finishing time in the year 2023 as well as in the year 2024.

We can achieve this with the reshape() function in R. This function was designed with longitudinal data in mind, which are data collected over time. Consider a “toy” experiment involving two subjects, where a response is measured on each subject at four observation times. The data might be presented in a table like this:

toy_data <- data.frame(Subject = c(1,1,1,1,2,2,2,2),
                       Observation = c(1,2,3,4,1,2,3,4), 
                       Response = rnorm(8))
toy_data

  Subject Observation   Response
1       1           1 -1.2914822
2       1           2 -0.5079160
3       1           3 -2.4747996
4       1           4  0.1384790
5       2           1  0.4772520
6       2           2 -1.8414791
7       2           3 -0.5111206
8       2           4 -0.2685817

If we want instead to represent the data in a table having one row for each subject and four columns giving the four observations, we can obtain it with the reshape() function as follows:

wide <- reshape(toy_data,
                idvar = "Subject",
                timevar = "Observation", # indexes the responses
                sep = " ", # try different characters here and see what happens!
                direction = "wide", # we want to go from long to wide
                new.row.names = 1:2) # keep the row names from getting wacky
wide

  Subject Response 1 Response 2 Response 3 Response 4
1       1  -1.291482  -0.507916 -2.4747996  0.1384790
2       2   0.477252  -1.841479 -0.5111206 -0.2685817

Now let’s try to make the table we want for the Table Rock 50k data. We’ll use the reshape() function, treating the place column as the “subject”:

tr50k_wide <- reshape(tr50k,
                      idvar = "Place",
                      timevar = "Year",
                      sep = " ",
                      direction = "wide")
                        
head(tr50k_wide)

  Place Time 2023 Last 2023 First 2023 Time 2024  Last 2024 First 2024
1     1   4:42:18  Thompson      James   4:15:08   Marshall       Drew
2     2   4:46:22     Harms     Robbie   4:18:06 Bookwalter      Brent
3     3   4:47:12    Ingram      Chris   4:42:09   Brodmyer      Devon
4     4   4:56:19     Black       Alex   4:54:52    Garvick       Alan
5     5   4:56:44   Rebholz       Mark   4:57:10     Keller   Jonathan
6     6   4:57:45   Connell    William   5:19:43      Whitt     Peyton

Now we have the desired table in “wide” format, which is handy for seeing the time you would need to get in each year to finish in a certain place.

Wide to long

Going back to our toy data example for a moment, the code below shows how we can go from the wide format to a long format:

long <- reshape(wide,
                idvar = "Subject",
                varying = 2:5, # columns with the measurements across time
                timevar = "Observation",
                sep = " ",
                times = 1:4,
                direction = "long", # we want to go from wide to long
                new.row.names = 1:8) # keep the row names from getting wacky
long

  Subject Observation   Response
1       1           1 -1.2914822
2       2           1  0.4772520
3       1           2 -0.5079160
4       2           2 -1.8414791
5       1           3 -2.4747996
6       2           3 -0.5111206
7       1           4  0.1384790
8       2           4 -0.2685817

We see that things are sorted in a different way, but it is the same data set as the original one.

We can reshape the Table Rock data from wide to long with the following:

tr50k_long <- reshape(tr50k_wide,
                      idvar = "Place",
                      varying = 2:7,
                      sep = " ",
                      timevar = "Year",
                      times = c("2023","2024"),
                      new.row.names = 1:1e4,
                      direction = "long")
head(tr50k_long)

  Place Year    Time     Last   First
1     1 2023 4:42:18 Thompson   James
2     2 2023 4:46:22    Harms  Robbie
3     3 2023 4:47:12   Ingram   Chris
4     4 2023 4:56:19    Black    Alex
5     5 2023 4:56:44  Rebholz    Mark
6     6 2023 4:57:45  Connell William

Aggregating data

Focusing just on the 2024 Table Rock 50k, suppose we want to know how many runners from each state participated in each division as well as the best performance in each division from each state. We can use the aggregate function to count the number of participants and to find the minimum division placement at each combination of the division and state columns.

a1 <- aggregate(DP ~ Division + State, 
                data = tr50k_2024, # specify which data frame
                FUN = length) # use length() function to count how many participants

a2 <- aggregate(DP ~ Division + State, 
                data = tr50k_2024, 
                FUN = min) # use min() function to get the smallest DP value

# merge the two data frames produced by the aggregate() function
DivState24 <- merge(a1,a2, by = c("Division","State")) 

# give columns nice names
colnames(DivState24)[3:4] <- c("Runners","Best DP") 

DivState24

   Division State Runners Best DP
1         F    FL       1      27
2         F    GA       2       3
3         F    NC      32       1
4         F    OH       1      36
5         F    SC       1      23
6         F    TN       5       4
7         F    VA       2      13
8         F    WI       1      20
9         M    AL       2      79
10        M    CO       1     150
11        M    FL       9      59
12        M    GA       4      52
13        M    IN       1      16
14        M    MD       1      84
15        M    MN       1     134
16        M    NC      99       1
17        M    OH       1      80
18        M    PA       3     105
19        M    SC      19       5
20        M    TN      11      38
21        M    TX       2      27
22        M    VA       5      43
23        M    WA       1       7

Practice

Practice writing code and anticipating the output of code with the following exercises.