Importing and formatting data

Author

Karl Gregory

Importing data tends to be a large part of the working statistician’s job (for good or ill). Here we’ll learn some tools for reading data sets into R when these are stored in a plain text file, sometimes with the extensions .dat,.txt, or .csv. The extension doesn’t really matter; we consider any file that can be opened by a simple text editor.

Reading data

Typically data are stored in text files where each row corresponds to a row in a spreadsheet, and the values are separated by some character, very often a comma. If not a comma, it may be a tab (long space) or other delimiter. If the values are not delimited by a special character, they may be in so-called fixed-width format, which we describe in the following.

Comma-separated

Suppose a data file looks like this:

date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999,12,none,2,sunny,7:21 am,4:14 pm,yes,Joshua Tebbs
4/28/2001,3,1,1,cloudy,6:25 am,12:33 pm,Y,Edsel Peña
10/15/2010,3,0,6,rainy,8:12 am,11:34 am,no,Karl Bruce Gregory
3/02/2006,1,14,5,hot/sunny,7:15 am,3:12 pm,y,Lianming Wang
2/28/1988,2,6,3,partly cloudy,4:53 am,2:16 pm,Yes,Brian Habing
7/14/2015,3,12,0,cloudy,5:47 am,3:46 pm,No,Edwards

We see that the first row appears to give column names, and that the values are separated by commas. We can read this into R with the read.table() function. We specify the options sep = "," since the values are separated by the character , and put header = T so that the first row is assumed to contain column names:

safari <- read.table(file = "data/safari_comma.dat", 
                     sep = ",", 
                     header = TRUE)
safari
        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3              0         6         rainy 8:12 am 11:34 am
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Sometimes the data do not begin on the first row of the text file. Suppose the file looked like this:

Some make-believe safari data
Woohoo!!

date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999,12,none,2,sunny,7:21 am,4:14 pm,yes,Joshua Tebbs
4/28/2001,3,1,1,cloudy,6:25 am,12:33 pm,Y,Edsel Peña
10/15/2010,3,.,6,rainy,8:12 am,,no,Karl Bruce Gregory
3/02/2006,1,14,5,hot/sunny,7:15 am,3:12 pm,y,Lianming Wang
2/28/1988,2,6,3,partly cloudy,4:53 am,2:16 pm,Yes,Brian Habing
7/14/2015,3,12,0,cloudy,5:47 am,3:46 pm,No,Edwards

Now, in addition to the fact that the data do not begin on line 1 of the file, we see that there are a couple of missing values in this version of the data. The statistician is forever coping with missing data. Here we must specify which characters should be taken as missing values. These we specify with the option na.strings = as shown below. In order to skip the first three lines of data, we use the skip = option.

safari <- read.table(file = "data/safari_comma_missing.dat", 
                     sep = ",", 
                     header = TRUE,
                     skip = 3,
                     na.strings = c("","."))
safari
        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3           <NA>         6         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

In the special case that the data values are separated by commas and the first line contains column names, we can use the function read.csv() to achieve the same result. It is simply a “wrapper” for the read.table() function, which means that it simple executes the latter with certain options already specified, such as sep = "," and header = T. Data files with data values separated by commas often have the extension .csv.

safari <- read.csv(file = "data/safari_comma_missing.dat", 
                   skip = 3,
                   na.strings = c("","."))
safari
        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3           <NA>         6         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Other-delimited

As we have said, sometimes the data values are not delimited by a comma. For example, we might have a data file like this:

Some make-believe safari data
Woohoo!!

date;wildebeest;laughing hyena;crocodile;weather;start;end;fun;guide
1/13/1999;12;none;2;sunny;7:21 am;4:14 pm;yes;Joshua Tebbs
4/28/2001;3;1;1;cloudy;6:25 am;12:33 pm;Y;Edsel Peña
10/15/2010;3;.;6;rainy;8:12 am;;no;Karl Bruce Gregory
3/02/2006;1;14;5;hot/sunny;7:15 am;3:12 pm;y;Lianming Wang
2/28/1988;2;6;3;partly cloudy;4:53 am;2:16 pm;Yes;Brian Habing
7/14/2015;3;12;0;cloudy;5:47 am;3:46 pm;No;Edwards

To read this in, we have only to specify the delimiter with sep = ";":

safari <- read.table(file = "data/safari_semi_missing.dat", 
                     sep = ";", 
                     header = TRUE,
                     skip = 3,
                     na.strings = c("","."))
safari
        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3           <NA>         6         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Data are often tab delimited, looking like this:

Some make-believe safari data
Woohoo!!

date    wildebeest  laughing hyena  crocodile   weather start   end fun guide
1/13/1999   12  none    2   sunny   7:21 am 4:14 pm yes Joshua Tebbs
4/28/2001   3   1   1   cloudy  6:25 am 12:33 pm    Y   Edsel Peña
10/15/2010  3   .   6   rainy   8:12 am     no  Karl Bruce Gregory
3/02/2006   1   14  5   hot/sunny   7:15 am 3:12 pm y   Lianming Wang
2/28/1988   2   6   3   partly cloudy   4:53 am 2:16 pm Yes Brian Habing
7/14/2015   3   12  0   cloudy  5:47 am 3:46 pm No  Edwards

To specify the tab as the delimiting character, one must put sep = "\t":

safari <- read.table(file = "data/safari_tab_missing.dat", 
                     sep = "\t", 
                     header = TRUE,
                     skip = 3,
                     na.strings = c("","."))
safari
        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3           <NA>         6         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Fixed-width

Sometimes the data values are not separated by any special delimiting character, but rather arranged in a format such that the values belonging to a column are always place in the same columns of text characters. This is called a fixed-width format. Here is an example:

writeLines(readLines(con = "data/safari_fwf_missing.dat"))
Some make-believe safari data
Woohoo!!

date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999  12 none 2  sunny         7:21 am  4:14 pm  yes Joshua Tebbs
4/28/2001  3  1    1  cloudy        6:25 am  12:33 pm Y   Edsel Peña
10/15/2010 3  6       rainy         8:12 am           no  Karl Bruce Gregory
3/02/2006  1  14   5  hot/sunny     7:15 am  3:12 pm  y   Lianming Wang
2/28/1988  2  6    3  partly cloudy 4:53 am  2:16 pm  Yes Brian Habing
7/14/2015  3  12   0  cloudy        5:47 am  3:46 pm  No  Edwards

We can use the read.fwf() function to read this file. We must specify how many characters are alloted to each column using the widths = option. Also important is the strip.white=T option, which will remove from character strings the extra spaces that are read in when reading fixed-width data:

safari <- read.fwf(file = "data/safari_fwf_missing.dat", 
                   widths = c(11,3,5,3,14,9,9,4,18),
                   header = T,
                   skip = 3,
                   na.strings  = c("","."),
                   strip.white = T,
                   sep =",")
safari
        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3              6        NA         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Formatting data

Reading the data into R is only the first step in preparing the data for analysis. The statistician must check that numbers are treated as numbers, character strings as character strings, and may wish also to check ensure that dates and times are interpreted as such, rather than as arbitrary character strings.

Column classes

A good way to see the types, or more properly the classes assigned to the columns is to apply with sapply() the class() function to the data frame:

sapply(safari,class)
          date     wildebeest laughing.hyena      crocodile        weather 
   "character"      "integer"    "character"      "integer"    "character" 
         start            end            fun          guide 
   "character"    "character"    "character"    "character" 

The reason laughing.hyena is read in as character column is that it has the value "none" on one line. Let’s overwrite "none" with 0, noting that when we do this, our 0 will be coerced to the character string "0", since the column is a character column; however, we can fix this by changing the class of this column to integer, as below:

safari$laughing.hyena[safari$laughing.hyena=="none"] <- 0
class(safari$laughing.hyena) <- "integer"
sapply(safari,class)
          date     wildebeest laughing.hyena      crocodile        weather 
   "character"      "integer"      "integer"      "integer"    "character" 
         start            end            fun          guide 
   "character"    "character"    "character"    "character" 

The remaining columns appear to have been read in as expected. We may wish, however, that the years in the data column be recognized as such, whereas currently they are merely strings of characters which have no meaning to the software.

Dates

The first column of the safari data set contains dates, but they have been read in as character strings. We would like R to recognize them as dates. To convert character strings to dates we can use the as.Date() function. By default the function assumes the format yyyy-mm-dd, but other formats can be specified:

as.Date("2014-07-05")
[1] "2014-07-05"
as.Date("7/5/2014") # will not be recognized correctly. Must specify a format!
[1] "0007-05-20"
as.Date("7/5/2014", format = "%m/%d/%Y")
[1] "2014-07-05"
as.Date("July 7, 2014", format = "%B %d, %Y")
[1] "2014-07-07"

Run ?as.Date to read more about date formats.

A date object stores the date as the number of days which have passed since January 1st, 1970. Dates earlier than this are stored as negative values:

as.Date(0)
[1] "1970-01-01"
as.Date(1)
[1] "1970-01-02"
as.Date(-1)
[1] "1969-12-31"

There are several cool ways to play with dates in R:

today <- Sys.Date() # today's date
today
[1] "2025-09-11"
today - 30 # the date 30 days ago
[1] "2025-08-12"
seq(today,by="3 days",length = 12) # make a sequence of dates
 [1] "2025-09-11" "2025-09-14" "2025-09-17" "2025-09-20" "2025-09-23"
 [6] "2025-09-26" "2025-09-29" "2025-10-02" "2025-10-05" "2025-10-08"
[11] "2025-10-11" "2025-10-14"
weekdays(as.Date(c("2014-07-05","2017-07-05")))
[1] "Saturday"  "Wednesday"
months(today) # extract the month from a date
[1] "September"

We can print a date in a different formats other than the default with the format() function:

format(today,"%m/%d/%y")
[1] "09/11/25"
format(today,"%b %d, %Y")
[1] "Sep 11, 2025"
format(today,"%B %d, %Y")
[1] "September 11, 2025"

In the date column of the safari data set, the dates are recorded in many different formats. It takes a little bit of work to standardize the formats: Below, we use a loop to go through the entries of the date column to extract the date from each one with the as.Date() function. We have to list all the different formats to try by using the tryFormats option. Then we re-print the date in the format we want to keep. After the loop is done, we use as.Date() again to convert the entire vector of character strings into a column having the date class.

To convert the character strings in the data column of the safari data set, which represent dates to our eyes but which are not yet interpreted as dates by our software, we can use the as.Date() function, specifying the format in which they are written, to convert the character strings to actual date values. The code below overwrites the date colums with these true dates, and as a bonus, add to the data set a weekday column:

safari$date <- as.Date(safari$date, "%m/%d/%Y")
safari$day <- weekdays(safari$date)
safari
        date wildebeest laughing.hyena crocodile       weather   start      end
1 1999-01-13         12              0         2         sunny 7:21 am  4:14 pm
2 2001-04-28          3              1         1        cloudy 6:25 am 12:33 pm
3 2010-10-15          3              6        NA         rainy 8:12 am     <NA>
4 2006-03-02          1             14         5     hot/sunny 7:15 am  3:12 pm
5 1988-02-28          2              6         3 partly cloudy 4:53 am  2:16 pm
6 2015-07-14          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide       day
1 yes       Joshua Tebbs Wednesday
2   Y         Edsel Peña  Saturday
3  no Karl Bruce Gregory    Friday
4   y      Lianming Wang  Thursday
5 Yes       Brian Habing    Sunday
6  No            Edwards   Tuesday

Now we see that the first column has the date class.

sapply(safari,class)
          date     wildebeest laughing.hyena      crocodile        weather 
        "Date"      "integer"      "integer"      "integer"    "character" 
         start            end            fun          guide            day 
   "character"    "character"    "character"    "character"    "character" 

Now that we have actual dates in the data set, the dates can be interpreted as such by other functions, such as the plot() function:

plot(wildebeest ~ date, data=safari)

Text processing

It is helpful to know a few tricks for dealing with text data. Here we will introduce a few functions that should come in handy.

Suppose we want to standardize the responses in the fun column of the safari data. One way to do this is to make a character vector containing all the strings we should interpret as “yes”; then we can use the operator %in% to check, for each entry of the fun column of the safari data, if its value is one of the values in our vector. Thus:

yes <- c("y","yes","Y","Yes")
safari$fun <- safari$fun %in% yes
safari
        date wildebeest laughing.hyena crocodile       weather   start      end
1 1999-01-13         12              0         2         sunny 7:21 am  4:14 pm
2 2001-04-28          3              1         1        cloudy 6:25 am 12:33 pm
3 2010-10-15          3              6        NA         rainy 8:12 am     <NA>
4 2006-03-02          1             14         5     hot/sunny 7:15 am  3:12 pm
5 1988-02-28          2              6         3 partly cloudy 4:53 am  2:16 pm
6 2015-07-14          3             12         0        cloudy 5:47 am  3:46 pm
    fun              guide       day
1  TRUE       Joshua Tebbs Wednesday
2  TRUE         Edsel Peña  Saturday
3 FALSE Karl Bruce Gregory    Friday
4  TRUE      Lianming Wang  Thursday
5  TRUE       Brian Habing    Sunday
6 FALSE            Edwards   Tuesday

Now the fun column is has logical values.

Next, suppose we want to re-format the start and end times so that the 24-hour clock is used (this would be helpful to us if we later want read these times as actual times instead of arbitrary character strings). We will write a function which does this, introducing a few helpful functions along the way.

The substr() function extracts a substring from a character string:

substr("howdy", start = 2, stop = 3)
[1] "ow"
substr("howdy", start = 5, stop = 5)
[1] "y"

The regexpr() function returns the location in a character strong of the first match to a given pattern.

regexpr("ow","howdy") # look for pattern "ow" in string "howdy"
[1] 2
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
regexpr("o","howdy-do") 
[1] 2
attr(,"match.length")
[1] 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

The gregexr() returns not just the position of the first match, but of every match:

gregexpr("o","howdy-do")
[[1]]
[1] 2 8
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
gregexpr("o[ ]","howdy-do to you") # returns the position of "o" only when it is followed by a space
[[1]]
[1]  8 11
attr(,"match.length")
[1] 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

The gregexpr() function and other functions like it (run ?gregexpr() to see other similar functions) allow the use of regular expressions to set patterns for which to look for matches. These have a standard syntax across many software programs and operating systems (so they are useful not just for R).

The function sprintf() is used to print numbers as character strings in very specific formats:

e <- exp(1)
sprintf(e,fmt = "%f")
[1] "2.718282"
sprintf(e,fmt = "%.2f")
[1] "2.72"
sprintf(e,fmt = "%05.2f")
[1] "02.72"
sprintf(e,fmt = "%E")
[1] "2.718282E+00"

Print integer values with leading zeros:

sprintf(1,fmt = "%02d")
[1] "01"
sprintf(7,fmt = "%03d")
[1] "007"

Now we will use the substr(), regexpr(), and sprintf() functions to build a function that converts times given as hh:mm with an am or pm afterwards to a 24-hour clock time in the format hh:mm.

twelve2twentyfour <- function(x){

  if(is.na(x)) return(NA)
    
  colon <- regexpr(":",x) # get position of colon
  hh <- as.numeric(substr(x,start = 1,stop = colon - 1))
  mm <- substr(x,start = colon + 1, stop = colon + 2)
  
  m <- regexpr("m",x) # get position of character "m"
  ap <- substr(x,start=m-1,stop = m-1)
  if((ap == "p") & (hh != 12)) hh <- hh + 12
  if((ap == "a") & (hh == 12)) hh <- hh - 12
  
  val <- paste(sprintf(hh,fmt = "%02d"),mm,sep=":")
  
  return(val)
  
}

twelve2twentyfour("2:20 pm")
[1] "14:20"
twelve2twentyfour("12:20am")
[1] "00:20"
twelve2twentyfour(" 9:45 am")
[1] "09:45"
twelve2twentyfour("12:15 pm")
[1] "12:15"

Now we replace the start and end times in our safari data set with the newly formatted times by using the sapply() function to apply our twelve2twentyfour() function to each value in the start and end columns.

safari$start <- sapply(safari$start,twelve2twentyfour)
safari$end <- sapply(safari$end,twelve2twentyfour)

safari
        date wildebeest laughing.hyena crocodile       weather start   end
1 1999-01-13         12              0         2         sunny 07:21 16:14
2 2001-04-28          3              1         1        cloudy 06:25 12:33
3 2010-10-15          3              6        NA         rainy 08:12  <NA>
4 2006-03-02          1             14         5     hot/sunny 07:15 15:12
5 1988-02-28          2              6         3 partly cloudy 04:53 14:16
6 2015-07-14          3             12         0        cloudy 05:47 15:46
    fun              guide       day
1  TRUE       Joshua Tebbs Wednesday
2  TRUE         Edsel Peña  Saturday
3 FALSE Karl Bruce Gregory    Friday
4  TRUE      Lianming Wang  Thursday
5  TRUE       Brian Habing    Sunday
6 FALSE            Edwards   Tuesday

Another very useful function is the grep() function which looks for a pattern in the entries of a character vector and returns in the indices in which the pattern was found. Below, we make a new logical column in the safari data set called “overcast” which will be TRUE except when the pattern "sunny" appears in the string describing the weather.

safari$overcast <- rep(TRUE,nrow(safari)) # set all equal to true
safari$overcast[grep("sunny",safari$weather)] <- FALSE # make false the sunny days
safari
        date wildebeest laughing.hyena crocodile       weather start   end
1 1999-01-13         12              0         2         sunny 07:21 16:14
2 2001-04-28          3              1         1        cloudy 06:25 12:33
3 2010-10-15          3              6        NA         rainy 08:12  <NA>
4 2006-03-02          1             14         5     hot/sunny 07:15 15:12
5 1988-02-28          2              6         3 partly cloudy 04:53 14:16
6 2015-07-14          3             12         0        cloudy 05:47 15:46
    fun              guide       day overcast
1  TRUE       Joshua Tebbs Wednesday    FALSE
2  TRUE         Edsel Peña  Saturday     TRUE
3 FALSE Karl Bruce Gregory    Friday     TRUE
4  TRUE      Lianming Wang  Thursday    FALSE
5  TRUE       Brian Habing    Sunday     TRUE
6 FALSE            Edwards   Tuesday     TRUE

As another exercise, suppose we wish to replace the names in the guide column with the first initial, followed by a period, and then the last name after a space, where we do not change the name if only the last name is given. The code below makes a function to convert each name to this shortened version and applies the function to the guide column of the safari data set.

shortname <- function(ch){
  
  # check for spaces in the name
  spaces <- grep(" ",ch)
  if(length(spaces) == 0){# if no spaces
    
    val <- ch
    
  } else {
  
    # find position of first character in last name  
    ln_start <- max(gregexpr(" ",ch)[[1]]) + 1
    
    # get last name
    ln <- substr(ch,ln_start,nchar(ch))
    
    # get first initial
    finit <- substr(ch,1,1)
    
    # put together
    val <- paste(finit,ln,sep = ". ")  
    
  }
  
  return(val)
    
}

safari$guide <- sapply(safari$guide,shortname)
safari
        date wildebeest laughing.hyena crocodile       weather start   end
1 1999-01-13         12              0         2         sunny 07:21 16:14
2 2001-04-28          3              1         1        cloudy 06:25 12:33
3 2010-10-15          3              6        NA         rainy 08:12  <NA>
4 2006-03-02          1             14         5     hot/sunny 07:15 15:12
5 1988-02-28          2              6         3 partly cloudy 04:53 14:16
6 2015-07-14          3             12         0        cloudy 05:47 15:46
    fun      guide       day overcast
1  TRUE   J. Tebbs Wednesday    FALSE
2  TRUE    E. Peña  Saturday     TRUE
3 FALSE K. Gregory    Friday     TRUE
4  TRUE    L. Wang  Thursday    FALSE
5  TRUE  B. Habing    Sunday     TRUE
6 FALSE    Edwards   Tuesday     TRUE

Date-times

Suppose we wish to create another column in the data set giving the duration of each safari, say in the format hh:mm. To compute the entries of this column we need to subtract the end times from the start times. Since we have now converted the times to 24-hour clock times, it will be simple to convert these new character strongs to actual calendar times.

The POSIXct class stores calendar times as a number of seconds from the time 00:00:00 (midnight) in the timezone GMT on January 1st, 1970 (when the machines awoke…). The name “POSIX” refers to a set of standards designed to promote compatibility of code across operating systems, and “ct” stands for calendar time. The default format in which R prints objects of this class is yyyy-mm-dd hh:mm:ss tz, where tz is the time zone.

time0 <- as.POSIXct(0,tz="GMT") # Greenwich mean time
time0
[1] "1970-01-01 GMT"
time0 + 30 # thirty seconds later
[1] "1970-01-01 00:00:30 GMT"
time0 - 60 # one minute earlier
[1] "1969-12-31 23:59:00 GMT"
# birth of my daughter Viva
viva <- as.POSIXct("2015-11-22 02:23", tz="CET")
as.numeric(viva) # number of seconds since "origin"
[1] 1448155380
print(viva,tz = "US/Eastern") # in EST
[1] "2015-11-21 20:23:00 EST"
print(viva,tz = "US/Central") # in CST
[1] "2015-11-21 19:23:00 CST"

We can read calender times in other formats using the strptime() function:

# birth of my daughter Lois
lois <- strptime("7/5/2017 21:56", format = "%m/%d/%Y %H:%M")
lois
[1] "2017-07-05 21:56:00 EDT"

Note that the 24-hour clock is always used (not the 12-hour clock with “am” and “pm”). Since we have already converted the start and end times in the safari data to 24-hour times, we can use the paste() function to construct character strings which can be interpreted as calendar times:

paste(safari$date,safari$start)
[1] "1999-01-13 07:21" "2001-04-28 06:25" "2010-10-15 08:12" "2006-03-02 07:15"
[5] "1988-02-28 04:53" "2015-07-14 05:47"

To convert these character strings to calendar times, it is most convenient to use the strptime() function. Afterwards we can use the timediff() function to get the difference in time between the calendar time representations of the start and end times. The below obtains these differences in minutes and then converts the number of minutes into the format hh:mm:

# get the start and end times
start <- strptime(paste(safari$date,safari$start),"%Y-%m-%d %H:%M")
end <- strptime(paste(safari$date,safari$end),"%Y-%m-%d %H:%M")

# get the difference in number of minutes
minutes <- as.integer(difftime(end,start,unit="min"))

# write this as hh:mm
duration <- paste(sprintf(floor(minutes/60),fmt="%02.f"),
                  sprintf(minutes %% 60,fmt="%02.f"),sep=":")

# find durations with string "NA" replace with real NA
duration[grep(pattern = "NA",x = duration)] <- NA

# add a column to the data set
safari$duration <- duration
safari
        date wildebeest laughing.hyena crocodile       weather start   end
1 1999-01-13         12              0         2         sunny 07:21 16:14
2 2001-04-28          3              1         1        cloudy 06:25 12:33
3 2010-10-15          3              6        NA         rainy 08:12  <NA>
4 2006-03-02          1             14         5     hot/sunny 07:15 15:12
5 1988-02-28          2              6         3 partly cloudy 04:53 14:16
6 2015-07-14          3             12         0        cloudy 05:47 15:46
    fun      guide       day overcast duration
1  TRUE   J. Tebbs Wednesday    FALSE    08:53
2  TRUE    E. Peña  Saturday     TRUE    06:08
3 FALSE K. Gregory    Friday     TRUE     <NA>
4  TRUE    L. Wang  Thursday    FALSE    07:57
5  TRUE  B. Habing    Sunday     TRUE    09:23
6 FALSE    Edwards   Tuesday     TRUE    09:59

Sorting a data frame

We can use the sort_by() function to sort the rows of a data frame according to the values in one (or more columns):

sort_by(safari, ~ date) # sort by the date column
        date wildebeest laughing.hyena crocodile       weather start   end
5 1988-02-28          2              6         3 partly cloudy 04:53 14:16
1 1999-01-13         12              0         2         sunny 07:21 16:14
2 2001-04-28          3              1         1        cloudy 06:25 12:33
4 2006-03-02          1             14         5     hot/sunny 07:15 15:12
3 2010-10-15          3              6        NA         rainy 08:12  <NA>
6 2015-07-14          3             12         0        cloudy 05:47 15:46
    fun      guide       day overcast duration
5  TRUE  B. Habing    Sunday     TRUE    09:23
1  TRUE   J. Tebbs Wednesday    FALSE    08:53
2  TRUE    E. Peña  Saturday     TRUE    06:08
4  TRUE    L. Wang  Thursday    FALSE    07:57
3 FALSE K. Gregory    Friday     TRUE     <NA>
6 FALSE    Edwards   Tuesday     TRUE    09:59
sort_by(safari, ~ fun + date) # sort first by fun, then by date within fun
        date wildebeest laughing.hyena crocodile       weather start   end
3 2010-10-15          3              6        NA         rainy 08:12  <NA>
6 2015-07-14          3             12         0        cloudy 05:47 15:46
5 1988-02-28          2              6         3 partly cloudy 04:53 14:16
1 1999-01-13         12              0         2         sunny 07:21 16:14
2 2001-04-28          3              1         1        cloudy 06:25 12:33
4 2006-03-02          1             14         5     hot/sunny 07:15 15:12
    fun      guide       day overcast duration
3 FALSE K. Gregory    Friday     TRUE     <NA>
6 FALSE    Edwards   Tuesday     TRUE    09:59
5  TRUE  B. Habing    Sunday     TRUE    09:23
1  TRUE   J. Tebbs Wednesday    FALSE    08:53
2  TRUE    E. Peña  Saturday     TRUE    06:08
4  TRUE    L. Wang  Thursday    FALSE    07:57