Importing and formatting data

Author

Karl Gregory

Importing data tends to be a large part of the working statistician’s job (for good or ill). Here we’ll learn some tools for reading data sets into R when these are stored in a plain text file, sometimes with the extensions .dat,.txt, or .csv. The extension doesn’t really matter; we consider any file that can be opened by a simple text editor.

Reading data

Typically data are stored in text files where each row corresponds to a row in a spreadsheet, and the values are separated by some character, very often a comma. If not a comma, it may be a tab (long space) or other delimiter. If the values are not delimited by a special character, they may be in so-called fixed-width format, which we describe in the following.

The data sets used in this note can be downloaded from this link. Get these ones:

In order to access a data file after you have saved it on your computer, you will have to provide the path to the file. This can be a complete file path or the relative file path from the current working directory. The working directory is usually the directory (i.e. folder) from where you opened your R script, but you can change it with the setwd() command, and you can find out what it is with the getwd() command. In the following, the data files are accessed in a folder called data which is sitting in the current working directory, so each file path looks like data/<filename>.

Comma-separated

Suppose a data file looks like this:

date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999,12,none,2,sunny,7:21 am,4:14 pm,yes,Joshua Tebbs
4/28/2001,3,1,1,cloudy,6:25 am,12:33 pm,Y,Edsel Peña
10/15/2010,3,0,6,rainy,8:12 am,11:34 am,no,Karl Bruce Gregory
3/02/2006,1,14,5,hot/sunny,7:15 am,3:12 pm,y,Lianming Wang
2/28/1988,2,6,3,partly cloudy,4:53 am,2:16 pm,Yes,Brian Habing
7/14/2015,3,12,0,cloudy,5:47 am,3:46 pm,No,Edwards

We see that the first row appears to give column names, and that the values are separated by commas. We can read this into R with the read.table() function. We specify the options sep = "," since the values are separated by the character , and put header = T so that the first row is assumed to contain column names:

safari <- read.table(file = "data/safari_comma.dat", 
                     sep = ",", 
                     header = TRUE)
safari

        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3              0         6         rainy 8:12 am 11:34 am
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Sometimes the data do not begin on the first row of the text file. Suppose the file looked like this:

Some make-believe safari data
Woohoo!!

date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999,12,none,2,sunny,7:21 am,4:14 pm,yes,Joshua Tebbs
4/28/2001,3,1,1,cloudy,6:25 am,12:33 pm,Y,Edsel Peña
10/15/2010,3,.,6,rainy,8:12 am,,no,Karl Bruce Gregory
3/02/2006,1,14,5,hot/sunny,7:15 am,3:12 pm,y,Lianming Wang
2/28/1988,2,6,3,partly cloudy,4:53 am,2:16 pm,Yes,Brian Habing
7/14/2015,3,12,0,cloudy,5:47 am,3:46 pm,No,Edwards

Now, in addition to the fact that the data do not begin on line 1 of the file, we see that there are a couple of missing values in this version of the data. The statistician is forever coping with missing data. Here we must specify which characters should be taken as missing values. These we specify with the option na.strings = as shown below. In order to skip the first three lines of data, we use the skip = option.

safari <- read.table(file = "data/safari_comma_missing.dat", 
                     sep = ",", 
                     header = TRUE,
                     skip = 3,
                     na.strings = c("","."))
safari

        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3           <NA>         6         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

In the special case that the data values are separated by commas and the first line contains column names, we can use the function read.csv() to achieve the same result. It is simply a “wrapper” for the read.table() function, which means that it simple executes the latter with certain options already specified, such as sep = "," and header = T. Data files with data values separated by commas often have the extension .csv.

safari <- read.csv(file = "data/safari_comma_missing.dat", 
                   skip = 3,
                   na.strings = c("","."))
safari

        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3           <NA>         6         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Other-delimited

As we have said, sometimes the data values are not delimited by a comma. For example, we might have a data file like this:

Some make-believe safari data
Woohoo!!

date;wildebeest;laughing hyena;crocodile;weather;start;end;fun;guide
1/13/1999;12;none;2;sunny;7:21 am;4:14 pm;yes;Joshua Tebbs
4/28/2001;3;1;1;cloudy;6:25 am;12:33 pm;Y;Edsel Peña
10/15/2010;3;.;6;rainy;8:12 am;;no;Karl Bruce Gregory
3/02/2006;1;14;5;hot/sunny;7:15 am;3:12 pm;y;Lianming Wang
2/28/1988;2;6;3;partly cloudy;4:53 am;2:16 pm;Yes;Brian Habing
7/14/2015;3;12;0;cloudy;5:47 am;3:46 pm;No;Edwards

To read this in, we have only to specify the delimiter with sep = ";":

safari <- read.table(file = "data/safari_semi_missing.dat", 
                     sep = ";", 
                     header = TRUE,
                     skip = 3,
                     na.strings = c("","."))
safari

        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3           <NA>         6         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Data are often tab delimited, looking like this:

Some make-believe safari data
Woohoo!!

date    wildebeest  laughing hyena  crocodile   weather start   end fun guide
1/13/1999   12  none    2   sunny   7:21 am 4:14 pm yes Joshua Tebbs
4/28/2001   3   1   1   cloudy  6:25 am 12:33 pm    Y   Edsel Peña
10/15/2010  3   .   6   rainy   8:12 am     no  Karl Bruce Gregory
3/02/2006   1   14  5   hot/sunny   7:15 am 3:12 pm y   Lianming Wang
2/28/1988   2   6   3   partly cloudy   4:53 am 2:16 pm Yes Brian Habing
7/14/2015   3   12  0   cloudy  5:47 am 3:46 pm No  Edwards

To specify the tab as the delimiting character, one must put sep = "\t":

safari <- read.table(file = "data/safari_tab_missing.dat", 
                     sep = "\t", 
                     header = TRUE,
                     skip = 3,
                     na.strings = c("","."))
safari

        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3           <NA>         6         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Fixed-width

Sometimes the data values are not separated by any special delimiting character, but rather arranged in a format such that the values belonging to a column are always place in the same columns of text characters. This is called a fixed-width format. Here is an example:

writeLines(readLines(con = "data/safari_fwf_missing.dat"))

Some make-believe safari data
Woohoo!!

date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999  12 none 2  sunny         7:21 am  4:14 pm  yes Joshua Tebbs
4/28/2001  3  1    1  cloudy        6:25 am  12:33 pm Y   Edsel Peña
10/15/2010 3  6       rainy         8:12 am           no  Karl Bruce Gregory
3/02/2006  1  14   5  hot/sunny     7:15 am  3:12 pm  y   Lianming Wang
2/28/1988  2  6    3  partly cloudy 4:53 am  2:16 pm  Yes Brian Habing
7/14/2015  3  12   0  cloudy        5:47 am  3:46 pm  No  Edwards

We can use the read.fwf() function to read this file. We must specify how many characters are alloted to each column using the widths = option. Also important is the strip.white=T option, which will remove from character strings the extra spaces that are read in when reading fixed-width data:

safari <- read.fwf(file = "data/safari_fwf_missing.dat", 
                   widths = c(11,3,5,3,14,9,9,4,18),
                   header = T,
                   skip = 3,
                   na.strings  = c("","."),
                   strip.white = T,
                   sep =",")
safari

        date wildebeest laughing.hyena crocodile       weather   start      end
1  1/13/1999         12           none         2         sunny 7:21 am  4:14 pm
2  4/28/2001          3              1         1        cloudy 6:25 am 12:33 pm
3 10/15/2010          3              6        NA         rainy 8:12 am     <NA>
4  3/02/2006          1             14         5     hot/sunny 7:15 am  3:12 pm
5  2/28/1988          2              6         3 partly cloudy 4:53 am  2:16 pm
6  7/14/2015          3             12         0        cloudy 5:47 am  3:46 pm
  fun              guide
1 yes       Joshua Tebbs
2   Y         Edsel Peña
3  no Karl Bruce Gregory
4   y      Lianming Wang
5 Yes       Brian Habing
6  No            Edwards

Formatting data

Reading the data into R is only the first step in preparing the data for analysis. The statistician must check that numbers are treated as numbers, character strings as character strings, and may wish also to check ensure that dates and times are interpreted as such, rather than as arbitrary character strings.

Column classes

A good way to see the types, or more properly the classes assigned to the columns is to apply with sapply() the class() function to the data frame:

sapply(safari,class)

          date     wildebeest laughing.hyena      crocodile        weather 
   "character"      "integer"    "character"      "integer"    "character" 
         start            end            fun          guide 
   "character"    "character"    "character"    "character"

The reason laughing.hyena is read in as character column is that it has the value "none" on one line. Let’s overwrite "none" with 0, noting that when we do this, our 0 will be coerced to the character string "0", since the column is a character column; however, we can fix this by changing the class of this column to integer, as below:

safari$laughing.hyena[safari$laughing.hyena=="none"] <- 0
class(safari$laughing.hyena) <- "integer"
sapply(safari,class)

          date     wildebeest laughing.hyena      crocodile        weather 
   "character"      "integer"      "integer"      "integer"    "character" 
         start            end            fun          guide 
   "character"    "character"    "character"    "character"

The remaining columns appear to have been read in as expected. We may wish, however, that the years in the data column be recognized as such, whereas currently they are merely strings of characters which have no meaning to the software.

If we want to change the names of the columns, we can use the colnames() function. Below we change the name of the laughing hyena column:

colnames(safari)

[1] "date"           "wildebeest"     "laughing.hyena" "crocodile"     
[5] "weather"        "start"          "end"            "fun"           
[9] "guide"

colnames(safari)[3] <- "hyena"
safari

        date wildebeest hyena crocodile       weather   start      end fun
1  1/13/1999         12     0         2         sunny 7:21 am  4:14 pm yes
2  4/28/2001          3     1         1        cloudy 6:25 am 12:33 pm   Y
3 10/15/2010          3     6        NA         rainy 8:12 am     <NA>  no
4  3/02/2006          1    14         5     hot/sunny 7:15 am  3:12 pm   y
5  2/28/1988          2     6         3 partly cloudy 4:53 am  2:16 pm Yes
6  7/14/2015          3    12         0        cloudy 5:47 am  3:46 pm  No
               guide
1       Joshua Tebbs
2         Edsel Peña
3 Karl Bruce Gregory
4      Lianming Wang
5       Brian Habing
6            Edwards

Text processing

It is helpful to know a few tricks for dealing with text data. Here we will introduce a few functions that should come in handy.

Suppose we want to standardize the responses in the fun column of the safari data. One way to do this is to make a character vector containing all the strings we should interpret as “yes”; then we can use the operator %in% to check, for each entry of the fun column of the safari data, if its value is one of the values in our vector. Thus:

yes <- c("y","yes","Y","Yes")
safari$fun <- safari$fun %in% yes
safari

        date wildebeest hyena crocodile       weather   start      end   fun
1  1/13/1999         12     0         2         sunny 7:21 am  4:14 pm  TRUE
2  4/28/2001          3     1         1        cloudy 6:25 am 12:33 pm  TRUE
3 10/15/2010          3     6        NA         rainy 8:12 am     <NA> FALSE
4  3/02/2006          1    14         5     hot/sunny 7:15 am  3:12 pm  TRUE
5  2/28/1988          2     6         3 partly cloudy 4:53 am  2:16 pm  TRUE
6  7/14/2015          3    12         0        cloudy 5:47 am  3:46 pm FALSE
               guide
1       Joshua Tebbs
2         Edsel Peña
3 Karl Bruce Gregory
4      Lianming Wang
5       Brian Habing
6            Edwards

Now the fun column is has logical values.

The substr() function extracts a substring from a character string:

substr("howdy", start = 2, stop = 3)

[1] "ow"

substr("howdy", start = 5, stop = 5)

[1] "y"

The regexpr() function returns the location in a character strong of the first match to a given pattern.

regexpr("ow","howdy") # look for pattern "ow" in string "howdy"

[1] 2
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

regexpr("o","howdy-do")

[1] 2
attr(,"match.length")
[1] 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

The gregexr() returns not just the position of the first match, but of every match:

gregexpr("o","howdy-do")

[[1]]
[1] 2 8
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

The gregexpr() function and other functions like it (run ?gregexpr() to see other similar functions) allow the use of regular expressions to set patterns for which to look for matches. These have a standard syntax across many software programs and operating systems (so they are useful not just for R). More can be found on regular expressions here. Below we use a regular expression to retrieve the positions of “o” when this is followed by any lower case letter:

gregexpr("o[a-z]","howdy-do to you") # returns the position of "o" only when it is followed by a space

[[1]]
[1]  2 14
attr(,"match.length")
[1] 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

The function sprintf() is used to print numbers as character strings in very specific formats:

e <- exp(1)
sprintf(e,fmt = "%f")

[1] "2.718282"

sprintf(e,fmt = "%.2f")

[1] "2.72"

sprintf(e,fmt = "%05.2f")

[1] "02.72"

sprintf(e,fmt = "%E")

[1] "2.718282E+00"

Print integer values with leading zeros:

sprintf(1,fmt = "%02d")

[1] "01"

sprintf(7,fmt = "%03d")

[1] "007"

Another very useful function is the grep() function which looks for a pattern in the entries of a character vector and returns in the indices in which the pattern was found. Below, we make a new logical column in the safari data set called “overcast” which will be TRUE except when the pattern "sunny" appears in the string describing the weather.

safari$overcast <- TRUE                      # set all equal to true
sunny_days <- grep("sunny",safari$weather)   # find which days are sunny
safari$overcast[sunny_days] <- FALSE         # make false the sunny days
safari

        date wildebeest hyena crocodile       weather   start      end   fun
1  1/13/1999         12     0         2         sunny 7:21 am  4:14 pm  TRUE
2  4/28/2001          3     1         1        cloudy 6:25 am 12:33 pm  TRUE
3 10/15/2010          3     6        NA         rainy 8:12 am     <NA> FALSE
4  3/02/2006          1    14         5     hot/sunny 7:15 am  3:12 pm  TRUE
5  2/28/1988          2     6         3 partly cloudy 4:53 am  2:16 pm  TRUE
6  7/14/2015          3    12         0        cloudy 5:47 am  3:46 pm FALSE
               guide overcast
1       Joshua Tebbs    FALSE
2         Edsel Peña     TRUE
3 Karl Bruce Gregory     TRUE
4      Lianming Wang    FALSE
5       Brian Habing     TRUE
6            Edwards     TRUE

As another exercise, suppose we wish to replace the names in the guide column with the first initial, followed by a period, and then the last name after a space, where we do not change the name if only the last name is given. The code below makes a function to convert each name to this shortened version and applies the function to the guide column of the safari data set. The strsplit() function comes in very handy here.

It works like this:

ch <- "Howdy ho!"
strsplit(ch," ")

[[1]]
[1] "Howdy" "ho!"

strsplit(ch," ")[[1]]

[1] "Howdy" "ho!"

So now we use it to make a function to abbreviate the names:

nameabb <- function(ch){
  
  full <- strsplit(ch," ")[[1]] # get first element of the list
  n <- length(full)
  
  if(n == 1){
    
    abb <- full
      
  } else {
    
    abb <- paste(substr(full[1],1,1),". ",full[n],sep="")
    
  }
  
  return(abb)

}


safari$guide <- sapply(safari$guide,nameabb)
safari

        date wildebeest hyena crocodile       weather   start      end   fun
1  1/13/1999         12     0         2         sunny 7:21 am  4:14 pm  TRUE
2  4/28/2001          3     1         1        cloudy 6:25 am 12:33 pm  TRUE
3 10/15/2010          3     6        NA         rainy 8:12 am     <NA> FALSE
4  3/02/2006          1    14         5     hot/sunny 7:15 am  3:12 pm  TRUE
5  2/28/1988          2     6         3 partly cloudy 4:53 am  2:16 pm  TRUE
6  7/14/2015          3    12         0        cloudy 5:47 am  3:46 pm FALSE
       guide overcast
1   J. Tebbs    FALSE
2    E. Peña     TRUE
3 K. Gregory     TRUE
4    L. Wang    FALSE
5  B. Habing     TRUE
6    Edwards     TRUE

Some other useful function are the toupper() and tolower() functions, which convert lower case to capital letters and vice versa, respectively:

ch <- "johann sebastian bach"
toupper(ch)

[1] "JOHANN SEBASTIAN BACH"

ch <- "Die Jugend ist Allem fähig."
tolower(ch)

[1] "die jugend ist allem fähig."

Dates

The first column of the safari data set contains dates, but they have been read in as character strings. We would like R to recognize them as dates. To convert character strings to dates we can use the as.Date() function. By default the function assumes the format yyyy-mm-dd, but other formats can be specified:

as.Date("2014-07-05")

[1] "2014-07-05"

as.Date("7/5/2014") # will not be recognized correctly. Must specify a format!

[1] "0007-05-20"

as.Date("7/5/2014", format = "%m/%d/%Y")

[1] "2014-07-05"

as.Date("July 7, 2014", format = "%B %d, %Y")

[1] "2014-07-07"

Run ?as.Date to read more about date formats.

A date object stores the date as the number of days which have passed since January 1st, 1970. Dates earlier than this are stored as negative values:

as.Date(0)

[1] "1970-01-01"

as.Date(1)

[1] "1970-01-02"

as.Date(-1)

[1] "1969-12-31"

There are several cool ways to play with dates in R:

today <- Sys.Date() # today's date
today

[1] "2025-10-27"

today - 30 # the date 30 days ago

[1] "2025-09-27"

seq(today,by="3 days",length = 12) # make a sequence of dates

 [1] "2025-10-27" "2025-10-30" "2025-11-02" "2025-11-05" "2025-11-08"
 [6] "2025-11-11" "2025-11-14" "2025-11-17" "2025-11-20" "2025-11-23"
[11] "2025-11-26" "2025-11-29"

weekdays(as.Date(c("2014-07-05","2017-07-05")))

[1] "Saturday"  "Wednesday"

months(today) # extract the month from a date

[1] "October"

We can print a date in a different formats other than the default with the format() function:

format(today,"%m/%d/%y")

[1] "10/27/25"

format(today,"%b %d, %Y")

[1] "Oct 27, 2025"

format(today,"%B %d, %Y")

[1] "October 27, 2025"

In the date column of the safari data set, the dates are recorded in many different formats. It takes a little bit of work to standardize the formats: Below, we use a loop to go through the entries of the date column to extract the date from each one with the as.Date() function. We have to list all the different formats to try by using the tryFormats option. Then we re-print the date in the format we want to keep. After the loop is done, we use as.Date() again to convert the entire vector of character strings into a column having the date class.

To convert the character strings in the data column of the safari data set, which represent dates to our eyes but which are not yet interpreted as dates by our software, we can use the as.Date() function, specifying the format in which they are written, to convert the character strings to actual date values. The code below overwrites the date colums with these true dates, and as a bonus, add to the data set a weekday column:

safari$date <- as.Date(safari$date, "%m/%d/%Y")
safari$day <- weekdays(safari$date)
safari

        date wildebeest hyena crocodile       weather   start      end   fun
1 1999-01-13         12     0         2         sunny 7:21 am  4:14 pm  TRUE
2 2001-04-28          3     1         1        cloudy 6:25 am 12:33 pm  TRUE
3 2010-10-15          3     6        NA         rainy 8:12 am     <NA> FALSE
4 2006-03-02          1    14         5     hot/sunny 7:15 am  3:12 pm  TRUE
5 1988-02-28          2     6         3 partly cloudy 4:53 am  2:16 pm  TRUE
6 2015-07-14          3    12         0        cloudy 5:47 am  3:46 pm FALSE
       guide overcast       day
1   J. Tebbs    FALSE Wednesday
2    E. Peña     TRUE  Saturday
3 K. Gregory     TRUE    Friday
4    L. Wang    FALSE  Thursday
5  B. Habing     TRUE    Sunday
6    Edwards     TRUE   Tuesday

Now we see that the first column has the date class.

sapply(safari,class)

       date  wildebeest       hyena   crocodile     weather       start 
     "Date"   "integer"   "integer"   "integer" "character" "character" 
        end         fun       guide    overcast         day 
"character"   "logical" "character"   "logical" "character"

Now that we have actual dates in the data set, the dates can be interpreted as such by other functions, such as the plot() function:

plot(wildebeest ~ date, data=safari)

Times (Datetimes)

Next, suppose we want to re-format the start and end times in the safari data set so that they are 24-hour clock times.

We can use the strptime() function to turn character strings into actual time values:

strptime("2:20 am",format="%I:%M %p")

[1] "2025-10-27 02:20:00 EDT"

strptime("2:20 pm",format="%I:%M %p")

[1] "2025-10-27 14:20:00 EDT"

strptime("7/5/2017 21:56", format = "%m/%d/%Y %H:%M") # birth of Lois!

[1] "2017-07-05 21:56:00 EDT"

Note that if a date is not given, it uses the current day.

These time values are values of a class called the POSIXct class. It stores calendar times as a number of seconds from the time 00:00:00 (midnight) in the timezone GMT on January 1st, 1970 (when the machines awoke…). The name “POSIX” refers to a set of standards designed to promote compatibility of code across operating systems, and “ct” stands for calendar time. The default format in which R prints objects of this class is yyyy-mm-dd hh:mm:ss tz, where tz is the time zone.

Now we can use the strftime() to print the time values in whatever format we want:

start <- strptime(safari$start,format="%I:%M %p") # get the values as actual time
end <- strptime(safari$end,format="%I:%M %p")

start24 <- strftime(start,format="%H:%M")         # convert the times to strings in a particular format
end24 <- strftime(end,format="%H:%M")

safari$start <- start24                           # replace the columns of the data frame
safari$end <- end24
safari

        date wildebeest hyena crocodile       weather start   end   fun
1 1999-01-13         12     0         2         sunny 07:21 16:14  TRUE
2 2001-04-28          3     1         1        cloudy 06:25 12:33  TRUE
3 2010-10-15          3     6        NA         rainy 08:12  <NA> FALSE
4 2006-03-02          1    14         5     hot/sunny 07:15 15:12  TRUE
5 1988-02-28          2     6         3 partly cloudy 04:53 14:16  TRUE
6 2015-07-14          3    12         0        cloudy 05:47 15:46 FALSE
       guide overcast       day
1   J. Tebbs    FALSE Wednesday
2    E. Peña     TRUE  Saturday
3 K. Gregory     TRUE    Friday
4    L. Wang    FALSE  Thursday
5  B. Habing     TRUE    Sunday
6    Edwards     TRUE   Tuesday

Suppose we wish to create another column in the data set giving the duration of each safari, say in the format hh:mm.

We can use the timediff() function to get the difference in time between the calendar time representations of the start and end times. The below obtains these differences in minutes and then converts the number of minutes into the format hh:mm:

minutes <- as.integer(difftime(end,start,unit="min"))           # get the difference in number of minutes

duration <- paste(sprintf(floor(minutes/60),fmt="%02.f"),
                  sprintf(minutes %% 60,fmt="%02.f"),sep=":")   # write this as hh:mm

# find durations with string "NA" replace with real NA
duration[grep(pattern = "NA",x = duration)] <- NA

# add a column to the data set
safari$duration <- duration
safari

        date wildebeest hyena crocodile       weather start   end   fun
1 1999-01-13         12     0         2         sunny 07:21 16:14  TRUE
2 2001-04-28          3     1         1        cloudy 06:25 12:33  TRUE
3 2010-10-15          3     6        NA         rainy 08:12  <NA> FALSE
4 2006-03-02          1    14         5     hot/sunny 07:15 15:12  TRUE
5 1988-02-28          2     6         3 partly cloudy 04:53 14:16  TRUE
6 2015-07-14          3    12         0        cloudy 05:47 15:46 FALSE
       guide overcast       day duration
1   J. Tebbs    FALSE Wednesday    08:53
2    E. Peña     TRUE  Saturday    06:08
3 K. Gregory     TRUE    Friday     <NA>
4    L. Wang    FALSE  Thursday    07:57
5  B. Habing     TRUE    Sunday    09:23
6    Edwards     TRUE   Tuesday    09:59

Sorting a data frame

We can use the sort_by() function to sort the rows of a data frame according to the values in one (or more columns):

sort_by(safari, ~ date) # sort by the date column

        date wildebeest hyena crocodile       weather start   end   fun
5 1988-02-28          2     6         3 partly cloudy 04:53 14:16  TRUE
1 1999-01-13         12     0         2         sunny 07:21 16:14  TRUE
2 2001-04-28          3     1         1        cloudy 06:25 12:33  TRUE
4 2006-03-02          1    14         5     hot/sunny 07:15 15:12  TRUE
3 2010-10-15          3     6        NA         rainy 08:12  <NA> FALSE
6 2015-07-14          3    12         0        cloudy 05:47 15:46 FALSE
       guide overcast       day duration
5  B. Habing     TRUE    Sunday    09:23
1   J. Tebbs    FALSE Wednesday    08:53
2    E. Peña     TRUE  Saturday    06:08
4    L. Wang    FALSE  Thursday    07:57
3 K. Gregory     TRUE    Friday     <NA>
6    Edwards     TRUE   Tuesday    09:59

sort_by(safari, ~ fun + date) # sort first by fun, then by date within fun

        date wildebeest hyena crocodile       weather start   end   fun
3 2010-10-15          3     6        NA         rainy 08:12  <NA> FALSE
6 2015-07-14          3    12         0        cloudy 05:47 15:46 FALSE
5 1988-02-28          2     6         3 partly cloudy 04:53 14:16  TRUE
1 1999-01-13         12     0         2         sunny 07:21 16:14  TRUE
2 2001-04-28          3     1         1        cloudy 06:25 12:33  TRUE
4 2006-03-02          1    14         5     hot/sunny 07:15 15:12  TRUE
       guide overcast       day duration
3 K. Gregory     TRUE    Friday     <NA>
6    Edwards     TRUE   Tuesday    09:59
5  B. Habing     TRUE    Sunday    09:23
1   J. Tebbs    FALSE Wednesday    08:53
2    E. Peña     TRUE  Saturday    06:08
4    L. Wang    FALSE  Thursday    07:57

Practice

Practice importing and formatting the data set bird_sightings.tsv downloadable from here.

Read the data set in with the read.table() function. When properly imported it should look like this:

                  bird distance..ft.         location              when
1    Blue-footed booby            <2 Isla de la Plata  October 22, 2004
2                robin            20    out my window              <NA>
3             Cardinal            10      In the yard     June 18, 2010
4          Road runner             5            Texas December 20, 2023
5 xantus’s hummingbird             4    at the feeder September 1, 2020
6  Eastern wood-peewee            10         out east November 19, 2021
7           bald eagle          >100      on the lake      July 4, 2024
  had.binoculars
1             no
2            yes
3            yes
4             no
5             no
6            yes
7              Y

Rename the columns so that the data set looks like this:

                  Bird Dist(ft)         Location              Date Binocs
1    Blue-footed booby       <2 Isla de la Plata  October 22, 2004     no
2                robin       20    out my window              <NA>    yes
3             Cardinal       10      In the yard     June 18, 2010    yes
4          Road runner        5            Texas December 20, 2023     no
5 xantus’s hummingbird        4    at the feeder September 1, 2020     no
6  Eastern wood-peewee       10         out east November 19, 2021    yes
7           bald eagle     >100      on the lake      July 4, 2024      Y

Write code to convert the values in the date column to date values and then replace the values in this column with dates in the format dd.mm.yyyy so that the data set looks like this:

                  Bird Dist(ft)         Location       Date Binocs
1    Blue-footed booby       <2 Isla de la Plata 22.10.2004     no
2                robin       20    out my window       <NA>    yes
3             Cardinal       10      In the yard 18.06.2010    yes
4          Road runner        5            Texas 20.12.2023     no
5 xantus’s hummingbird        4    at the feeder 01.09.2020     no
6  Eastern wood-peewee       10         out east 19.11.2021    yes
7           bald eagle     >100      on the lake 04.07.2024      Y

Make the capitalization of the bird names conform to this convention: Each word in a bird name should be capitalized, except when there is a sequence of hyphenated words; of these only the first word should be capitalized. Write a function which corrects capitalizations according to this convention and apply it to the bird column of the data set. The result should look like this:

                  Bird Dist(ft)         Location       Date Binocs
1    Blue-footed Booby       <2 Isla de la Plata 22.10.2004     no
2                Robin       20    out my window       <NA>    yes
3             Cardinal       10      In the yard 18.06.2010    yes
4          Road Runner        5            Texas 20.12.2023     no
5 Xantus’s Hummingbird        4    at the feeder 01.09.2020     no
6  Eastern Wood-peewee       10         out east 19.11.2021    yes
7           Bald Eagle     >100      on the lake 04.07.2024      Y

Remove the inequality characters from the distance column and make it a column of numeric values; the same time, create a new column called “Dcens” containing the character strings “gt” if “>” was removed, “lt” if “<” was removed, and “eq”, otherwise.

                  Bird Dist(ft)         Location       Date Binocs Dcens
1    Blue-footed Booby        2 Isla de la Plata 22.10.2004     no    lt
2                Robin       20    out my window       <NA>    yes    eq
3             Cardinal       10      In the yard 18.06.2010    yes    eq
4          Road Runner        5            Texas 20.12.2023     no    eq
5 Xantus’s Hummingbird        4    at the feeder 01.09.2020     no    eq
6  Eastern Wood-peewee       10         out east 19.11.2021    yes    eq
7           Bald Eagle      100      on the lake 04.07.2024      Y    gt

Replace the values in the binoculars column with logical values.

                  Bird Dist(ft)         Location       Date Binocs Dcens
1    Blue-footed Booby        2 Isla de la Plata 22.10.2004  FALSE    lt
2                Robin       20    out my window       <NA>   TRUE    eq
3             Cardinal       10      In the yard 18.06.2010   TRUE    eq
4          Road Runner        5            Texas 20.12.2023  FALSE    eq
5 Xantus’s Hummingbird        4    at the feeder 01.09.2020  FALSE    eq
6  Eastern Wood-peewee       10         out east 19.11.2021   TRUE    eq
7           Bald Eagle      100      on the lake 04.07.2024   TRUE    gt

Sort the data according to the distances at which the birds were seen.

                  Bird Dist(ft)         Location       Date Binocs Dcens
1    Blue-footed Booby        2 Isla de la Plata 22.10.2004  FALSE    lt
5 Xantus’s Hummingbird        4    at the feeder 01.09.2020  FALSE    eq
4          Road Runner        5            Texas 20.12.2023  FALSE    eq
3             Cardinal       10      In the yard 18.06.2010   TRUE    eq
6  Eastern Wood-peewee       10         out east 19.11.2021   TRUE    eq
2                Robin       20    out my window       <NA>   TRUE    eq
7           Bald Eagle      100      on the lake 04.07.2024   TRUE    gt