date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999,12,none,2,sunny,7:21 am,4:14 pm,yes,Joshua Tebbs
4/28/2001,3,1,1,cloudy,6:25 am,12:33 pm,Y,Edsel Peña
10/15/2010,3,0,6,rainy,8:12 am,11:34 am,no,Karl Bruce Gregory
3/02/2006,1,14,5,hot/sunny,7:15 am,3:12 pm,y,Lianming Wang
2/28/1988,2,6,3,partly cloudy,4:53 am,2:16 pm,Yes,Brian Habing
7/14/2015,3,12,0,cloudy,5:47 am,3:46 pm,No,Edwards
Importing and formatting data
$$
$$
Importing data tends to be a large part of the working statistician’s job (for good or ill). Here we’ll learn some tools for reading data sets into R when these are stored in a plain text file, sometimes with the extensions .dat
,.txt
, or .csv
. The extension doesn’t really matter; we consider any file that can be opened by a simple text editor.
Reading data
Typically data are stored in text files where each row corresponds to a row in a spreadsheet, and the values are separated by some character, very often a comma. If not a comma, it may be a tab (long space) or other delimiter. If the values are not delimited by a special character, they may be in so-called fixed-width format, which we describe in the following.
Comma-separated
Suppose a data file looks like this:
We see that the first row appears to give column names, and that the values are separated by commas. We can read this into R with the read.table()
function. We specify the options sep = ","
since the values are separated by the character ,
and put header = T
so that the first row is assumed to contain column names:
<- read.table(file = "data/safari_comma.dat",
safari sep = ",",
header = TRUE)
safari
date wildebeest laughing.hyena crocodile weather start end
1 1/13/1999 12 none 2 sunny 7:21 am 4:14 pm
2 4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm
3 10/15/2010 3 0 6 rainy 8:12 am 11:34 am
4 3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm
5 2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm
6 7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm
fun guide
1 yes Joshua Tebbs
2 Y Edsel Peña
3 no Karl Bruce Gregory
4 y Lianming Wang
5 Yes Brian Habing
6 No Edwards
Sometimes the data do not begin on the first row of the text file. Suppose the file looked like this:
Some make-believe safari data
Woohoo!!
date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999,12,none,2,sunny,7:21 am,4:14 pm,yes,Joshua Tebbs
4/28/2001,3,1,1,cloudy,6:25 am,12:33 pm,Y,Edsel Peña
10/15/2010,3,.,6,rainy,8:12 am,,no,Karl Bruce Gregory
3/02/2006,1,14,5,hot/sunny,7:15 am,3:12 pm,y,Lianming Wang
2/28/1988,2,6,3,partly cloudy,4:53 am,2:16 pm,Yes,Brian Habing
7/14/2015,3,12,0,cloudy,5:47 am,3:46 pm,No,Edwards
Now, in addition to the fact that the data do not begin on line 1 of the file, we see that there are a couple of missing values in this version of the data. The statistician is forever coping with missing data. Here we must specify which characters should be taken as missing values. These we specify with the option na.strings =
as shown below. In order to skip the first three lines of data, we use the skip =
option.
<- read.table(file = "data/safari_comma_missing.dat",
safari sep = ",",
header = TRUE,
skip = 3,
na.strings = c("","."))
safari
date wildebeest laughing.hyena crocodile weather start end
1 1/13/1999 12 none 2 sunny 7:21 am 4:14 pm
2 4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm
3 10/15/2010 3 <NA> 6 rainy 8:12 am <NA>
4 3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm
5 2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm
6 7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm
fun guide
1 yes Joshua Tebbs
2 Y Edsel Peña
3 no Karl Bruce Gregory
4 y Lianming Wang
5 Yes Brian Habing
6 No Edwards
In the special case that the data values are separated by commas and the first line contains column names, we can use the function read.csv()
to achieve the same result. It is simply a “wrapper” for the read.table()
function, which means that it simple executes the latter with certain options already specified, such as sep = ","
and header = T
. Data files with data values separated by commas often have the extension .csv
.
<- read.csv(file = "data/safari_comma_missing.dat",
safari skip = 3,
na.strings = c("","."))
safari
date wildebeest laughing.hyena crocodile weather start end
1 1/13/1999 12 none 2 sunny 7:21 am 4:14 pm
2 4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm
3 10/15/2010 3 <NA> 6 rainy 8:12 am <NA>
4 3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm
5 2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm
6 7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm
fun guide
1 yes Joshua Tebbs
2 Y Edsel Peña
3 no Karl Bruce Gregory
4 y Lianming Wang
5 Yes Brian Habing
6 No Edwards
Other-delimited
As we have said, sometimes the data values are not delimited by a comma. For example, we might have a data file like this:
Some make-believe safari data
Woohoo!!
date;wildebeest;laughing hyena;crocodile;weather;start;end;fun;guide
1/13/1999;12;none;2;sunny;7:21 am;4:14 pm;yes;Joshua Tebbs
4/28/2001;3;1;1;cloudy;6:25 am;12:33 pm;Y;Edsel Peña
10/15/2010;3;.;6;rainy;8:12 am;;no;Karl Bruce Gregory
3/02/2006;1;14;5;hot/sunny;7:15 am;3:12 pm;y;Lianming Wang
2/28/1988;2;6;3;partly cloudy;4:53 am;2:16 pm;Yes;Brian Habing
7/14/2015;3;12;0;cloudy;5:47 am;3:46 pm;No;Edwards
To read this in, we have only to specify the delimiter with sep = ";"
:
<- read.table(file = "data/safari_semi_missing.dat",
safari sep = ";",
header = TRUE,
skip = 3,
na.strings = c("","."))
safari
date wildebeest laughing.hyena crocodile weather start end
1 1/13/1999 12 none 2 sunny 7:21 am 4:14 pm
2 4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm
3 10/15/2010 3 <NA> 6 rainy 8:12 am <NA>
4 3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm
5 2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm
6 7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm
fun guide
1 yes Joshua Tebbs
2 Y Edsel Peña
3 no Karl Bruce Gregory
4 y Lianming Wang
5 Yes Brian Habing
6 No Edwards
Data are often tab delimited, looking like this:
Some make-believe safari data
Woohoo!!
date wildebeest laughing hyena crocodile weather start end fun guide
1/13/1999 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs
4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña
10/15/2010 3 . 6 rainy 8:12 am no Karl Bruce Gregory
3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang
2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing
7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm No Edwards
To specify the tab as the delimiting character, one must put sep = "\t"
:
<- read.table(file = "data/safari_tab_missing.dat",
safari sep = "\t",
header = TRUE,
skip = 3,
na.strings = c("","."))
safari
date wildebeest laughing.hyena crocodile weather start end
1 1/13/1999 12 none 2 sunny 7:21 am 4:14 pm
2 4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm
3 10/15/2010 3 <NA> 6 rainy 8:12 am <NA>
4 3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm
5 2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm
6 7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm
fun guide
1 yes Joshua Tebbs
2 Y Edsel Peña
3 no Karl Bruce Gregory
4 y Lianming Wang
5 Yes Brian Habing
6 No Edwards
Fixed-width
Sometimes the data values are not separated by any special delimiting character, but rather arranged in a format such that the values belonging to a column are always place in the same columns of text characters. This is called a fixed-width format. Here is an example:
writeLines(readLines(con = "data/safari_fwf_missing.dat"))
Some make-believe safari data
Woohoo!!
date,wildebeest,laughing hyena,crocodile,weather,start,end,fun,guide
1/13/1999 12 none 2 sunny 7:21 am 4:14 pm yes Joshua Tebbs
4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm Y Edsel Peña
10/15/2010 3 6 rainy 8:12 am no Karl Bruce Gregory
3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm y Lianming Wang
2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm Yes Brian Habing
7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm No Edwards
We can use the read.fwf()
function to read this file. We must specify how many characters are alloted to each column using the widths =
option. Also important is the strip.white=T
option, which will remove from character strings the extra spaces that are read in when reading fixed-width data:
<- read.fwf(file = "data/safari_fwf_missing.dat",
safari widths = c(11,3,5,3,14,9,9,4,18),
header = T,
skip = 3,
na.strings = c("","."),
strip.white = T,
sep =",")
safari
date wildebeest laughing.hyena crocodile weather start end
1 1/13/1999 12 none 2 sunny 7:21 am 4:14 pm
2 4/28/2001 3 1 1 cloudy 6:25 am 12:33 pm
3 10/15/2010 3 6 NA rainy 8:12 am <NA>
4 3/02/2006 1 14 5 hot/sunny 7:15 am 3:12 pm
5 2/28/1988 2 6 3 partly cloudy 4:53 am 2:16 pm
6 7/14/2015 3 12 0 cloudy 5:47 am 3:46 pm
fun guide
1 yes Joshua Tebbs
2 Y Edsel Peña
3 no Karl Bruce Gregory
4 y Lianming Wang
5 Yes Brian Habing
6 No Edwards
Formatting data
Reading the data into R is only the first step in preparing the data for analysis. The statistician must check that numbers are treated as numbers, character strings as character strings, and may wish also to check ensure that dates and times are interpreted as such, rather than as arbitrary character strings.
Column classes
A good way to see the types, or more properly the classes assigned to the columns is to apply with sapply()
the class()
function to the data frame:
sapply(safari,class)
date wildebeest laughing.hyena crocodile weather
"character" "integer" "character" "integer" "character"
start end fun guide
"character" "character" "character" "character"
The reason laughing.hyena
is read in as character column is that it has the value "none"
on one line. Let’s overwrite "none"
with 0
, noting that when we do this, our 0
will be coerced to the character string "0"
, since the column is a character column; however, we can fix this by changing the class of this column to integer
, as below:
$laughing.hyena[safari$laughing.hyena=="none"] <- 0
safariclass(safari$laughing.hyena) <- "integer"
sapply(safari,class)
date wildebeest laughing.hyena crocodile weather
"character" "integer" "integer" "integer" "character"
start end fun guide
"character" "character" "character" "character"
The remaining columns appear to have been read in as expected. We may wish, however, that the years in the data column be recognized as such, whereas currently they are merely strings of characters which have no meaning to the software.
Dates
The first column of the safari data set contains dates, but they have been read in as character strings. We would like R to recognize them as dates. To convert character strings to dates we can use the as.Date()
function. By default the function assumes the format yyyy-mm-dd, but other formats can be specified:
as.Date("2014-07-05")
[1] "2014-07-05"
as.Date("7/5/2014") # will not be recognized correctly. Must specify a format!
[1] "0007-05-20"
as.Date("7/5/2014", format = "%m/%d/%Y")
[1] "2014-07-05"
as.Date("July 7, 2014", format = "%B %d, %Y")
[1] "2014-07-07"
Run ?as.Date
to read more about date formats.
A date object stores the date as the number of days which have passed since January 1st, 1970. Dates earlier than this are stored as negative values:
as.Date(0)
[1] "1970-01-01"
as.Date(1)
[1] "1970-01-02"
as.Date(-1)
[1] "1969-12-31"
There are several cool ways to play with dates in R:
<- Sys.Date() # today's date
today today
[1] "2025-09-11"
- 30 # the date 30 days ago today
[1] "2025-08-12"
seq(today,by="3 days",length = 12) # make a sequence of dates
[1] "2025-09-11" "2025-09-14" "2025-09-17" "2025-09-20" "2025-09-23"
[6] "2025-09-26" "2025-09-29" "2025-10-02" "2025-10-05" "2025-10-08"
[11] "2025-10-11" "2025-10-14"
weekdays(as.Date(c("2014-07-05","2017-07-05")))
[1] "Saturday" "Wednesday"
months(today) # extract the month from a date
[1] "September"
We can print a date in a different formats other than the default with the format()
function:
format(today,"%m/%d/%y")
[1] "09/11/25"
format(today,"%b %d, %Y")
[1] "Sep 11, 2025"
format(today,"%B %d, %Y")
[1] "September 11, 2025"
In the date column of the safari data set, the dates are recorded in many different formats. It takes a little bit of work to standardize the formats: Below, we use a loop to go through the entries of the date column to extract the date from each one with the as.Date()
function. We have to list all the different formats to try by using the tryFormats
option. Then we re-print the date in the format we want to keep. After the loop is done, we use as.Date()
again to convert the entire vector of character strings into a column having the date class.
To convert the character strings in the data column of the safari data set, which represent dates to our eyes but which are not yet interpreted as dates by our software, we can use the as.Date()
function, specifying the format in which they are written, to convert the character strings to actual date values. The code below overwrites the date colums with these true dates, and as a bonus, add to the data set a weekday column:
$date <- as.Date(safari$date, "%m/%d/%Y")
safari$day <- weekdays(safari$date)
safari safari
date wildebeest laughing.hyena crocodile weather start end
1 1999-01-13 12 0 2 sunny 7:21 am 4:14 pm
2 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm
3 2010-10-15 3 6 NA rainy 8:12 am <NA>
4 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm
5 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm
6 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm
fun guide day
1 yes Joshua Tebbs Wednesday
2 Y Edsel Peña Saturday
3 no Karl Bruce Gregory Friday
4 y Lianming Wang Thursday
5 Yes Brian Habing Sunday
6 No Edwards Tuesday
Now we see that the first column has the date class.
sapply(safari,class)
date wildebeest laughing.hyena crocodile weather
"Date" "integer" "integer" "integer" "character"
start end fun guide day
"character" "character" "character" "character" "character"
Now that we have actual dates in the data set, the dates can be interpreted as such by other functions, such as the plot()
function:
plot(wildebeest ~ date, data=safari)
Text processing
It is helpful to know a few tricks for dealing with text data. Here we will introduce a few functions that should come in handy.
Suppose we want to standardize the responses in the fun column of the safari data. One way to do this is to make a character vector containing all the strings we should interpret as “yes”; then we can use the operator %in%
to check, for each entry of the fun column of the safari data, if its value is one of the values in our vector. Thus:
<- c("y","yes","Y","Yes")
yes $fun <- safari$fun %in% yes
safari safari
date wildebeest laughing.hyena crocodile weather start end
1 1999-01-13 12 0 2 sunny 7:21 am 4:14 pm
2 2001-04-28 3 1 1 cloudy 6:25 am 12:33 pm
3 2010-10-15 3 6 NA rainy 8:12 am <NA>
4 2006-03-02 1 14 5 hot/sunny 7:15 am 3:12 pm
5 1988-02-28 2 6 3 partly cloudy 4:53 am 2:16 pm
6 2015-07-14 3 12 0 cloudy 5:47 am 3:46 pm
fun guide day
1 TRUE Joshua Tebbs Wednesday
2 TRUE Edsel Peña Saturday
3 FALSE Karl Bruce Gregory Friday
4 TRUE Lianming Wang Thursday
5 TRUE Brian Habing Sunday
6 FALSE Edwards Tuesday
Now the fun column is has logical values.
Next, suppose we want to re-format the start and end times so that the 24-hour clock is used (this would be helpful to us if we later want read these times as actual times instead of arbitrary character strings). We will write a function which does this, introducing a few helpful functions along the way.
The substr()
function extracts a substring from a character string:
substr("howdy", start = 2, stop = 3)
[1] "ow"
substr("howdy", start = 5, stop = 5)
[1] "y"
The regexpr()
function returns the location in a character strong of the first match to a given pattern.
regexpr("ow","howdy") # look for pattern "ow" in string "howdy"
[1] 2
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
regexpr("o","howdy-do")
[1] 2
attr(,"match.length")
[1] 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
The gregexr()
returns not just the position of the first match, but of every match:
gregexpr("o","howdy-do")
[[1]]
[1] 2 8
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
gregexpr("o[ ]","howdy-do to you") # returns the position of "o" only when it is followed by a space
[[1]]
[1] 8 11
attr(,"match.length")
[1] 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
The gregexpr()
function and other functions like it (run ?gregexpr()
to see other similar functions) allow the use of regular expressions to set patterns for which to look for matches. These have a standard syntax across many software programs and operating systems (so they are useful not just for R).
The function sprintf()
is used to print numbers as character strings in very specific formats:
<- exp(1)
e sprintf(e,fmt = "%f")
[1] "2.718282"
sprintf(e,fmt = "%.2f")
[1] "2.72"
sprintf(e,fmt = "%05.2f")
[1] "02.72"
sprintf(e,fmt = "%E")
[1] "2.718282E+00"
Print integer values with leading zeros:
sprintf(1,fmt = "%02d")
[1] "01"
sprintf(7,fmt = "%03d")
[1] "007"
Now we will use the substr()
, regexpr()
, and sprintf()
functions to build a function that converts times given as hh:mm with an am or pm afterwards to a 24-hour clock time in the format hh:mm.
<- function(x){
twelve2twentyfour
if(is.na(x)) return(NA)
<- regexpr(":",x) # get position of colon
colon <- as.numeric(substr(x,start = 1,stop = colon - 1))
hh <- substr(x,start = colon + 1, stop = colon + 2)
mm
<- regexpr("m",x) # get position of character "m"
m <- substr(x,start=m-1,stop = m-1)
ap if((ap == "p") & (hh != 12)) hh <- hh + 12
if((ap == "a") & (hh == 12)) hh <- hh - 12
<- paste(sprintf(hh,fmt = "%02d"),mm,sep=":")
val
return(val)
}
twelve2twentyfour("2:20 pm")
[1] "14:20"
twelve2twentyfour("12:20am")
[1] "00:20"
twelve2twentyfour(" 9:45 am")
[1] "09:45"
twelve2twentyfour("12:15 pm")
[1] "12:15"
Now we replace the start and end times in our safari data set with the newly formatted times by using the sapply()
function to apply our twelve2twentyfour()
function to each value in the start
and end
columns.
$start <- sapply(safari$start,twelve2twentyfour)
safari$end <- sapply(safari$end,twelve2twentyfour)
safari
safari
date wildebeest laughing.hyena crocodile weather start end
1 1999-01-13 12 0 2 sunny 07:21 16:14
2 2001-04-28 3 1 1 cloudy 06:25 12:33
3 2010-10-15 3 6 NA rainy 08:12 <NA>
4 2006-03-02 1 14 5 hot/sunny 07:15 15:12
5 1988-02-28 2 6 3 partly cloudy 04:53 14:16
6 2015-07-14 3 12 0 cloudy 05:47 15:46
fun guide day
1 TRUE Joshua Tebbs Wednesday
2 TRUE Edsel Peña Saturday
3 FALSE Karl Bruce Gregory Friday
4 TRUE Lianming Wang Thursday
5 TRUE Brian Habing Sunday
6 FALSE Edwards Tuesday
Another very useful function is the grep()
function which looks for a pattern in the entries of a character vector and returns in the indices in which the pattern was found. Below, we make a new logical column in the safari data set called “overcast” which will be TRUE
except when the pattern "sunny"
appears in the string describing the weather.
$overcast <- rep(TRUE,nrow(safari)) # set all equal to true
safari$overcast[grep("sunny",safari$weather)] <- FALSE # make false the sunny days
safari safari
date wildebeest laughing.hyena crocodile weather start end
1 1999-01-13 12 0 2 sunny 07:21 16:14
2 2001-04-28 3 1 1 cloudy 06:25 12:33
3 2010-10-15 3 6 NA rainy 08:12 <NA>
4 2006-03-02 1 14 5 hot/sunny 07:15 15:12
5 1988-02-28 2 6 3 partly cloudy 04:53 14:16
6 2015-07-14 3 12 0 cloudy 05:47 15:46
fun guide day overcast
1 TRUE Joshua Tebbs Wednesday FALSE
2 TRUE Edsel Peña Saturday TRUE
3 FALSE Karl Bruce Gregory Friday TRUE
4 TRUE Lianming Wang Thursday FALSE
5 TRUE Brian Habing Sunday TRUE
6 FALSE Edwards Tuesday TRUE
As another exercise, suppose we wish to replace the names in the guide column with the first initial, followed by a period, and then the last name after a space, where we do not change the name if only the last name is given. The code below makes a function to convert each name to this shortened version and applies the function to the guide column of the safari data set.
<- function(ch){
shortname
# check for spaces in the name
<- grep(" ",ch)
spaces if(length(spaces) == 0){# if no spaces
<- ch
val
else {
}
# find position of first character in last name
<- max(gregexpr(" ",ch)[[1]]) + 1
ln_start
# get last name
<- substr(ch,ln_start,nchar(ch))
ln
# get first initial
<- substr(ch,1,1)
finit
# put together
<- paste(finit,ln,sep = ". ")
val
}
return(val)
}
$guide <- sapply(safari$guide,shortname)
safari safari
date wildebeest laughing.hyena crocodile weather start end
1 1999-01-13 12 0 2 sunny 07:21 16:14
2 2001-04-28 3 1 1 cloudy 06:25 12:33
3 2010-10-15 3 6 NA rainy 08:12 <NA>
4 2006-03-02 1 14 5 hot/sunny 07:15 15:12
5 1988-02-28 2 6 3 partly cloudy 04:53 14:16
6 2015-07-14 3 12 0 cloudy 05:47 15:46
fun guide day overcast
1 TRUE J. Tebbs Wednesday FALSE
2 TRUE E. Peña Saturday TRUE
3 FALSE K. Gregory Friday TRUE
4 TRUE L. Wang Thursday FALSE
5 TRUE B. Habing Sunday TRUE
6 FALSE Edwards Tuesday TRUE
Date-times
Suppose we wish to create another column in the data set giving the duration of each safari, say in the format hh:mm. To compute the entries of this column we need to subtract the end times from the start times. Since we have now converted the times to 24-hour clock times, it will be simple to convert these new character strongs to actual calendar times.
The POSIXct
class stores calendar times as a number of seconds from the time 00:00:00 (midnight) in the timezone GMT on January 1st, 1970 (when the machines awoke…). The name “POSIX” refers to a set of standards designed to promote compatibility of code across operating systems, and “ct” stands for calendar time. The default format in which R prints objects of this class is yyyy-mm-dd hh:mm:ss tz, where tz is the time zone.
<- as.POSIXct(0,tz="GMT") # Greenwich mean time
time0 time0
[1] "1970-01-01 GMT"
+ 30 # thirty seconds later time0
[1] "1970-01-01 00:00:30 GMT"
- 60 # one minute earlier time0
[1] "1969-12-31 23:59:00 GMT"
# birth of my daughter Viva
<- as.POSIXct("2015-11-22 02:23", tz="CET")
viva as.numeric(viva) # number of seconds since "origin"
[1] 1448155380
print(viva,tz = "US/Eastern") # in EST
[1] "2015-11-21 20:23:00 EST"
print(viva,tz = "US/Central") # in CST
[1] "2015-11-21 19:23:00 CST"
We can read calender times in other formats using the strptime()
function:
# birth of my daughter Lois
<- strptime("7/5/2017 21:56", format = "%m/%d/%Y %H:%M")
lois lois
[1] "2017-07-05 21:56:00 EDT"
Note that the 24-hour clock is always used (not the 12-hour clock with “am” and “pm”). Since we have already converted the start and end times in the safari data to 24-hour times, we can use the paste()
function to construct character strings which can be interpreted as calendar times:
paste(safari$date,safari$start)
[1] "1999-01-13 07:21" "2001-04-28 06:25" "2010-10-15 08:12" "2006-03-02 07:15"
[5] "1988-02-28 04:53" "2015-07-14 05:47"
To convert these character strings to calendar times, it is most convenient to use the strptime()
function. Afterwards we can use the timediff()
function to get the difference in time between the calendar time representations of the start and end times. The below obtains these differences in minutes and then converts the number of minutes into the format hh:mm:
# get the start and end times
<- strptime(paste(safari$date,safari$start),"%Y-%m-%d %H:%M")
start <- strptime(paste(safari$date,safari$end),"%Y-%m-%d %H:%M")
end
# get the difference in number of minutes
<- as.integer(difftime(end,start,unit="min"))
minutes
# write this as hh:mm
<- paste(sprintf(floor(minutes/60),fmt="%02.f"),
duration sprintf(minutes %% 60,fmt="%02.f"),sep=":")
# find durations with string "NA" replace with real NA
grep(pattern = "NA",x = duration)] <- NA
duration[
# add a column to the data set
$duration <- duration
safari safari
date wildebeest laughing.hyena crocodile weather start end
1 1999-01-13 12 0 2 sunny 07:21 16:14
2 2001-04-28 3 1 1 cloudy 06:25 12:33
3 2010-10-15 3 6 NA rainy 08:12 <NA>
4 2006-03-02 1 14 5 hot/sunny 07:15 15:12
5 1988-02-28 2 6 3 partly cloudy 04:53 14:16
6 2015-07-14 3 12 0 cloudy 05:47 15:46
fun guide day overcast duration
1 TRUE J. Tebbs Wednesday FALSE 08:53
2 TRUE E. Peña Saturday TRUE 06:08
3 FALSE K. Gregory Friday TRUE <NA>
4 TRUE L. Wang Thursday FALSE 07:57
5 TRUE B. Habing Sunday TRUE 09:23
6 FALSE Edwards Tuesday TRUE 09:59
Sorting a data frame
We can use the sort_by()
function to sort the rows of a data frame according to the values in one (or more columns):
sort_by(safari, ~ date) # sort by the date column
date wildebeest laughing.hyena crocodile weather start end
5 1988-02-28 2 6 3 partly cloudy 04:53 14:16
1 1999-01-13 12 0 2 sunny 07:21 16:14
2 2001-04-28 3 1 1 cloudy 06:25 12:33
4 2006-03-02 1 14 5 hot/sunny 07:15 15:12
3 2010-10-15 3 6 NA rainy 08:12 <NA>
6 2015-07-14 3 12 0 cloudy 05:47 15:46
fun guide day overcast duration
5 TRUE B. Habing Sunday TRUE 09:23
1 TRUE J. Tebbs Wednesday FALSE 08:53
2 TRUE E. Peña Saturday TRUE 06:08
4 TRUE L. Wang Thursday FALSE 07:57
3 FALSE K. Gregory Friday TRUE <NA>
6 FALSE Edwards Tuesday TRUE 09:59
sort_by(safari, ~ fun + date) # sort first by fun, then by date within fun
date wildebeest laughing.hyena crocodile weather start end
3 2010-10-15 3 6 NA rainy 08:12 <NA>
6 2015-07-14 3 12 0 cloudy 05:47 15:46
5 1988-02-28 2 6 3 partly cloudy 04:53 14:16
1 1999-01-13 12 0 2 sunny 07:21 16:14
2 2001-04-28 3 1 1 cloudy 06:25 12:33
4 2006-03-02 1 14 5 hot/sunny 07:15 15:12
fun guide day overcast duration
3 FALSE K. Gregory Friday TRUE <NA>
6 FALSE Edwards Tuesday TRUE 09:59
5 TRUE B. Habing Sunday TRUE 09:23
1 TRUE J. Tebbs Wednesday FALSE 08:53
2 TRUE E. Peña Saturday TRUE 06:08
4 TRUE L. Wang Thursday FALSE 07:57