SQL basics

Author

Karl Gregory

In this note we will access a database made available as a companion to the textbook Modern Data Science with R by Baumer, Kaplan, and Horton (2021). This note follows very closely a note prepared by Dr. Will Consagra. More examples of SQL queries on this database can be found here.

A database is a collection of (usually large) data sets, not stored on your local machine, and organized a way that facilitates retrieval of selected subsets of the data. SQL stands for structured query language and is a language used to retrieve data stored in a database. SQL commands are not executed by your local machine, but rather by the machine storing the data, and the results of the “queries” submitted in SQL are sent to your machine. This allows you to work with massive data sets without storing them on your own computer.

The first step in retrieving data with SQL queries is to create a connection to a database. We will do this from within R using the library mdsr, which enables connection to a database called “airlines” stored on an Amazon Web Services server. The following R commands establish the connection:

library(mdsr) 
db <- dbConnect_scidb("airlines") # establish a connection to the database "airlines"

The airlines database contains information on about 18 million flights between US airports taking place during the years 2013–2015.

Exploring a database

Now that we have created a connection to a database, we can submit SQL queries to retrieve data from it. If we are compiling a .qmd document in Rstudio, we can execute SQL in a code chunk. We just need to open the code chunk with {sql connection=<db>}, where <db> is whatever you have named the database connection.

The data sets in a database are called tables. To see a list of the tables, we can submit the following SQL command:

SHOW TABLES
5 records
Tables_in_airlines
airports
carriers
flights
flights_summary
planes

We see that there are five tables, or data sets in the “airlines” database.

In order to get a summary of a particular table, we can use DESCRIBE:

DESCRIBE flights
Displaying records 1 - 10
Field Type Null Key Default Extra
year smallint(4) YES NA
month smallint(2) YES NA
day smallint(2) YES NA
dep_time smallint(4) YES NA
sched_dep_time smallint(4) YES NA
dep_delay smallint(4) YES NA
arr_time smallint(4) YES NA
sched_arr_time smallint(4) YES NA
arr_delay smallint(4) YES NA
carrier varchar(2) NO
DESCRIBE carriers
2 records
Field Type Null Key Default Extra
carrier varchar(7) NO PRI
name varchar(255) NO
DESCRIBE planes
9 records
Field Type Null Key Default Extra
tailnum varchar(6) NO PRI
year int(11) YES NA
type text YES NA
manufacturer text YES NA
model text YES NA
engines int(11) YES NA
seats int(11) YES NA
speed int(11) YES NA
engine text YES NA
DESCRIBE airports
9 records
Field Type Null Key Default Extra
faa varchar(3) NO PRI
name varchar(255) YES NA
lat decimal(10,7) YES NA
lon decimal(10,7) YES NA
alt int(11) YES NA
tz smallint(4) YES NA
dst char(1) YES NA
city varchar(255) YES NA
country varchar(255) YES NA

The two pdfs airports_table_schema and flights_table_schema give descriptions of the airports and flights tables in the airlines database.

Retrieving data

An SQL query can have the following commands:

  • SELECT: Set which fields, i.e. variables/columns, from which to retrieve values.
  • FROM: Set table or tables you want to retrieve values from.
  • JOIN: Merge tables together.
  • WHERE: Retrieve records (rows of data) conditionally .
  • GROUP BY: Aggregate records.
  • HAVING: Conditionally output records based on newly created variables.
  • ORDER BY: Sort the records in the output data set.
  • LIMIT: Set the number of records to output.

We will go through each of these.

SELECT and FROM

Every SQL query must have a SELECT and a FROM command.

To get the names of the carriers in the “carriers” table, we can run the following:

SELECT name
FROM carriers
Displaying records 1 - 10
name
Titan Airways
Tradewind Aviation
Comlux Aviation, AG
Master Top Linhas Aereas Ltd.
Flair Airlines Ltd.
Swift Air, LLC
DCA
ACM AIR CHARTER GmbH
Inter Island Airways, d/b/a Inter Island Air
Polar Airlines de Mexico d/b/a Nova Air

Here we pull several fields from the “flights” table, but we limit the number of records to pull so we don’t get an enormous amount of data:

SELECT carrier, origin, dest, dep_time, year, month, day
FROM flights 
LIMIT 20
Displaying records 1 - 10
carrier origin dest dep_time year month day
AA LAX DFW 2 2013 10 1
FL SFO ATL 4 2013 10 1
AA SFO DFW 11 2013 10 1
AA SEA ORD 14 2013 10 1
UA LAX IAH 16 2013 10 1
UA SFO IAH 22 2013 10 1
US LAX CLT 29 2013 10 1
AS ANC SEA 29 2013 10 1
DL SEA MSP 31 2013 10 1
DL LAX MSP 32 2013 10 1

We can also construct new variables out of the ones in the table. One way to do this is with CONCAT and AS:

SELECT CONCAT(name,' (',faa,')') AS airport, city AS location
FROM airports
ORDER BY name
LIMIT 10
Displaying records 1 - 10
airport location
Aberdeen Regional Airport (ABR) Aberdeen
Abilene Rgnl (ABI) Abilene
Abraham Lincoln Capital (SPI) Springfield
Acadiana Rgnl (ARA) Louisiana
Adak Airport (ADK) Adak Island
Adams Fld (LIT) Little Rock
Addison (ADS) Addison
Adirondack Regional Airport (SLK) Saranac Lake
Akhiok Airport (AKK) Akhiok
Akiak Airport (AKI) Akiak

Note that in the above query we also used AS to rename a field. We also used ORDER BY to sort the resulting table.

WHERE

Now suppose we only want flights departing from Dallas/Forth-Worth (DFW). We can use WHERE as follows:

SELECT carrier, origin, dest, year, month, day
FROM flights 
WHERE origin = 'DFW' 
LIMIT 20
Displaying records 1 - 10
carrier origin dest year month day
AA DFW MIA 2014 10 1
US DFW CLT 2013 10 1
US DFW CLT 2014 10 1
OO DFW IAH 2013 10 1
AA DFW ORD 2015 10 1
AA DFW CLT 2015 10 1
DL DFW ATL 2013 10 1
DL DFW ATL 2014 10 1
UA DFW ORD 2014 10 1
AA DFW MIA 2013 10 1

We can use AND to add more conditions:

SELECT carrier, origin, dest, year, month, day
FROM flights 
WHERE origin = 'DFW' AND year = 2013 AND month = 1 AND day = 15
LIMIT 20
Displaying records 1 - 10
carrier origin dest year month day
US DFW CLT 2013 1 15
OO DFW IAH 2013 1 15
AA DFW ORD 2013 1 15
AA DFW MIA 2013 1 15
AA DFW LGA 2013 1 15
DL DFW ATL 2013 1 15
OO DFW LAX 2013 1 15
MQ DFW HOU 2013 1 15
OO DFW ORD 2013 1 15
US DFW PHX 2013 1 15

We can also use OR. Here we collect all the flight departing or arriving at the Columbia Metropolitan Airport (CAE) on a particular day. Note that we can use parentheses to enforce an order of operations:

SELECT carrier, origin, dest, year, month, day
FROM flights 
WHERE (origin = 'CAE' OR dest = 'CAE') AND year = 2013 AND month = 1 AND day = 15
LIMIT 20
Displaying records 1 - 10
carrier origin dest year month day
EV CAE ATL 2013 1 15
EV CAE IAD 2013 1 15
MQ CAE DFW 2013 1 15
EV CAE ORD 2013 1 15
DL CAE ATL 2013 1 15
EV CAE ATL 2013 1 15
MQ DFW CAE 2013 1 15
EV CAE ATL 2013 1 15
EV CAE ORD 2013 1 15
EV ATL CAE 2013 1 15

We can convert the year, month, and day information to an actual date using the STR_TO_DATE function. Let’s consider flights departing from CAE with destination IAH.

SELECT carrier, origin, dest, STR_TO_DATE(CONCAT(year,'/', month,'/', day),"%Y/%m/%d") AS date
FROM flights 
WHERE origin = 'CAE' AND dest = 'IAH'
LIMIT 50
Displaying records 1 - 10
carrier origin dest date
EV CAE IAH 2015-10-01
EV CAE IAH 2014-10-01
EV CAE IAH 2013-10-01
EV CAE IAH 2014-10-02
EV CAE IAH 2015-10-02
EV CAE IAH 2013-10-02
EV CAE IAH 2014-10-03
EV CAE IAH 2013-10-03
EV CAE IAH 2015-10-04
EV CAE IAH 2013-10-04

Get a list of mile-high airports and sort them by altitude in descending order:

SELECT faa, name, alt
FROM airports
WHERE alt >= 5280
ORDER BY alt DESC
LIMIT 20
Displaying records 1 - 10
faa name alt
TEX Telluride 9078
TVL Lake Tahoe Airport 8544
ASE Aspen Pitkin County Sardy Field 7820
GUC Gunnison - Crested Butte 7678
BCE Bryce Canyon 7590
ALS San Luis Valley Regional Airport 7539
LAR Laramie Regional Airport 7284
LAM Los Alamos Airport 7171
EVW Evanston-Uinta CO Burns Fld 7143
MMH Mammoth Yosemite Airport 7128

Operators BETWEEN, DISTINCT, and IN

The BETWEEN operator lets us specify a range of values:

SELECT CONCAT(name,' (',faa,')') AS airport, city, lat, lon
FROM airports
WHERE lon BETWEEN -90 AND -80 AND lat BETWEEN 45 AND 50
ORDER BY airport
Displaying records 1 - 10
airport city lat lon
Alpena County Regional Airport (APN) Alpena 45.07810 -83.56030
Charlevoix Municipal Airport (CVX) Charelvoix 45.30478 -85.27533
Chippewa County International Airport (CIU) Sault Ste Marie 46.25080 -84.47240
Delta County Airport (ESC) Escanaba 45.72278 -87.09361
Drummond Island Airport (DRM) Drummond Island 46.00931 -83.74393
Eagle River (EGV) Eagle River 45.93233 -89.26828
Ephraim-Gibraltar Airport (3D2) Ephraim 45.13542 -87.18794
Ford Airport (IMT) Iron Mountain 45.81836 -88.11456
Houghton County Memorial Airport (CMX) Hancock 47.16840 -88.48910
Lakeland (ARV) Minocqua - Woodruff 45.92778 -89.73083

Note that we do not have to SELECT the fields used in the WHERE statement.

The DISTINCT operator selects only the unique values of a field. We can use DISTINCT to see that all the airports in the airports table are within the United States.

SELECT DISTINCT country
FROM airports
1 records
country
United States

Here are all the distinct destinations flown to from CAE in the year 2015:

SELECT DISTINCT dest AS Destination
FROM flights
WHERE origin = 'CAE' AND year = 2015
6 records
Destination
ATL
LGA
ORD
DFW
IAH
IAD

The IN operator can be used in the WHERE statement in order to keep a record if a field has one of several specified values:

SELECT flight, origin, dest, 
       STR_TO_DATE(CONCAT(year,'/', month,'/', day),"%Y/%m/%d") AS date
FROM flights
WHERE origin = 'CAE' AND dest IN ('IAH','DFW') AND year = 2014
LIMIT 30
Displaying records 1 - 10
flight origin dest date
3663 CAE DFW 2014-10-01
3544 CAE DFW 2014-10-01
4253 CAE IAH 2014-10-01
3492 CAE DFW 2014-10-01
3663 CAE DFW 2014-10-02
3544 CAE DFW 2014-10-02
4253 CAE IAH 2014-10-02
3492 CAE DFW 2014-10-02
3663 CAE DFW 2014-10-03
3544 CAE DFW 2014-10-03

GROUP BY

We can use GROUP BY to aggregate groups of records sharing the same value of a field. Here we compute the total number of flights leaving each airport in 2013:

SELECT origin, COUNT(*) AS Departures 
FROM flights
WHERE year = 2013
GROUP BY origin
ORDER BY Departures DESC
LIMIT 20
Displaying records 1 - 10
origin Departures
ATL 398878
ORD 306439
DFW 286427
DEN 226597
LAX 224821
IAH 181071
PHX 177027
SFO 168070
DTW 163550
CLT 148745

We can put list several fields after GROUP BY. Below we obtain the number of flights departing CAE on each day in the year 2013:

SELECT origin, COUNT(*) AS numDepartures, STR_TO_DATE(CONCAT(year,'/', month,'/', day),"%Y/%m/%d") AS theDate
FROM flights
WHERE year = 2013 AND origin = 'CAE'
GROUP BY year, month, day
Displaying records 1 - 10
origin numDepartures theDate
CAE 11 2013-01-01
CAE 18 2013-01-02
CAE 18 2013-01-03
CAE 18 2013-01-04
CAE 14 2013-01-05
CAE 15 2013-01-06
CAE 19 2013-01-07
CAE 18 2013-01-08
CAE 17 2013-01-09
CAE 19 2013-01-10

Besides simply counting the records in each group, we can compute, for example the average number of minutes departures were delayed for each carrier:

SELECT origin, carrier, AVG(dep_delay) AS avg_delay
FROM flights
WHERE year = 2013 AND origin = 'CAE'
GROUP BY carrier
4 records
origin carrier avg_delay
CAE 9E 14.5786
CAE DL 5.8581
CAE EV 14.1871
CAE MQ 7.4855

HAVING

We use HAVING to obtain a subset of records, similarly to how we use WHERE, but we let HAVING operate on newly created variables.

Here we select carriers with over 500 flights departing 2013 from any of the airports CAE, GSP, and CHS:

SELECT carrier, COUNT(*) AS departures
FROM flights
WHERE year = 2013 AND origin IN ('CAE','GSP','CHS')
GROUP BY carrier
HAVING departures > 500 -- HAVING uses new variables you have defined
ORDER BY departures DESC
LIMIT 20
7 records
carrier departures
EV 13105
DL 5552
WN 4612
MQ 3469
9E 2947
YV 2345
B6 959

In the above, we could move everything in the WHERE statement to the HAVING statement, but it will run slower!! So it is best practice to use HAVING only on newly created variables.

Merging tables

We can write SQL commands to merge data sets in an inner join, a left join, or a right join. One has to specify the “left” table, the type of join, the “right” table, and then instructions for how to match records in the left and right tables.

This info goes on the FROM line like this: FROM (left table) (type of join) (right table) ON (conditions for matching records).

Inner join

Suppose we want to pull into the last table the full names of the carriers, which must be retrieved from the carriers table. We can use an inner join to do this. Note that we must now refer to each field relative to its table as “table.field”:

SELECT carriers.name AS carrier, COUNT(*) AS departures
FROM flights JOIN carriers ON flights.carrier = carriers.carrier
WHERE flights.year = 2013 AND flights.origin IN ('CAE','GSP','CHS')
GROUP BY flights.carrier
HAVING departures > 500 -- HAVING uses new variables you have defined
7 records
carrier departures
Endeavor Air Inc. 2947
JetBlue Airways 959
Delta Air Lines Inc. 5552
ExpressJet Airlines Inc. 13105
Envoy Air 3469
Southwest Airlines Co. 4612
Mesa Airlines Inc. 2345

In the FROM line we can designate aliases for the tables in order to make our code more concise. Here is the same query as above:

SELECT c.name AS carrier, COUNT(*) AS departures
FROM flights AS f 
JOIN carriers AS c ON f.carrier = c.carrier
WHERE f.year = 2013 AND f.origin IN ('CAE','GSP','CHS')
GROUP BY f.carrier
HAVING departures > 500 -- HAVING uses new variables you have defined
ORDER BY departures DESC
7 records
carrier departures
ExpressJet Airlines Inc. 13105
Delta Air Lines Inc. 5552
Southwest Airlines Co. 4612
Envoy Air 3469
Endeavor Air Inc. 2947
Mesa Airlines Inc. 2345
JetBlue Airways 959

Returning to the table showing the average departure delay time for each carrier leaving from CAE, we now merge with this the carriers table in order to display the full name of each carrier. We now group the results by the year as well as by the carrier:

SELECT f.origin, f.year, c.name, AVG(f.dep_delay) AS avg_delay
FROM flights AS f
JOIN carriers AS c ON f.carrier = c.carrier
WHERE f.origin = 'CAE'
GROUP BY f.carrier, f.year
Displaying records 1 - 10
origin year name avg_delay
CAE 2013 Endeavor Air Inc. 14.5786
CAE 2013 Delta Air Lines Inc. 5.8581
CAE 2014 Delta Air Lines Inc. 5.1412
CAE 2015 Delta Air Lines Inc. 4.0054
CAE 2013 ExpressJet Airlines Inc. 14.1871
CAE 2014 ExpressJet Airlines Inc. 13.6217
CAE 2015 ExpressJet Airlines Inc. 11.1207
CAE 2013 Envoy Air 7.4855
CAE 2014 Envoy Air 10.3580
CAE 2015 Envoy Air 14.7472

Here we obtain a summary of the kinds of planes departing CAE by merging the flights table with the planes table:

SELECT p.model, p.manufacturer, p.seats, COUNT(*) departures
FROM flights AS f
JOIN planes AS p ON f.tailnum = p.tailnum
WHERE f.origin = 'CAE' AND f.year = '2013'
GROUP BY p.model
ORDER BY departures DESC
LIMIT 20
Displaying records 1 - 10
model manufacturer seats departures
CL-600-2B19 CANADAIR 55 2455
EMB-145LR EMBRAER 55 537
MD-88 MCDONNELL DOUGLAS AIRCRAFT CO 142 325
DC-9-51 MCDONNELL DOUGLAS 139 206
EMB-145XR EMBRAER 55 156
CL-600-2C10 BOMBARDIER INC 80 118
CL-600-2D24 BOMBARDIER INC 95 51
A319-114 AIRBUS 145 44
A320-211 AIRBUS INDUSTRIE 182 17
A320-212 AIRBUS 182 11

As another example, we next make a table of flights which has additional information about the airports not found in the flights table—namely the full name of the airport and its altitude, which we must pull from the airports table. We return only flights from airports at altitudes above 1 mile:

SELECT f.origin, a.name, f.dest, f.carrier, a.alt -- use table aliases here which are defined in FROM
FROM flights AS f JOIN airports AS a ON f.origin = a.faa
WHERE a.alt > 5280
LIMIT 20
Displaying records 1 - 10
origin name dest carrier alt
DEN Denver Intl JFK B6 5431
DEN Denver Intl CLT US 5431
DEN Denver Intl IAH UA 5431
DRO Durango La Plata Co DEN EV 6685
COS City Of Colorado Springs Muni ORD OO 6187
DEN Denver Intl PHX US 5431
CPR Natrona Co Intl SLC OO 5347
DEN Denver Intl IAD UA 5431
DEN Denver Intl PHX WN 5431
ABQ Albuquerque International Sunport SLC OO 5355

We can merge more than two data sets by adding more JOIN commands. Below we list flights departing in January of 2013 from mile-high airports in planes with over 100 seats, along with the full names of the carriers of these flights:

SELECT a.name AS origin, f.dest,
STR_TO_DATE(CONCAT(f.year,'/', f.month,'/', f.day),"%Y/%m/%d") AS date,
CONCAT(LPAD(f.hour,2,'0'),':',LPAD(f.minute,2,'0')) AS sched_dep_time, -- format scheduled departure time as HH:MM
c.name AS carrier, p.model, p.seats
FROM flights AS f
JOIN airports AS a ON f.origin = a.faa
JOIN carriers AS c ON f.carrier = c.carrier
JOIN planes AS p ON f.tailnum = p.tailnum
WHERE a.alt > 5280 AND f.year = '2013' AND f.month = '1' AND p.seats > 100
LIMIT 20
Displaying records 1 - 10
origin dest date sched_dep_time carrier model seats
Denver Intl PHL 2013-01-01 07:00 US Airways Inc. A320-214 182
Denver Intl PHL 2013-01-02 14:15 US Airways Inc. A320-214 182
Denver Intl CLT 2013-01-06 02:00 US Airways Inc. A320-214 182
Denver Intl PHL 2013-01-20 07:00 US Airways Inc. A320-214 182
Denver Intl PHL 2013-01-07 11:05 US Airways Inc. A320-214 182
Denver Intl SFO 2013-01-03 14:35 United Air Lines Inc. 737-824 149
Denver Intl SFO 2013-01-07 14:35 United Air Lines Inc. 737-824 149
Yampa Valley IAH 2013-01-21 14:07 United Air Lines Inc. 737-824 149
Denver Intl IAD 2013-01-27 17:50 United Air Lines Inc. 737-824 149
Denver Intl OMA 2013-01-29 19:39 United Air Lines Inc. 737-824 149

We can do multiple joins involving the same data sets. For example, if I want a table giving the full name of both the origin and destination airport of several flights, I can join twice on the airports table. The query below is an example of this.

SELECT f.flight, o.name AS origin, d.name AS destination, p.model
FROM flights AS f 
JOIN airports AS o ON f.origin = o.faa 
JOIN airports AS d ON f.dest = d.faa 
JOIN planes AS p on f.tailnum = p.tailnum
WHERE p.model LIKE '737%' AND f.distance < 500 AND f.year = 2013 AND f.month = 1 AND f.day = 15
LIMIT 20
Displaying records 1 - 10
flight origin destination model
54 Louis Armstrong New Orleans Intl George Bush Intercontinental 737-824
1116 Newark Liberty Intl General Edward Lawrence Logan Intl 737-824
1703 Newark Liberty Intl General Edward Lawrence Logan Intl 737-824
1211 Denver Intl Will Rogers World 737-824
1290 Chicago Ohare Intl Port Columbus Intl 737-824
1285 Cleveland Hopkins Intl Newark Liberty Intl 737-724
1289 Newark Liberty Intl General Edward Lawrence Logan Intl 737-824
158 Hartsfield Jackson Atlanta Intl Tampa Intl 737-76N
552 Tampa Intl Hartsfield Jackson Atlanta Intl 737-76N
1623 General Edward Lawrence Logan Intl Newark Liberty Intl 737-824

The above query used LIKE, which allows you to search for a substring match. LIKE '737%' matches when a string begins with '737', LIKE '%737' would match on strings ending in '737', and LIKE '%737%' would match when a string contained '737' anywhere.

Left and right joins

An inner join, such as we have performed above, will only include in the output table records for which a match was found. Suppose a flight in the flights table had a carrier which does not appear in the carriers table; in this case, the record would be dropped, or deleted. If we wish to keep all the flights regardless of whether the carrier appears in the carriers table, we can do a left join, which will keep all the records in the left data set.

SELECT f.origin, a.name AS airport, a.alt, c.name AS carrier, f.dest
FROM flights AS f 
JOIN airports AS a ON f.origin = a.faa
LEFT JOIN carriers AS c ON f.carrier = c.carrier 
WHERE a.alt > 5280
LIMIT 20
Displaying records 1 - 10
origin airport alt carrier dest
DEN Denver Intl 5431 JetBlue Airways JFK
DEN Denver Intl 5431 US Airways Inc. CLT
DEN Denver Intl 5431 United Air Lines Inc. IAH
DRO Durango La Plata Co 6685 ExpressJet Airlines Inc. DEN
COS City Of Colorado Springs Muni 6187 SkyWest Airlines Inc. ORD
DEN Denver Intl 5431 US Airways Inc. PHX
CPR Natrona Co Intl 5347 SkyWest Airlines Inc. SLC
DEN Denver Intl 5431 United Air Lines Inc. IAD
DEN Denver Intl 5431 Southwest Airlines Co. PHX
ABQ Albuquerque International Sunport 5355 SkyWest Airlines Inc. SLC

A right join works in the same way.

References

Baumer, Benjamin S., Daniel T. Kaplan, and Nicholas J. Horton. 2021. Modern Data Science with R. 2nd ed. Boca Raton: Chapman; Hall/CRC Press. https://www.routledge.com/Modern-Data-Science-with-R/Baumer-Kaplan-Horton/p/book/9780367191498.