Joins 1

Jake Feldman

Basic Joins

Let see examples in action on synthetic data frames.

#Creating the doctor df
doctor_id = c(210,211,212,213)
doc_name = c("Jon", "Peter", "Ke", "Pat")
degree = c("MD", "MBBS", "MD", "MD")
doctors = data.frame(doctor_id, doc_name, degree)

#Creating the visits df
doctor_id = c(210,214,215,212,212)
patient_name = c("Julia", "TJ", "John", "James", "Jason")
year = c(2015, 2014, 2015, 2016, 2012)
visits = data.frame(doctor_id, patient_name, year)

doctors 
  doctor_id doc_name degree
1       210      Jon     MD
2       211    Peter   MBBS
3       212       Ke     MD
4       213      Pat     MD
visits
  doctor_id patient_name year
1       210        Julia 2015
2       214           TJ 2014
3       215         John 2015
4       212        James 2016
5       212        Jason 2012

For every visit, I would like to create a table with the doctor that saw the patient (If one exists).

# Full outer join.  We get every pair of rows for the two dfs.
#I always start at the outer join and then filter
sqldf("SELECT * FROM visits JOIN doctors")
   doctor_id patient_name year doctor_id doc_name degree
1        210        Julia 2015       210      Jon     MD
2        210        Julia 2015       211    Peter   MBBS
3        210        Julia 2015       212       Ke     MD
4        210        Julia 2015       213      Pat     MD
5        214           TJ 2014       210      Jon     MD
6        214           TJ 2014       211    Peter   MBBS
7        214           TJ 2014       212       Ke     MD
8        214           TJ 2014       213      Pat     MD
9        215         John 2015       210      Jon     MD
10       215         John 2015       211    Peter   MBBS
11       215         John 2015       212       Ke     MD
12       215         John 2015       213      Pat     MD
13       212        James 2016       210      Jon     MD
14       212        James 2016       211    Peter   MBBS
15       212        James 2016       212       Ke     MD
16       212        James 2016       213      Pat     MD
17       212        Jason 2012       210      Jon     MD
18       212        Jason 2012       211    Peter   MBBS
19       212        Jason 2012       212       Ke     MD
20       212        Jason 2012       213      Pat     MD
#Matching patients to the doctors they see (INNER JOIN)
#Note how we reference columns in different tables
sqldf("SELECT * FROM visits V JOIN doctors D ON V.doctor_id = D.doctor_id ")
  doctor_id patient_name year doctor_id doc_name degree
1       210        Julia 2015       210      Jon     MD
2       212        James 2016       212       Ke     MD
3       212        Jason 2012       212       Ke     MD
#Same thing but getting rid of redundant columns
sqldf("SELECT D.doctor_id, D.doc_name, V.patient_name, V.year FROM visits V JOIN doctors D ON V.doctor_id = D.doctor_id ")
  doctor_id doc_name patient_name year
1       210      Jon        Julia 2015
2       212       Ke        James 2016
3       212       Ke        Jason 2012

For every visit, I would like to create a table with the doctor that saw the patient (If one exists) in 2015.

#Add second condition in the WHERE, which filters after
#the join has been executed
sqldf("SELECT D.doctor_id, D.doc_name, V.patient_name, V.year FROM visits V JOIN doctors D ON V.doctor_id = D.doctor_id WHERE year = 2015 ")
  doctor_id doc_name patient_name year
1       210      Jon        Julia 2015

Find the number of patients served by each doctor

#Join with a group by. Gives the number of patients each doctor saw
sqldf("SELECT D.doctor_id, D.doc_name, COUNT(*) AS num_visits FROM visits V JOIN doctors D ON V.doctor_id = D.doctor_id GROUP BY D.doctor_id ")
  doctor_id doc_name num_visits
1       210      Jon          1
2       212       Ke          2

For every visit, I would like to create a table with the doctor that saw the patient (Put blank if one does not exist).

#Getting all patients with doctors even if we don't know the doctor with a LEFT JOIN (We will get a null or blank in this case)
sqldf("SELECT  D.doc_name, V.patient_name, V.year FROM doctors D LEFT JOIN visits V ON D.doctor_id = V.doctor_id ")
  doc_name patient_name year
1      Jon        Julia 2015
2    Peter         <NA>   NA
3       Ke        James 2016
4       Ke        Jason 2012
5      Pat         <NA>   NA

-Read in the teams data frame and add the names of the teams to your standings.

Next Lecture + HW2

-We will get more practice with Joins (no joins on HW2).

-We will learn to write generic queries.

-HW2 considers data from passengers on the Titanic.