Jake Feldman
Let see examples in action on synthetic data frames.
#Creating the doctor df
doctor_id = c(210,211,212,213)
doc_name = c("Jon", "Peter", "Ke", "Pat")
degree = c("MD", "MBBS", "MD", "MD")
doctors = data.frame(doctor_id, doc_name, degree)
#Creating the visits df
doctor_id = c(210,214,215,212,212)
patient_name = c("Julia", "TJ", "John", "James", "Jason")
year = c(2015, 2014, 2015, 2016, 2012)
visits = data.frame(doctor_id, patient_name, year)
doctors
doctor_id doc_name degree
1 210 Jon MD
2 211 Peter MBBS
3 212 Ke MD
4 213 Pat MD
visits
doctor_id patient_name year
1 210 Julia 2015
2 214 TJ 2014
3 215 John 2015
4 212 James 2016
5 212 Jason 2012
For every visit, I would like to create a table with the doctor that saw the patient (If one exists).
# Full outer join. We get every pair of rows for the two dfs.
#I always start at the outer join and then filter
sqldf("SELECT * FROM visits JOIN doctors")
doctor_id patient_name year doctor_id doc_name degree
1 210 Julia 2015 210 Jon MD
2 210 Julia 2015 211 Peter MBBS
3 210 Julia 2015 212 Ke MD
4 210 Julia 2015 213 Pat MD
5 214 TJ 2014 210 Jon MD
6 214 TJ 2014 211 Peter MBBS
7 214 TJ 2014 212 Ke MD
8 214 TJ 2014 213 Pat MD
9 215 John 2015 210 Jon MD
10 215 John 2015 211 Peter MBBS
11 215 John 2015 212 Ke MD
12 215 John 2015 213 Pat MD
13 212 James 2016 210 Jon MD
14 212 James 2016 211 Peter MBBS
15 212 James 2016 212 Ke MD
16 212 James 2016 213 Pat MD
17 212 Jason 2012 210 Jon MD
18 212 Jason 2012 211 Peter MBBS
19 212 Jason 2012 212 Ke MD
20 212 Jason 2012 213 Pat MD
#Matching patients to the doctors they see (INNER JOIN)
#Note how we reference columns in different tables
sqldf("SELECT * FROM visits V JOIN doctors D ON V.doctor_id = D.doctor_id ")
doctor_id patient_name year doctor_id doc_name degree
1 210 Julia 2015 210 Jon MD
2 212 James 2016 212 Ke MD
3 212 Jason 2012 212 Ke MD
#Same thing but getting rid of redundant columns
sqldf("SELECT D.doctor_id, D.doc_name, V.patient_name, V.year FROM visits V JOIN doctors D ON V.doctor_id = D.doctor_id ")
doctor_id doc_name patient_name year
1 210 Jon Julia 2015
2 212 Ke James 2016
3 212 Ke Jason 2012
For every visit, I would like to create a table with the doctor that saw the patient (If one exists) in 2015.
#Add second condition in the WHERE, which filters after
#the join has been executed
sqldf("SELECT D.doctor_id, D.doc_name, V.patient_name, V.year FROM visits V JOIN doctors D ON V.doctor_id = D.doctor_id WHERE year = 2015 ")
doctor_id doc_name patient_name year
1 210 Jon Julia 2015
Find the number of patients served by each doctor
#Join with a group by. Gives the number of patients each doctor saw
sqldf("SELECT D.doctor_id, D.doc_name, COUNT(*) AS num_visits FROM visits V JOIN doctors D ON V.doctor_id = D.doctor_id GROUP BY D.doctor_id ")
doctor_id doc_name num_visits
1 210 Jon 1
2 212 Ke 2
For every visit, I would like to create a table with the doctor that saw the patient (Put blank if one does not exist).
#Getting all patients with doctors even if we don't know the doctor with a LEFT JOIN (We will get a null or blank in this case)
sqldf("SELECT D.doc_name, V.patient_name, V.year FROM doctors D LEFT JOIN visits V ON D.doctor_id = V.doctor_id ")
doc_name patient_name year
1 Jon Julia 2015
2 Peter <NA> NA
3 Ke James 2016
4 Ke Jason 2012
5 Pat <NA> NA
-Read in the teams data frame and add the names of the teams to your standings.
-We will get more practice with Joins (no joins on HW2).
-We will learn to write generic queries.
-HW2 considers data from passengers on the Titanic.
We want to be able to predict which passengers survived
We’ll try to figure out which features are most important in making this prediction. This step is a prerequisite to building some sort of machine learning algorithm.