Data Computing Chapter 10
October 5, 2016
Glyph-ready data often combines data from different sources.
MedicareProviders
: Name and locationDirectRecoveryGroups
: Descriptions of standardized medical proceduresMedicareCharges
: Charges and payments for different DRGs by different providersZipDemography
: Population and age structure in each ZIP code.Storing data in separate tables can be beneficial even when the data are coming from the same source:
Strategy: Don’t even try to smash all data into one big table. Instead, join related tables as needed.
Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
sid | grade | sessionID | |
---|---|---|---|
2197 | S31680 | B | session3518 |
259 | S31242 | B | session2897 |
4188 | S32127 | A | session2002 |
3880 | S32058 | A- | session2952 |
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
sessionID | dept | level | sem | enroll | iid | |
---|---|---|---|---|---|---|
640 | session2568 | J | 100 | FA2002 | 15 | inst223 |
76 | session1940 | d | 100 | FA2000 | 16 | inst409 |
1218 | session3242 | m | 200 | SP2004 | 30 | inst476 |
A join is a data verb that combines two tables.
There are several kinds of join.
Goal: Figure out the average class size seen by each student at a small college.
enroll
comes from Courses
table.sid
) comes from Grades
.sessionID
is in both tables.Grades %>%
left_join(Courses) %>% sample_n(size=4)
sid | grade | sessionID | dept | level | sem | enroll | iid | |
---|---|---|---|---|---|---|---|---|
4043 | S32061 | B+ | session3644 | K | 300 | SP2005 | 18 | inst239 |
1536 | S31518 | AU | session2901 | J | 100 | FA2003 | 22 | inst224 |
1844 | S31587 | B- | session3191 | b | 200 | SP2004 | 10 | inst403 |
3593 | S31962 | B | session3821 | W | 300 | SP2005 | 24 | inst138 |
Once Courses
and Grades
are joined, it’s straightforward to find the average enrollment seen by each student.
AveClassEachStudent <- Grades %>%
left_join(Courses) %>%
group_by(sid) %>%
summarise(ave_enroll = mean(enroll, na.rm=TRUE))
sid | ave_enroll |
---|---|
S31626 | 27.64286 |
S32007 | 22.23077 |
S31905 | 26.73333 |
A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.
Example:
Grades %>%
left_join(Courses, by = c(sessionID = "sessionID")) %>%
head(4)
sid | grade | sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|---|---|
S31185 | D+ | session1784 | M | 100 | FA1991 | 22 | inst265 |
S31185 | B+ | session1785 | k | 100 | FA1991 | 52 | inst458 |
S31185 | A- | session1791 | J | 100 | FA1993 | 22 | inst223 |
S31185 | B+ | session1792 | J | 300 | FA1993 | 20 | inst235 |
The default value of by=
is all variables with the same names in both tables.
This is not reliable unless you’ve checked.
Different kinds of join have different answers to these questions.
Most popular joins: left_join()
and inner_join()
left_join()
Keep the left case and fill in the new variables with NAinner_join()
Discard the left case.Less popular joins:
full_join()
Keep left case as well as unmatched right cases.semi_join()
Discard the left case.anti_join()
Keep the left case but discard any left case with a match in the right tableleft_join()
and inner_join()
do the same thing:
left_join()
Keep all combinations.inner_join()
Keep all combinations.Less popular joins:
full_join()
Keep all combinations.semi_join()
Keep just one copy of the left case.anti_join()
Discard the left case.Here are three data tables relating student grades in courses at their college
Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
sid | grade | sessionID |
---|---|---|
S31185 | D+ | session1784 |
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|
session1784 | M | 100 | FA1991 | 22 | inst265 |
GradePoint <- read.file("http://tiny.cc/mosaic/grade-to-number.csv")
grade | gradepoint |
---|---|
A | 4 |
For each of these, say what tables you would need to join and what the corresponding variables will be.
The following steps must be clearly shown in the file you submit so they can be graded as follows:
Assignment is worth a total of 10 points
OrdwayBirds
& number of distinct species in OrdwaySpeciesNames