Combining data from different sources

Glyph-ready data often combines data from different sources.


Example: Medicare data


Relational databases

Storing data in separate tables can be beneficial even when the data are coming from the same source:

Strategy: Don’t even try to smash all data into one big table. Instead, join related tables as needed.


Example: Grades and Enrollment

Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
sid grade sessionID
2197 S31680 B session3518
259 S31242 B session2897
4188 S32127 A session2002
3880 S32058 A- session2952
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
sessionID dept level sem enroll iid
640 session2568 J 100 FA2002 15 inst223
76 session1940 d 100 FA2000 16 inst409
1218 session3242 m 200 SP2004 30 inst476

Joins

A join is a data verb that combines two tables.

There are several kinds of join.


Example: Average class size

Goal: Figure out the average class size seen by each student at a small college.

Grades %>%
  left_join(Courses) %>% sample_n(size=4)
sid grade sessionID dept level sem enroll iid
4043 S32061 B+ session3644 K 300 SP2005 18 inst239
1536 S31518 AU session2901 J 100 FA2003 22 inst224
1844 S31587 B- session3191 b 200 SP2004 10 inst403
3593 S31962 B session3821 W 300 SP2005 24 inst138

Once Courses and Grades are joined, it’s straightforward to find the average enrollment seen by each student.

AveClassEachStudent <- Grades %>% 
  left_join(Courses) %>%
  group_by(sid) %>%
  summarise(ave_enroll = mean(enroll, na.rm=TRUE))
sid ave_enroll
S31626 27.64286
S32007 22.23077
S31905 26.73333

Establishing a match between cases

A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.

Example:

Grades %>% 
  left_join(Courses, by = c(sessionID = "sessionID")) %>%
  head(4)
sid grade sessionID dept level sem enroll iid
S31185 D+ session1784 M 100 FA1991 22 inst265
S31185 B+ session1785 k 100 FA1991 52 inst458
S31185 A- session1791 J 100 FA1993 22 inst223
S31185 B+ session1792 J 300 FA1993 20 inst235

The default value of by= is all variables with the same names in both tables.

This is not reliable unless you’ve checked.


Kinds of join

Different kinds of join have different answers to these questions.

Most popular joins: left_join() and inner_join()


No match of a left case to a right case

Less popular joins:


Multiple matches of right cases to a left case

left_join() and inner_join() do the same thing:

Less popular joins:


Example: Grade-point averages

Here are three data tables relating student grades in courses at their college

Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
sid grade sessionID
S31185 D+ session1784
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
sessionID dept level sem enroll iid
session1784 M 100 FA1991 22 inst265
GradePoint <- read.file("http://tiny.cc/mosaic/grade-to-number.csv")
grade gradepoint
A 4

Exercise: Which to Join?

For each of these, say what tables you would need to join and what the corresponding variables will be.

  1. How many students in each department?
  2. What fraction of grades are below B+ in each department?
  3. What’s the grade-point average (GPA) for each student?
  4. Grade-point average for each department or instructor

Activity: Bird Species

The following steps must be clearly shown in the file you submit so they can be graded as follows:

Assignment is worth a total of 10 points

Homework


teaching | stat 184 home | syllabus | piazza | canvas