Wide vs Narrow Data Tables

Data Computing Chapter 11

October 19, 2016

Cases, Variables, and Values

A data table is comprises cases and variables.

Each variable comprises values (or levels).

There is no hard distinction between a variable and a value. What’s a variable in one situation may be a value in another, and vice versa.

A data table

Students
who x y dorm
Alice 7 English Sproul
Lesley 19 Mandarin Bigler
Yu 23 French Snyder

Two formats

Narrow

who dorm key value
Alice Sproul x 7
Lesley Bigler x 19
Yu Snyder x 23
Alice Sproul y English
Lesley Bigler y Mandarin
Yu Snyder y French

Wide

who x y dorm
Alice 7 English Sproul
Lesley 19 Mandarin Bigler
Yu 23 French Snyder

Gather — from Wide to Narrow

Syntax:

WideInput %>% 
  gather(key_name, value_name, ...)

The ... are the variables to be gathered together, e.g.

StudentsNarrow <- Students %>% gather(key, value, x, y)
StudentsNarrow
who dorm key value
Alice Sproul x 7
Lesley Bigler x 19
Yu Snyder x 23
Alice Sproul y English
Lesley Bigler y Mandarin
Yu Snyder y French

Cases in Narrow data

Aside from Key and Value, all the other variables identify the case.

The gathering makes multiple rows for each row in the wide form. The variables not used for narrowing are copied into the new multiple cases.

Why?

Some operations are easy in wide format, but hard in narrow and vice versa

 

Excerpt from BabyNames

name sex count year
Alice F 593 1998
Alice F 650 1999
Lesley F 695 1998
Lesley M 15 1998
Lesley F 682 1999
Lesley M 21 1999
Yu F 8 1998
Yu M 9 1998
Yu F 11 1999
Yu M 10 1999

Questions:

  1. How many babies of each name and sex?
  2. For each name, is it primarily given to girls or boys? Which names are gender neutral?

In narrow format

  1. How many babies of each name and sex?
BabyTotals <-
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count))
name sex total
Alice F 546020
Alice M 1926
Lesley F 33604
Lesley M 4784
Yu F 354
Yu M 340

Easy!

In Wide format

  1. For each name, is it primarily given to girls or boys? Which names are gender neutral?
BabyTotalsWide <- BabyTotals %>% 
  spread(sex, total)
BabyTotalsWide
name F M
Alice 546020 1926
Lesley 33604 4784
Yu 354 340

With sexes side by side …

BabyTotalsWide %>% 
  mutate(gender_specificity = pmax(F/(M+F), M/(M+F)))
name F M gender_specificity
Alice 546020 1926 0.9964851
Lesley 33604 4784 0.8753777
Yu 354 340 0.5100865

Homework

 

Activity: Stocks & Dividends (DC p. 170)

Note: SKIP “Dividends” section beginning on p. 173

Assignment is worth a total of 10 points.