Cases, Variables, and Values

A data table is comprises cases and variables.

Each variable comprises values (or levels).

There is no hard distinction between a variable and a value. What’s a variable in one situation may be a value in another, and vice versa.

A data table

Students
who x y dorm
Alice 7 English Sproul
Lesley 19 Mandarin Bigler
Yu 23 French Snyder

Cases, Variables, and Values

who x y dorm
Alice 7 English Sproul
Lesley 19 Mandarin Bigler
Yu 23 French Snyder

Two formats

Narrow

who dorm key value
Alice Sproul x 7
Lesley Bigler x 19
Yu Snyder x 23
Alice Sproul y English
Lesley Bigler y Mandarin
Yu Snyder y French

Wide

who x y dorm
Alice 7 English Sproul
Lesley 19 Mandarin Bigler
Yu 23 French Snyder

Gather — from Wide to Narrow

Syntax:

WideInput %>% 
  gather(key_name, value_name, ...)

The ... are the variables to be gathered together, e.g.

StudentsNarrow <- Students %>% gather(key, value, x, y)
who dorm key value
Alice Sproul x 7
Lesley Bigler x 19
Yu Snyder x 23
Alice Sproul y English
Lesley Bigler y Mandarin
Yu Snyder y French

Cases in Narrow data

Aside from Key and Value, all the other variables identify the case.

The gathering makes multiple rows for each row in the wide form. The variables not used for narrowing are copied into the new multiple cases.

Spread — from Narrow to Wide

Syntax:

NarrowInput %>% spread(key, value)

Process:

  1. Group by all variables other than Key and Value These groups become the cases
  2. Create new variables for each level in Key
  3. Within each group, spread out the Values into the new variables.
StudentsNarrow %>% spread(key, value)
who dorm x y
Alice Sproul 7 English
Lesley Bigler 19 Mandarin
Yu Snyder 23 French

Why?

Some operations are easy in wide format, but hard in narrow and vice versa

Excerpt from BabyNames

name sex count year
Alice F 593 1998
Alice F 650 1999
Lesley F 695 1998
Lesley M 15 1998
Lesley F 682 1999
Lesley M 21 1999
Yu F 8 1998
Yu M 9 1998
Yu F 11 1999
Yu M 10 1999

Questions:

  1. How many babies of each name and sex?
  2. For each name, is it primarily given to girls or boys? Which names are gender neutral?

In narrow format

  1. How many babies of each name and sex?
BabyTotals <-
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count))
name sex total
Alice F 546020
Alice M 1926
Lesley F 33604
Lesley M 4784
Yu F 354
Yu M 340

Easy!

In Wide format

  1. For each name, is it primarily given to girls or boys? Which names are gender neutral?
BabyTotalsWide <- BabyTotals %>% 
  spread(sex, total)
BabyTotalsWide
name F M
Alice 546020 1926
Lesley 33604 4784
Yu 354 340

With sexes side by side …

BabyTotalsWide %>% 
  mutate(gender_specificity = pmax(F/(M+F), M/(M+F)))
name F M gender_specificity
Alice 546020 1926 0.9964851
Lesley 33604 4784 0.8753777
Yu 354 340 0.5100865

Homework

Activity: Stocks & Dividends (DC p. 170)

Note: SKIP Dividends section beginning on p. 173

Assignment is worth a total of 10 points.

  • [2 points] Turn in HTML with embedded .Rmd file (e.g. “DataComputing simple” template)
  • [1 points] Use ggplot to chart stock price over time for a few stocks (stocks chosen may vary)
  • [2 points] Proper join of Prices table and Actions table
  • [2 points] Calculate Stock Profits and show result in “Wide” format of SalesDifference table
  • [2 points] Proper join of Prices table and Reference table
  • [1 points] Use ggplot to chart index over time for chosen stocks (stocks chosen may vary)

teaching | stat 184 home | syllabus | piazza | canvas