Cases, Variables, and Values

A data table is comprises cases and variables.

Each variable comprises values (or levels).

There is no hard distinction between a variable and a value. What’s a variable in one situation may be a value in another, and vice versa.

A data table

Students

who	x	y	dorm
Alice	7	English	Sproul
Lesley	19	Mandarin	Bigler
Yu	23	French	Snyder

Cases, Variables, and Values

Variables: Who, X, and Y
- Values:
  - Who is a person’s name
  - X is numeric
  - Y is a language name
  - dorm is a building name
Cases: Alice, Lesley, Yu

who	x	y	dorm
Alice	7	English	Sproul
Lesley	19	Mandarin	Bigler
Yu	23	French	Snyder

Two formats

Data in Key/Value format are narrow
The corresponding wide format has
- separate variables for each level in Key
- sets the values for those variables from the info in Value

Narrow

who	dorm	key	value
Alice	Sproul	x	7
Lesley	Bigler	x	19
Yu	Snyder	x	23
Alice	Sproul	y	English
Lesley	Bigler	y	Mandarin
Yu	Snyder	y	French

Wide

who	x	y	dorm
Alice	7	English	Sproul
Lesley	19	Mandarin	Bigler
Yu	23	French	Snyder

Gather — from Wide to Narrow

Syntax:

WideInput %>% 
  gather(key_name, value_name, ...)

The ... are the variables to be gathered together, e.g.

StudentsNarrow <- Students %>% gather(key, value, x, y)

who	dorm	key	value
Alice	Sproul	x	7
Lesley	Bigler	x	19
Yu	Snyder	x	23
Alice	Sproul	y	English
Lesley	Bigler	y	Mandarin
Yu	Snyder	y	French

Cases in Narrow data

Aside from Key and Value, all the other variables identify the case.

The gathering makes multiple rows for each row in the wide form. The variables not used for narrowing are copied into the new multiple cases.

Spread — from Narrow to Wide

Syntax:

NarrowInput %>% spread(key, value)

Process:

Group by all variables other than Key and Value These groups become the cases
Create new variables for each level in Key
Within each group, spread out the Values into the new variables.

StudentsNarrow %>% spread(key, value)

who	dorm	x	y
Alice	Sproul	7	English
Lesley	Bigler	19	Mandarin
Yu	Snyder	23	French

Why?

Some operations are easy in wide format, but hard in narrow and vice versa

Excerpt from `BabyNames`

name	sex	count	year
Alice	F	593	1998
Alice	F	650	1999
Lesley	F	695	1998
Lesley	M	15	1998
Lesley	F	682	1999
Lesley	M	21	1999
Yu	F	8	1998
Yu	M	9	1998
Yu	F	11	1999
Yu	M	10	1999

Questions:

How many babies of each name and sex?
For each name, is it primarily given to girls or boys? Which names are gender neutral?

In narrow format

How many babies of each name and sex?

BabyTotals <-
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count))

name	sex	total
Alice	F	546020
Alice	M	1926
Lesley	F	33604
Lesley	M	4784
Yu	F	354
Yu	M	340

Easy!

In Wide format

For each name, is it primarily given to girls or boys? Which names are gender neutral?

sex is the key variable
total is the value variable

BabyTotalsWide <- BabyTotals %>% 
  spread(sex, total)
BabyTotalsWide

name	F	M
Alice	546020	1926
Lesley	33604	4784
Yu	354	340

With sexes side by side …

BabyTotalsWide %>% 
  mutate(gender_specificity = pmax(F/(M+F), M/(M+F)))

name	F	M	gender_specificity
Alice	546020	1926	0.9964851
Lesley	33604	4784	0.8753777
Yu	354	340	0.5100865

Homework

Activity: Stocks & Dividends (DC p. 170)
- SKIP Dividends section beginning on p. 173
DC Ch 11 Exercises: 11.1, 11.3, 11.4, 11.5
DC chapter 12 & 15 reading quiz on Canvas
Start thinking about your Final Project

Activity: Stocks & Dividends (DC p. 170)

Note: SKIP Dividends section beginning on p. 173

Assignment is worth a total of 10 points.

[2 points] Turn in HTML with embedded .Rmd file (e.g. “DataComputing simple” template)
[1 points] Use ggplot to chart stock price over time for a few stocks (stocks chosen may vary)
[2 points] Proper join of Prices table and Actions table
[2 points] Calculate Stock Profits and show result in “Wide” format of SalesDifference table
[2 points] Proper join of Prices table and Reference table
[1 points] Use ggplot to chart index over time for chosen stocks (stocks chosen may vary)

teaching | stat 184 home | syllabus | piazza | canvas

Week 9 Class Notes

Cases, Variables, and Values

Cases, Variables, and Values

Two formats

Narrow

Wide

Gather — from Wide to Narrow

Cases in Narrow data

Spread — from Narrow to Wide

Why?

Excerpt from BabyNames

In narrow format

In Wide format

With sexes side by side …

Homework

Activity: Stocks & Dividends (DC p. 170)

Excerpt from `BabyNames`