Cases, Variables, and Values
A data table comprises cases and variables.
Each variable comprises values (or levels).
There is no hard distinction between a variable and a value. What’s a
variable in one situation may be a value in another, and vice versa.
A data table
Beckmans
- Variables: `who`, `age`, `sex`, `indoors`, `outdoors`, `tv`, `breakfast`
- Values:
  - `who` is a name
  - `age` is numeric age
  - `sex` is the sex of the person
  - `indoors` is a hobby while indoors
  - `outdoors` is a hobby while outdoors
  - `tv` is a favorite television show (…other than YouTube)
  - `breakfast` is typical meal preference before school
- Cases: Beckmans {Eden, Jack, Hazel}
So what?
- Because a variable in one table can be a value in another, we can easily redefine what each row (case) of the data represents
  - Possibly motivated by the research question
  - Possibly motivated by a desire to join two data tables with different case definitions
  - Possibly motivated by a data visualization
- Also, some operations are easy in wide format but hard in narrow format, and vice versa
- We need tools that make it easy to switch back and forth
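As a minimal sketch of "switching back and forth", the round trip below uses `pivot_longer()` and its counterpart `pivot_wider()` from tidyr. The table `toy_wide` and its values are made up for illustration (they are not the real Beckmans data), and `pivot_wider()` is named here as one likely tool for the return trip rather than something this section has introduced.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (made-up values, not the real Beckmans table)
toy_wide <- tibble(
  who      = c("A", "B"),
  indoors  = c("reading", "chess"),
  outdoors = c("soccer", "hiking")
)

# Wide -> narrow: one row per person per setting
toy_narrow <- toy_wide %>%
  pivot_longer(cols = c(indoors, outdoors),
               names_to = "type", values_to = "preference")

# Narrow -> wide: back to one row per person
toy_back <- toy_narrow %>%
  pivot_wider(names_from = type, values_from = preference)
```

Here a case in `toy_narrow` is a person-setting pair, while a case in `toy_wide` and `toy_back` is a person.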
Note about the terms “wide” and “narrow”
A “wide” form and a “narrow” form of the same data can actually have the
same number of columns! This happens, for example, when the “key” has only
two outcomes: the two columns being combined are replaced by one column of
names and one column of values.
# this is "wide"--with 5 variables
Beckmans %>%
  select(who, age, sex, indoors, outdoors)
# this is "narrow"--with 5 variables
Beckmans %>%
  select(who, age, sex, indoors, outdoors) %>%
  pivot_longer(cols = c(indoors, outdoors),
               names_to = "type", values_to = "preference")
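The column counts can be checked directly. The sketch below uses a hypothetical table `toy` with made-up values (the real `Beckmans` table is not reproduced here): combining two columns leaves the column count unchanged, while combining three columns makes the narrow form one column smaller.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with made-up values
toy <- tibble(
  who      = c("A", "B"),
  indoors  = c("reading", "chess"),
  outdoors = c("soccer", "hiking"),
  tv       = c("cartoons", "cooking shows")
)

# Key with two outcomes: 2 columns in, 2 columns (type, preference) out
two_wide   <- toy %>% select(who, indoors, outdoors)
two_narrow <- two_wide %>%
  pivot_longer(cols = c(indoors, outdoors),
               names_to = "type", values_to = "preference")
ncol(two_wide)    # 3
ncol(two_narrow)  # 3

# Key with three outcomes: 3 columns in, still only 2 columns out
three_narrow <- toy %>%
  pivot_longer(cols = c(indoors, outdoors, tv),
               names_to = "type", values_to = "preference")
ncol(toy)           # 4
ncol(three_narrow)  # 3
```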
Questions:
- RQ 1. How many babies of each name and sex?
- RQ 2. For each name, is it primarily given to girls or boys? Which names are gender neutral?
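One plausible way to answer RQ 1 and arrive at a table with the sexes side by side is sketched below. It assumes a narrow input with columns `name`, `sex`, `year`, and `count`; the table `toy_names`, its placeholder names, and its counts are all hypothetical, and this is not necessarily how `BabyTotalsWide` was built.

```r
library(dplyr)
library(tidyr)

# Hypothetical narrow data: one row per name, sex, and year (made-up counts)
toy_names <- tibble(
  name  = c("NameA", "NameA", "NameA", "NameB", "NameB"),
  sex   = c("F", "M", "F", "M", "M"),
  year  = c(2000, 2000, 2001, 2000, 2001),
  count = c(10, 5, 20, 30, 40)
)

# RQ 1: total babies for each name and sex
toy_totals <- toy_names %>%
  group_by(name, sex) %>%
  summarise(total = sum(count), .groups = "drop_last")

# Put the sexes side by side; a name never given to one sex gets 0
toy_totals_wide <- toy_totals %>%
  pivot_wider(names_from = sex, values_from = total, values_fill = 0)
```

The resulting columns are `name`, `F`, and `M`, one row per name.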
With sexes side by side…
We can easily calculate the gender balance associated with each name
BabyTotalsWide <-
  BabyTotalsWide %>%
  rename(fem = F, male = M) %>%   # `F` is a terrible variable name (why?)
  mutate(prop_fem = fem / (male + fem),
         prop_male = male / (male + fem),
         name_specificity = pmax(prop_fem, prop_male))   # what does `pmax()` do?
BabyTotalsWide
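The two questions in the code comments can be explored with a small base R sketch. The proportions are made up for illustration. `pmax()` takes the maximum element by element across its arguments, whereas `max()` collapses everything to a single number; and `F` is a built-in shorthand for `FALSE`, which makes it a confusing column name.

```r
# Made-up proportions for three hypothetical names
prop_fem  <- c(0.98, 0.50, 0.03)
prop_male <- 1 - prop_fem

# pmax(): "parallel" maximum, one result per position
specificity <- pmax(prop_fem, prop_male)
specificity      # 0.98 0.50 0.97

# max(): a single overall maximum, which is not what we want per name
overall <- max(prop_fem, prop_male)
overall          # 0.98

# `F` is predefined in base R as an alias for FALSE
f_is_false <- identical(F, FALSE)
f_is_false       # TRUE
```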
pivot_longer(): when you have “Wide” and want “Narrow”
Syntax:
NarrowOutput <-
  WideInput %>%
  pivot_longer(cols = c(wide_var1, wide_var2, ...),
               names_to = "long_var1", values_to = "long_var2")
- The `cols` are the variables we want to combine (a.k.a. melt, stack, fold, gather)
  - e.g. `prop_fem` and `prop_male` in this case
BabyTotalsNarrow <-
  BabyTotalsWide %>%
  select(prop_fem, prop_male) %>%   # `name` is kept too (see the message below)
  pivot_longer(cols = c(prop_fem, prop_male),
               names_to = "sex", values_to = "proportion")
# message from select(): Adding missing grouping variables: `name`
BabyTotalsNarrow
With sexes stacked again…
We can make an intuitive bar chart (though some cleanup is needed…)
BabyTotalsNarrow %>%
  ggplot() +
  geom_bar(aes(x = name, fill = sex, weight = proportion))

With some improvements
- clean up the labels of the sexes
- add a title, source, & better axis labels (the default y-axis label, "count", was flat wrong)
# first, clean up the labels in `sex` for plotting
BabyTotalsNarrow %>%
  mutate(sex = if_else(sex == "prop_fem",
                       true = "female",
                       false = if_else(sex == "prop_male",
                                       true = "male",
                                       false = "unk")   # end of "inner" if_else()
                       )   # ends the "outer" if_else()
         ) %>%   # ends the mutate()
  ggplot() +
  geom_bar(aes(x = name, fill = sex, weight = proportion)) +
  ggtitle("Gender Balance among Names of Beckman Kids",
          subtitle = "source: U.S. Social Security Administration") +
  xlab("Name") +
  ylab("Proportion")
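The nested `if_else()` above works, but the parentheses are easy to lose track of. As an alternative sketch (not the approach used above), dplyr's `case_when()` expresses the same relabeling as a flat list of conditions. The vector `sex_raw` is a made-up stand-in for the `sex` column.

```r
library(dplyr)

# Hypothetical stand-in for the `sex` column, including an unexpected value
sex_raw <- c("prop_fem", "prop_male", "something_else")

# Nested if_else(), as in the plot code above
sex_nested <- if_else(sex_raw == "prop_fem", "female",
                      if_else(sex_raw == "prop_male", "male", "unk"))

# Same logic with case_when(): conditions are checked top to bottom
sex_flat <- case_when(
  sex_raw == "prop_fem"  ~ "female",
  sex_raw == "prop_male" ~ "male",
  TRUE                   ~ "unk"    # fallback for anything else
)

identical(sex_nested, sex_flat)  # TRUE
```

Inside the pipeline, the `case_when()` call would replace the nested `if_else()` within `mutate(sex = ...)`.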

