Cases, Variables, and Values

A data table consists of cases and variables.

Each variable comprises values (or levels).

There is no hard distinction between a variable and a value. What’s a variable in one situation may be a value in another, and vice versa.
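
To make this concrete, here is a minimal sketch with a made-up table (the names, settings, and "yes"/"no" answers are hypothetical, not the Beckmans data). The words "indoors" and "outdoors" start out as values of a variable and end up as variables themselves.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one row per person per setting
narrow <- tribble(
  ~who,  ~type,      ~preference,
  "Ann", "indoors",  "yes",
  "Ann", "outdoors", "no",
  "Bo",  "indoors",  "no",
  "Bo",  "outdoors", "yes"
)

# Here "indoors" and "outdoors" are VALUES of the variable `type`.
# After widening, the same two words become VARIABLES (column names).
wide <- pivot_wider(narrow, names_from = type, values_from = preference)

names(wide)
```

Neither table is more correct than the other. Which one is useful depends on what you want a case to be.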

A data table

Beckmans

Two formats

Narrow

Q: What IS different?

Q: What is NOT different?

library(tidyverse)  # provides %>%, select(), and pivot_longer()

Beckmans %>% 
  pivot_longer(cols = c(indoors, outdoors, tv, breakfast), 
               names_to = "type", values_to = "preference")

(Too) Narrow

None of the data has actually been lost, but it’s not a helpful form since there isn’t a useful definition of “case”.

Wide

So what?

Note about terms “wide” and “narrow”

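It’s actually possible for a “wide” form of the data and a “narrow” form of the data to have the same number of columns! This happens, for example, when the “key” (the column named by `names_to`) takes only two distinct values: two wide columns are replaced by one names column plus one values column.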

# this is "wide"--with 5 variables
Beckmans %>%
  select(who, age, sex, indoors, outdoors)
# this is "narrow"--with 5 variables
Beckmans %>% 
  select(who, age, sex, indoors, outdoors) %>%
  pivot_longer(cols = c(indoors, outdoors), names_to = "type", values_to = "preference")

Excerpt from BabyNames

Questions:

RQ 1. How many babies of each name and sex?

RQ 2. For each name, is it primarily given to girls or boys? Which names are gender neutral?

In narrow format

data("BabyNames", package = "dcData")

BabyNames <- 
  BabyNames %>%
  filter( name %in% c("Eden", "Jack", "Hazel")) 

RQ 1. How many babies of each name and sex?

BabyTotals <-
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count, na.rm = TRUE))
#> `summarise()` has grouped output by 'name'. You can override using the `.groups` argument.
# Note: the result is still grouped by `name`; only the last grouping level (`sex`) is dropped.

Easy!

In Wide format

RQ 2. Which names are most gender neutral?

# general syntax (placeholder names)
WideOutput <- 
  NarrowInput %>% 
  pivot_wider(names_from = var1, values_from = var2, values_fill = 0)

# applied to the baby name totals
BabyTotalsWide <- 
  BabyTotals %>% 
  pivot_wider(names_from = sex, values_from = total, values_fill = 0)

BabyTotalsWide
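
The role of `values_fill = 0` is easier to see on a small example. The sketch below uses made-up totals (the names and counts are hypothetical, not taken from BabyNames) in which one name has no "M" row at all.

```r
library(tidyr)
library(tibble)

# Hypothetical totals: no "M" row exists for the made-up name "Zed"
totals <- tribble(
  ~name, ~sex, ~total,
  "Ava", "F",  10,
  "Ava", "M",  2,
  "Zed", "F",  5
)

# Without values_fill, the missing name/sex combination becomes NA
wide_na <- pivot_wider(totals, names_from = sex, values_from = total)

# With values_fill = 0, the missing combination is recorded as a count of zero
wide_zero <- pivot_wider(totals, names_from = sex, values_from = total,
                         values_fill = 0)
```

For counts, zero is usually the sensible fill: "no babies recorded" is a count of zero, and later arithmetic will not propagate `NA`.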

With sexes side by side…

We can easily calculate the gender balance of each name


BabyTotalsWide <- 
  BabyTotalsWide %>% 
  rename(fem = F, male = M) %>%         # `F` is a terrible variable name (why?)
  mutate(prop_fem  = fem  / (male + fem), 
         prop_male = male / (male + fem),
         name_specificity = pmax(prop_fem, prop_male))    # what does `pmax()` do?

BabyTotalsWide
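
As a hint for the `pmax()` question, here is a small sketch comparing it with `max()`. The proportions are made up for illustration.

```r
# Hypothetical proportions for three names
prop_fem  <- c(0.9, 0.5, 0.2)
prop_male <- 1 - prop_fem

# max() collapses both vectors to a single number
overall_max <- max(prop_fem, prop_male)

# pmax() compares element by element and returns one value per position
row_max <- pmax(prop_fem, prop_male)

length(overall_max)
length(row_max)
```

Inside `mutate()`, the element-by-element version is the one that gives a separate answer for each row.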

pivot_longer( )—when you have “Wide” and want “Narrow”

Syntax:

# general syntax (placeholder names)
NarrowOutput <- 
  WideInput %>% 
  pivot_longer(cols = c(wide_var1, wide_var2, ...), names_to = "long_var1", values_to = "long_var2")

# applied to the wide baby name totals
BabyTotalsNarrow <- 
  BabyTotalsWide %>% 
  select(name, prop_fem, prop_male) %>%   # keep `name` explicitly; it is also the grouping variable
  pivot_longer(cols = c(prop_fem, prop_male), names_to = "sex", values_to = "proportion") 

BabyTotalsNarrow

With sexes stacked again…

We can make an intuitive bar chart (though some clean up is needed…)

BabyTotalsNarrow %>%
  ggplot() + 
  geom_bar(aes(x = name, fill = sex, weight = proportion)) 

With some improvements

  • clean up labels of sexes
  • add title, source, & better axis labels (default y-axis label was flat wrong)

# first, clean up the labels in `sex` for plotting
BabyTotalsNarrow %>%
  mutate(sex = if_else(sex == "prop_fem", 
                       true = "female", 
                       false = if_else(sex == "prop_male", 
                                       true = "male", 
                                       false = "unk")  # end of "inner" if_else()
                       )                               # ends the "outer" if_else()
         ) %>%                                         # ends the mutate() 
  ggplot() + 
  geom_bar(aes(x = name, fill = sex, weight = proportion)) + 
  ggtitle("Gender Balance among Names of Beckman Kids", 
          subtitle = "source: U.S. Social Security Administration") + 
  xlab("Name") + 
  ylab("Proportion")
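
The nested `if_else()` calls get harder to read as the number of labels grows. One possible alternative, shown here as a sketch and not the approach used in these slides, is `case_when()` from dplyr. The input vector is made up for illustration.

```r
library(dplyr)

# Hypothetical codes, including one that matches neither known label
sex_codes <- c("prop_fem", "prop_male", "something_else")

# Conditions are checked in order; the first one that is TRUE wins
labels <- case_when(
  sex_codes == "prop_fem"  ~ "female",
  sex_codes == "prop_male" ~ "male",
  TRUE                     ~ "unk"   # fallback, like the innermost `false =`
)

labels
```

The same expression could replace the nested `if_else()` inside `mutate()`, with `sex` in place of `sex_codes`.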
