Cases, Variables, and Values
A data table is comprised of cases and
variables.
Each variable comprises values (or levels).
There is no hard distinction between a variable and a value. What’s a
variable in one situation may be a value in another, and vice versa.
A data table
Beckmans
Cases, Variables, and Values
- Variables:
who
, age
, sex
,
indoors
, outdoors
, tv
,
breakfast
- Values:
who
is a name
age
is numeric age
sex
is sex of the person
indoors
is a hobby while indoors
outdoors
is a hobby while outdoors
tv
is a favorite television show (…other than
YouTube)
breakfast
is typical meal preference before school
- Cases: Beckmans {Eden, Jack, Hazel}
So what?
- This allows us to easily redefine how rows are presented in the data
- Possibly motivated by the research question
- Possibly motivated by desire to join two data tables with different
case definitions
- Possibly motivated by a data visualization
- Also, some operations are easy in wide format, but hard in narrow
and vice versa
- We need tools that make it easy to switch back and forth
Note about terms “wide” and “narrow”
It’s actually possible for a “wider” form of the data and a “narrow”
form of the data to have the same number of columns! For example, when
the “key” only has two outcomes.
# this is "wide"--with 5 variables
Beckmans %>%
select(who, age, sex, indoors, outdoors)
# this is "narrow"--with 5 variables
Beckmans %>%
select(who, age, sex, indoors, outdoors) %>%
pivot_longer(cols = c(indoors, outdoors), names_to = "type", values_to = "preference")
NA
Questions:
RQ 1. How many babies of each name and sex? RQ 2. For each name, is
it primarily given to girls or boys? Which names are gender neutral?
With sexes side by side…
We can easily calculate balance associated with names
BabyTotalsWide <-
BabyTotalsWide %>%
rename(fem = F, male = M) %>% # `F` is a terrible variable name (why?)
mutate(prop_fem = fem / (male + fem),
prop_male = male / (male + fem),
name_specificity = pmax(prop_fem, prop_male)) # what does `pmax()` do?
BabyTotalsWide
pivot_longer( )
—when you have “Wide” and want
“Narrow”
Syntax:
NarrowOutput <-
WideInput %>%
pivot_longer(cols = c(wide_var1, wide_var2, ...), names_to = "long_var1", values_to "long_var2")
- The
cols
are the variables we want to combine (a.k.a.
melt, stack, fold, gather)
- e.g.
prop_fem
and prop_male
in this
case
BabyTotalsNarrow <-
BabyTotalsWide %>%
select(prop_fem, prop_male) %>%
pivot_longer(cols = c(prop_fem, prop_male), names_to = "sex", values_to = "proportion")
Adding missing grouping variables: `name`
BabyTotalsNarrow
With sexes stacked again…
We can make an intuitive bar chart (though some clean up is
needed…)
BabyTotalsNarrow %>%
ggplot() +
geom_bar(aes(x = name, fill = sex, weight = proportion))

NA
With some improvements
- clean up labels of sexes
- add title, source, & better axis labels (default y-axis label
was flat wrong)
# first, clean up the labels in `sex` for plotting
BabyTotalsNarrow %>%
mutate(sex = if_else(sex == "prop_fem",
true = "female",
false = if_else(sex == "prop_male",
true = "male",
false = "unk") # end of "inner" if_else()
) # ends the "outer" if_else()
) %>% # ends the mutate()
ggplot() +
geom_bar(aes(x = name, fill = sex, weight = proportion)) +
ggtitle("Gender Balance among Names of Beckman Kids",
subtitle = "source: U.S. Social Security Administration") +
xlab("Name") +
ylab("Proportion")

NA
---
title: "Redefining rows: Wide vs narrow data organization"
subtitle: "Data Computing Chapter 12"
author: "Prof Matthew Beckman"
output: 
  slidy_presentation: default
  html_notebook: default
---

```{r include=FALSE}
library(tidyverse)
library(printr)
library(mosaic)
library(dcData)
options(width = 80)
```

## Cases, Variables, and Values

A data table is comprised of *cases* and *variables*.

Each *variable* comprises *values* (or levels).

There is no hard distinction between a variable and a value.  What's a variable in one situation may be a value in another, and vice versa.

```{r echo=FALSE}

Beckmans <- 
  tribble(~who, ~age, ~sex, ~indoors, ~outdoors, ~tv, ~breakfast, 
          "Eden", 11, "F", "Gaming", "Swimming", "AFV", "sandwich", 
          "Jack", 8, "M", "Legos", "Camping", "Hawkeye", "cereal",
          "Hazel", 4, "F", "Puzzles", "Playground", "Bluey", "bagel or toast")

```

A data table

```{r}
Beckmans
```


## Cases, Variables, and Values

- Variables: `who`, `age`, `sex`, `indoors`, `outdoors`, `tv`, `breakfast`
    - Values: 
        - `who` is a name 
        - `age` is numeric age 
        - `sex` is sex of the person
        - `indoors` is a hobby while indoors
        - `outdoors` is a hobby while outdoors
        - `tv` is a favorite television show (...other than YouTube)
        - `breakfast` is typical meal preference before school
- Cases: Beckmans {Eden, Jack, Hazel}

```{r echo=FALSE}
Beckmans
```


## Two formats

- Data in Key/Value format are **narrow**
    - possible to get *too* narrow if the meaning of case becomes awkward
- The corresponding **wide** format has 
    - separate variables for each level in `key`
    - sets the values for those variables from the info in `value`

### Narrow

Q: What IS different?
Q: What is NOT different?

```{r}
Beckmans %>% 
  pivot_longer(cols = c(indoors, outdoors, tv, breakfast), 
               names_to = "type", values_to = "preference")
```



### (Too) Narrow

None of the data has actually been lost, but it's not a helpful form since there isn't a useful definition of "case".  

```{r echo=FALSE, warning=FALSE}
Beckmans %>% 
  mutate(age = as.character(age)) %>%  
  pivot_longer(cols = c(who, age, sex, indoors, outdoors, tv, breakfast), 
               names_to = "key", values_to = "value")
```

### Wide

```{r echo=FALSE, warning=FALSE}
Beckmans
```




## So what?

- This allows us to easily redefine how rows are presented in the data
    - Possibly motivated by the research question
    - Possibly motivated by desire to join two data tables with different case definitions
    - Possibly motivated by a data visualization
- Also, some operations are easy in wide format, but hard in narrow and *vice versa*
- We need tools that make it easy to switch back and forth


## Note about terms "wide" and "narrow"

It's actually possible for a "wider" form of the data and a "narrow" form of the data to have the same number of columns!  For example, when the "key" only has two outcomes. 

```{r}
# this is "wide"--with 5 variables
Beckmans %>%
  select(who, age, sex, indoors, outdoors)
```

```{r}
# this is "narrow"--with 5 variables
Beckmans %>% 
  select(who, age, sex, indoors, outdoors) %>%
  pivot_longer(cols = c(indoors, outdoors), names_to = "type", values_to = "preference")

```




### Excerpt from `BabyNames`

```{r echo=FALSE}
data("BabyNames", package = "dcData")

ShortBabyNames <- 
  BabyNames %>%
  filter( name %in% c("Eden", "Jack", "Hazel"), 
          year %in% 2012:2013 ) %>% 
  arrange(name, year)
```

```{r echo=FALSE}
ShortBabyNames
```

### Questions:

RQ 1. How many babies of each name and sex?
RQ 2. For each name, is it primarily given to girls or boys?  Which names are gender neutral?



## In narrow format

```{r}
data("BabyNames", package = "dcData")

BabyNames <- 
  BabyNames %>%
  filter( name %in% c("Eden", "Jack", "Hazel")) 
```

RQ 1. How many babies of each name and sex?

```{r}
BabyTotals <-
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count, na.rm = TRUE))
```

```{r echo=FALSE}
BabyTotals
```

Easy!



## In Wide format

RQ 2. Which names are most gender neutral?


```
WideOutput <- 
  NarrowInput %>% 
  pivot_wider(names_from = var1, values_from = var2, values_fill = 0)
```

- we want a new column for each category of `sex`, so `names_from = sex` 
    - we will "unstack" each available category as a new variable (a.k.a. cast, spread, unfold)
    - categories of `sex` were "F" and "M" in this example
- the values/entries for our new variables are coming from `total`, so `values_from = total`
    - `values_fill = 0` specifies a default value
    - e.g., in 2012 no one in `BabyNames` was assigned {Hazel, M}, so it should be zero


```{r}
BabyTotalsWide <- 
  BabyTotals %>% 
  pivot_wider(names_from = sex, values_from = total, values_fill = 0)

BabyTotalsWide
```

## With sexes side by side... 

We can easily calculate balance associated with names


```{r}

BabyTotalsWide <- 
  BabyTotalsWide %>% 
  rename(fem = F, male = M) %>%         # `F` is a terrible variable name (why?)
  mutate(prop_fem  = fem  / (male + fem), 
         prop_male = male / (male + fem),
         name_specificity = pmax(prop_fem, prop_male))    # what does `pmax()` do?

BabyTotalsWide
```

## `pivot_longer( )`---when you have "Wide" and want "Narrow"

Syntax:

```
NarrowOutput <- 
  WideInput %>% 
  pivot_longer(cols = c(wide_var1, wide_var2, ...), names_to = "long_var1", values_to "long_var2")
```

- The `cols` are the variables we want to combine (a.k.a. melt, stack, fold, gather)
- e.g. `prop_fem` and `prop_male` in this case

```{r echo=TRUE}
BabyTotalsNarrow <- 
  BabyTotalsWide %>% 
  select(prop_fem, prop_male) %>%
  pivot_longer(cols = c(prop_fem, prop_male), names_to = "sex", values_to = "proportion") 

BabyTotalsNarrow
```

## With sexes stacked again...

We can make an intuitive bar chart (though some clean up is needed...)

```{r}
BabyTotalsNarrow %>%
  ggplot() + 
  geom_bar(aes(x = name, fill = sex, weight = proportion)) 
  
```

#### With some improvements 

- clean up labels of sexes
- add title, source, & better axis labels (default y-axis label was flat wrong)

```{r}
# first, clean up the labels in `sex` for plotting
BabyTotalsNarrow %>%
  mutate(sex = if_else(sex == "prop_fem", 
                       true = "female", 
                       false = if_else(sex == "prop_male", 
                                       true = "male", 
                                       false = "unk")  # end of "inner" if_else()
                       )                               # ends the "outer" if_else()
         ) %>%                                         # ends the mutate() 
  ggplot() + 
  geom_bar(aes(x = name, fill = sex, weight = proportion)) + 
  ggtitle("Gender Balance among Names of Beckman Kids", 
          subtitle = "source: U.S. Social Security Administration") + 
  xlab("Name") + 
  ylab("Proportion")
  
```
