Cases, Variables, and Values

A data table is comprised of cases and variables.

Each variable comprises values (or levels).

There is no hard distinction between a variable and a value. What’s a variable in one situation may be a value in another, and vice versa.

A data table

Beckmans

Cases, Variables, and Values

Two formats

Narrow

Q: What IS different? Q: What is NOT different?

Beckmans %>% 
  pivot_longer(cols = c(indoors, outdoors, tv, breakfast), 
               names_to = "type", values_to = "preference")

(Too) Narrow

None of the data has actually been lost, but it’s not a helpful form since there isn’t a useful definition of “case”.

Wide

So what?

Note about terms “wide” and “narrow”

It’s actually possible for a “wider” form of the data and a “narrow” form of the data to have the same number of columns! For example, when the “key” only has two outcomes.

# this is "wide"--with 5 variables
Beckmans %>%
  select(who, age, sex, indoors, outdoors)
# this is "narrow"--with 5 variables
Beckmans %>% 
  select(who, age, sex, indoors, outdoors) %>%
  pivot_longer(cols = c(indoors, outdoors), names_to = "type", values_to = "preference")
NA

Excerpt from BabyNames

Questions:

RQ 1. How many babies of each name and sex? RQ 2. For each name, is it primarily given to girls or boys? Which names are gender neutral?

In narrow format

data("BabyNames", package = "dcData")

BabyNames <- 
  BabyNames %>%
  filter( name %in% c("Eden", "Jack", "Hazel")) 

RQ 1. How many babies of each name and sex?

BabyTotals <-
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count, na.rm = TRUE))
`summarise()` has grouped output by 'name'. You can override using the `.groups` argument.

Easy!

In Wide format

RQ 2. Which names are most gender neutral?

WideOutput <- 
  NarrowInput %>% 
  pivot_wider(names_from = var1, values_from = var2, values_fill = 0)
BabyTotalsWide <- 
  BabyTotals %>% 
  pivot_wider(names_from = sex, values_from = total, values_fill = 0)

BabyTotalsWide

With sexes side by side…

We can easily calculate balance associated with names


BabyTotalsWide <- 
  BabyTotalsWide %>% 
  rename(fem = F, male = M) %>%         # `F` is a terrible variable name (why?)
  mutate(prop_fem  = fem  / (male + fem), 
         prop_male = male / (male + fem),
         name_specificity = pmax(prop_fem, prop_male))    # what does `pmax()` do?

BabyTotalsWide

pivot_longer( )—when you have “Wide” and want “Narrow”

Syntax:

NarrowOutput <- 
  WideInput %>% 
  pivot_longer(cols = c(wide_var1, wide_var2, ...), names_to = "long_var1", values_to "long_var2")
BabyTotalsNarrow <- 
  BabyTotalsWide %>% 
  select(prop_fem, prop_male) %>%
  pivot_longer(cols = c(prop_fem, prop_male), names_to = "sex", values_to = "proportion") 
Adding missing grouping variables: `name`
BabyTotalsNarrow

With sexes stacked again…

We can make an intuitive bar chart (though some clean up is needed…)

BabyTotalsNarrow %>%
  ggplot() + 
  geom_bar(aes(x = name, fill = sex, weight = proportion)) 

NA

With some improvements

  • clean up labels of sexes
  • add title, source, & better axis labels (default y-axis label was flat wrong)
# first, clean up the labels in `sex` for plotting
BabyTotalsNarrow %>%
  mutate(sex = if_else(sex == "prop_fem", 
                       true = "female", 
                       false = if_else(sex == "prop_male", 
                                       true = "male", 
                                       false = "unk")  # end of "inner" if_else()
                       )                               # ends the "outer" if_else()
         ) %>%                                         # ends the mutate() 
  ggplot() + 
  geom_bar(aes(x = name, fill = sex, weight = proportion)) + 
  ggtitle("Gender Balance among Names of Beckman Kids", 
          subtitle = "source: U.S. Social Security Administration") + 
  xlab("Name") + 
  ylab("Proportion")

NA
---
title: "Redefining rows: Wide vs narrow data organization"
subtitle: "Data Computing Chapter 12"
author: "Prof Matthew Beckman"
output: 
  slidy_presentation: default
  html_notebook: default
---

```{r include=FALSE}
library(tidyverse)
library(printr)
library(mosaic)
library(dcData)
options(width = 80)
```

## Cases, Variables, and Values

A data table is comprised of *cases* and *variables*.

Each *variable* comprises *values* (or levels).

There is no hard distinction between a variable and a value.  What's a variable in one situation may be a value in another, and vice versa.

```{r echo=FALSE}

Beckmans <- 
  tribble(~who, ~age, ~sex, ~indoors, ~outdoors, ~tv, ~breakfast, 
          "Eden", 11, "F", "Gaming", "Swimming", "AFV", "sandwich", 
          "Jack", 8, "M", "Legos", "Camping", "Hawkeye", "cereal",
          "Hazel", 4, "F", "Puzzles", "Playground", "Bluey", "bagel or toast")

```

A data table

```{r}
Beckmans
```


## Cases, Variables, and Values

- Variables: `who`, `age`, `sex`, `indoors`, `outdoors`, `tv`, `breakfast`
    - Values: 
        - `who` is a name 
        - `age` is numeric age 
        - `sex` is sex of the person
        - `indoors` is a hobby while indoors
        - `outdoors` is a hobby while outdoors
        - `tv` is a favorite television show (...other than YouTube)
        - `breakfast` is typical meal preference before school
- Cases: Beckmans {Eden, Jack, Hazel}

```{r echo=FALSE}
Beckmans
```


## Two formats

- Data in Key/Value format are **narrow**
    - possible to get *too* narrow if the meaning of case becomes awkward
- The corresponding **wide** format has 
    - separate variables for each level in `key`
    - sets the values for those variables from the info in `value`

### Narrow

Q: What IS different?
Q: What is NOT different?

```{r}
Beckmans %>% 
  pivot_longer(cols = c(indoors, outdoors, tv, breakfast), 
               names_to = "type", values_to = "preference")
```



### (Too) Narrow

None of the data has actually been lost, but it's not a helpful form since there isn't a useful definition of "case".  

```{r echo=FALSE, warning=FALSE}
Beckmans %>% 
  mutate(age = as.character(age)) %>%  
  pivot_longer(cols = c(who, age, sex, indoors, outdoors, tv, breakfast), 
               names_to = "key", values_to = "value")
```

### Wide

```{r echo=FALSE, warning=FALSE}
Beckmans
```




## So what?

- This allows us to easily redefine how rows are presented in the data
    - Possibly motivated by the research question
    - Possibly motivated by desire to join two data tables with different case definitions
    - Possibly motivated by a data visualization
- Also, some operations are easy in wide format, but hard in narrow and *vice versa*
- We need tools that make it easy to switch back and forth


## Note about terms "wide" and "narrow"

It's actually possible for a "wider" form of the data and a "narrow" form of the data to have the same number of columns!  For example, when the "key" only has two outcomes. 

```{r}
# this is "wide"--with 5 variables
Beckmans %>%
  select(who, age, sex, indoors, outdoors)
```

```{r}
# this is "narrow"--with 5 variables
Beckmans %>% 
  select(who, age, sex, indoors, outdoors) %>%
  pivot_longer(cols = c(indoors, outdoors), names_to = "type", values_to = "preference")

```




### Excerpt from `BabyNames`

```{r echo=FALSE}
data("BabyNames", package = "dcData")

ShortBabyNames <- 
  BabyNames %>%
  filter( name %in% c("Eden", "Jack", "Hazel"), 
          year %in% 2012:2013 ) %>% 
  arrange(name, year)
```

```{r echo=FALSE}
ShortBabyNames
```

### Questions:

RQ 1. How many babies of each name and sex?
RQ 2. For each name, is it primarily given to girls or boys?  Which names are gender neutral?



## In narrow format

```{r}
data("BabyNames", package = "dcData")

BabyNames <- 
  BabyNames %>%
  filter( name %in% c("Eden", "Jack", "Hazel")) 
```

RQ 1. How many babies of each name and sex?

```{r}
BabyTotals <-
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count, na.rm = TRUE))
```

```{r echo=FALSE}
BabyTotals
```

Easy!



## In Wide format

RQ 2. Which names are most gender neutral?


```
WideOutput <- 
  NarrowInput %>% 
  pivot_wider(names_from = var1, values_from = var2, values_fill = 0)
```

- we want a new column for each category of `sex`, so `names_from = sex` 
    - we will "unstack" each available category as a new variable (a.k.a. cast, spread, unfold)
    - categories of `sex` were "F" and "M" in this example
- the values/entries for our new variables are coming from `total`, so `values_from = total`
    - `values_fill = 0` specifies a default value
    - e.g., in 2012 no one in `BabyNames` was assigned {Hazel, M}, so it should be zero


```{r}
BabyTotalsWide <- 
  BabyTotals %>% 
  pivot_wider(names_from = sex, values_from = total, values_fill = 0)

BabyTotalsWide
```

## With sexes side by side... 

We can easily calculate balance associated with names


```{r}

BabyTotalsWide <- 
  BabyTotalsWide %>% 
  rename(fem = F, male = M) %>%         # `F` is a terrible variable name (why?)
  mutate(prop_fem  = fem  / (male + fem), 
         prop_male = male / (male + fem),
         name_specificity = pmax(prop_fem, prop_male))    # what does `pmax()` do?

BabyTotalsWide
```

## `pivot_longer( )`---when you have "Wide" and want "Narrow"

Syntax:

```
NarrowOutput <- 
  WideInput %>% 
  pivot_longer(cols = c(wide_var1, wide_var2, ...), names_to = "long_var1", values_to "long_var2")
```

- The `cols` are the variables we want to combine (a.k.a. melt, stack, fold, gather)
- e.g. `prop_fem` and `prop_male` in this case

```{r echo=TRUE}
BabyTotalsNarrow <- 
  BabyTotalsWide %>% 
  select(prop_fem, prop_male) %>%
  pivot_longer(cols = c(prop_fem, prop_male), names_to = "sex", values_to = "proportion") 

BabyTotalsNarrow
```

## With sexes stacked again...

We can make an intuitive bar chart (though some clean up is needed...)

```{r}
BabyTotalsNarrow %>%
  ggplot() + 
  geom_bar(aes(x = name, fill = sex, weight = proportion)) 
  
```

#### With some improvements 

- clean up labels of sexes
- add title, source, & better axis labels (default y-axis label was flat wrong)

```{r}
# first, clean up the labels in `sex` for plotting
BabyTotalsNarrow %>%
  mutate(sex = if_else(sex == "prop_fem", 
                       true = "female", 
                       false = if_else(sex == "prop_male", 
                                       true = "male", 
                                       false = "unk")  # end of "inner" if_else()
                       )                               # ends the "outer" if_else()
         ) %>%                                         # ends the mutate() 
  ggplot() + 
  geom_bar(aes(x = name, fill = sex, weight = proportion)) + 
  ggtitle("Gender Balance among Names of Beckman Kids", 
          subtitle = "source: U.S. Social Security Administration") + 
  xlab("Name") + 
  ylab("Proportion")
  
```
