Agenda

Homework

  • (late) Generic R Notebook
  • (late) GitHub Startup
  • (Friday) Programming Notebooks: MDSR Chapter 04
  • (Sunday) Programming Notebooks: MDSR Chapter 03
  • (next week) MDSR Chap 3 Exercises
  • (next week) Programming Notebooks: MDSR Chapter 05

STAT 380 workflow

Git / GitHub

Dr. Beckman’s Directory to Organize GitHub Repositories

Markdown / R Markdown

Why RMarkdown?

RMarkdown lets us separate content from formatting

  • Alternative to WYSIWYG editors like MS Word & Google Docs
  • Easy to configure headings, lists, TOC, figures, captions, links, images, etc.
  • Change formatting globally with a CSS (cascading style sheet) file
  • Good-looking tables are easy
  • Typeset nice-looking mathematics using LaTeX (and some preview)

STAT 501 excerpt (inline math)…

Here’s the model:

\(\log\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right) = 3.309 - 0.288\,(\mathit{dist}_i)\)

where \(\hat{p}_i\) is the proportion that voted “Yes” for community \(i\).

Since the relationship on a logarithmic scale is hard to interpret, we back-transform before interpreting coefficients.

Note: If \(e^{\beta_1} = 1\), the odds ratio is 1: changing the explanatory variable leaves the odds of “success” (however that’s defined in the context of the study) unchanged, so there would be no relationship between the explanatory variable and the odds.
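
A minimal sketch of the back-transformation, using the slope from the fitted model above. The object names `b1` and `odds_ratio` are labels chosen for this illustration, not part of the original excerpt.

```r
# Slope copied from the fitted model above (log-odds scale)
b1 <- -0.288

# Back-transform: multiplicative change in the odds of a "Yes" vote
# for each one-unit increase in dist
odds_ratio <- exp(b1)
round(odds_ratio, 3)

# A slope of zero back-transforms to an odds ratio of exactly 1,
# which is the "no relationship" case described in the note
exp(0)
```

Here the odds ratio comes out at roughly 0.75, so each additional unit of dist multiplies the estimated odds of a “Yes” vote by about three quarters.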

STAT 501 excerpt (aligned to = sign)…

\[\begin{align*} \mathit{SSE} &= \sum_{i} \left( y_{i} - b_{0} - b_{1}x_{i} \right)^2 \\ \\ \frac{\partial \left( \mathit{SSE} \right)}{\partial b_{0}} &= \frac{\partial}{\partial b_{0}} \sum_{i} \left( y_{i} - b_{0} - b_{1}x_{i} \right)^2 \\ &= \left( -2 \right) \sum_{i} \left( y_{i} - b_{0} - b_{1}x_{i} \right) \\ &= \left( -2 \right) \left( \sum_{i} y_{i} - \sum_{i} b_{0} - \sum_{i} b_{1}x_{i} \right) \\ \\ \frac{\partial \left( \mathit{SSE} \right)}{\partial b_{1}} &= \frac{\partial}{\partial b_{1}} \sum_{i} \left( y_{i} - b_{0} - b_{1}x_{i} \right)^2 \\ &= \sum_{i} \left( -2x_{i} \right) \left( y_{i} - b_{0} - b_{1}x_{i} \right) \\ &= \left( -2 \right) \sum_{i} \left( x_{i} y_{i} - x_{i} b_{0} - b_{1}x_{i}^2 \right) \\ &= \left( -2 \right) \left( \sum_{i} x_{i} y_{i} - \sum_{i} x_{i} b_{0} - \sum_{i} b_{1}x_{i}^2 \right) \end{align*}\]
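
The excerpt stops at the partial derivatives. The usual next step (an assumption here, since it is not shown on the slide) is to set both to zero and solve for \(b_0\) and \(b_1\). The sketch below checks numerically that the least-squares estimates from `lm()` do make both derivatives vanish, using R's built-in `cars` data as a stand-in example.

```r
# Built-in `cars` data: stopping distance (dist) versus speed
fit <- lm(dist ~ speed, data = cars)
b <- unname(coef(fit))                        # b[1] = b0, b[2] = b1

# Residuals y_i - b0 - b1 * x_i at the fitted coefficients
res <- cars$dist - b[1] - b[2] * cars$speed

# Partial derivatives of SSE, evaluated at the least-squares estimates
dSSE_db0 <- -2 * sum(res)
dSSE_db1 <- -2 * sum(cars$speed * res)

# Both are zero up to floating-point rounding
c(dSSE_db0 = dSSE_db0, dSSE_db1 = dSSE_db1)
```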

(my) STAT 184 Philosophy

Individual Lego bricks are simple. [1]
A complex model made of Lego bricks. [2]

Tidy Data

A page from Francis Galton’s notebook.

# package
require(mosaicData)  

# intake data from `mosaicData` package
data("Galton")

head(Galton, 10)

tidyverse command chains

# packages
require(DataComputing)
require(dplyr)   # %>%, filter(), group_by(), summarise()
# intake data
data("BabyNames")
# a command chain
Hazels <- 
  BabyNames %>%
  filter(grepl("Hazel", name)) %>%
  group_by(year) %>%
  summarise(total = sum(count))
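
One way to read a command chain: `x %>% f(y)` is another way of writing `f(x, y)`, so the chain above is equivalent to a set of nested function calls. The sketch below shows this on a tiny made-up table (`Toy` is hypothetical and only stands in for `BabyNames`).

```r
library(dplyr)

# Hypothetical miniature stand-in for BabyNames
Toy <- tibble(
  name  = c("Hazel", "Hazel", "Hazelle", "Ava"),
  year  = c(2000, 2001, 2000, 2000),
  count = c(10, 12, 1, 50)
)

# Command chain: read top to bottom
Piped <-
  Toy %>%
  filter(grepl("Hazel", name)) %>%
  group_by(year) %>%
  summarise(total = sum(count))

# The same computation as nested calls: read inside out
Nested <- summarise(group_by(filter(Toy, grepl("Hazel", name)), year),
                    total = sum(count))

all.equal(Piped, Nested)
```

Both versions give one row per year; the chain simply puts the steps in the order they happen.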

Parts of speech

Discussion question

Hazels <- 
  BabyNames %>%
  filter(grepl("Hazel", name)) %>%
  group_by(year) %>%
  summarise(total = sum(count))

Just from the syntax, you should be able to tell which of the five different kinds of object each of these things is:

  • Hazels
  • BabyNames
  • filter
  • grepl
  • "Hazel"
  • name
  • group_by
  • year
  • summarise
  • total
  • sum
  • count

Small group discussion:

Kinds of join

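Different joins answer two questions differently: what to do when no right case matches a left case, and what to do when multiple right cases match a left case.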

Popular join types:


IF no right cases match the left case…

  • left_join(): Keep the left case and fill in the new variables (from the right table) with NA
  • inner_join(): Discard the left case.


IF multiple right cases match the left case…

left_join() and inner_join() do the same thing:

  • left_join(): Keep all combinations.
  • inner_join(): Keep all combinations.


Other useful joins:

  • full_join(): Keep every left case and every right case, matched or not, filling in NA where there is no match.
  • semi_join(): Keep only the left cases that have a match in the right table; no variables from the right table are added.
  • anti_join(): Keep only the left cases that have no match in the right table.
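
A small sketch of how the join types differ, using two made-up tables (`Left` and `Right` are hypothetical, not course data). Left case 1 has two matches on the right, left cases 2 and 3 have none, and right case 4 has no match on the left.

```r
library(dplyr)

# Hypothetical toy tables
Left  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
Right <- tibble(id = c(1, 1, 4), y = c("p", "q", "r"))

LeftJoin  <- left_join(Left, Right, by = "id")   # all left cases; NA in y for ids 2 and 3
InnerJoin <- inner_join(Left, Right, by = "id")  # only matched cases (id 1, both combinations)
FullJoin  <- full_join(Left, Right, by = "id")   # left_join result plus unmatched right case id 4
SemiJoin  <- semi_join(Left, Right, by = "id")   # left cases with a match; only Left's variables
AntiJoin  <- anti_join(Left, Right, by = "id")   # left cases without a match

# Compare the number of cases each join keeps
sapply(list(left = LeftJoin, inner = InnerJoin, full = FullJoin,
            semi = SemiJoin, anti = AntiJoin), nrow)
```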

Reshaping data

require(dplyr)
require(tidyr)   # gather() and spread() come from tidyr
# From http://stackoverflow.com/questions/1181060
Stocks <- tibble(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
# inspect data
Stocks
# gather/stack/melt--wide to narrow
StocksNarrow <- 
  Stocks %>% 
  gather(key = stock, value = price, X, Y, Z)  
StocksNarrow 
# spread/unstack/cast--narrow to wide
StocksWide <- 
  StocksNarrow %>% 
  spread(key = stock, value = price)
StocksWide
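
For reference, current versions of tidyr describe `gather()` and `spread()` as superseded by `pivot_longer()` and `pivot_wider()`. The sketch below repeats the reshaping above with the newer functions; the seed value is arbitrary and only there so the example is reproducible.

```r
library(dplyr)
library(tidyr)

set.seed(380)  # arbitrary seed, chosen for this sketch
Stocks <- tibble(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# wide to narrow (counterpart of gather())
StocksNarrow <-
  Stocks %>%
  pivot_longer(cols = c(X, Y, Z), names_to = "stock", values_to = "price")

# narrow to wide (counterpart of spread())
StocksWide <-
  StocksNarrow %>%
  pivot_wider(names_from = stock, values_from = price)

dim(StocksNarrow)  # one case per stock per day
dim(StocksWide)    # one case per day
```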

Three Important Concepts

  1. Data can be usefully organized into tables with “cases” and “variables.”
    • In “tidy data” every case is the same sort of thing (e.g. a person, a car, a year, a country in a year)
    • We sometimes even modify data to change what the cases represent so that they better make a point.
  2. Data graphics and “glyph-ready” data
    • each case corresponds to a “glyph” (mark) on the graph
    • each variable corresponds to a graphical attribute of that glyph such as x- or y-position, color, size, length, shape, etc.
    • same is true for more technical tools (e.g., models)
  3. When data are not yet in glyph-ready form, you can transform (i.e. wrangle) them into glyph-ready form.
    • Such transformations are accomplished by performing one or more of a small set of basic operations on data tables
    • This is the work of data “verbs”

Grammar of graphics

Anatomy of a data visualization

  • Frame
  • Glyph
  • Aesthetic
  • Scale
  • Guide (legend)
  • Facet

SAT Scores & Student Spending by State in the US

require(mdsr)
require(tidyverse)   # dplyr verbs and ggplot2, in case mdsr does not attach them
data("SAT_2010")
# 2010 SAT Scores grouped by participation rate
SAT_2010 <- 
  SAT_2010 %>%
  mutate(SAT_rate = cut(sat_pct, breaks = c(0, 30, 60, 100), 
                        labels = c("low", "medium", "high")))
# initialize scatter plot
g <- 
  SAT_2010 %>%
  ggplot(aes(x = expenditure, y = math)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  xlab("Average expenditure per student ($1000)") + 
  ylab("Average score on math SAT")
# base plot showing SAT Math vs Student expenditure
g

# color scatter plot by SAT_rate
g + 
  aes(color = SAT_rate)

Some Graphics Components

glyph

The basic graphical unit that represents one case. Other terms used include mark and symbol.

aesthetic

  • a visual property of a glyph such as position, size, shape, color, etc.
  • may be mapped based on data values: sex -> color
  • may be set to particular non-data related values: color is black
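
A short sketch of the difference between mapping and setting an aesthetic, using R's built-in `mtcars` data as a stand-in example. An aesthetic inside `aes()` is mapped from data; an aesthetic outside `aes()` is set to a fixed value.

```r
library(ggplot2)

# Mapped: color depends on a data value (number of cylinders)
p_mapped <-
  ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Set: color is a fixed, non-data value
p_set <-
  ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "black")

# Inspect the colors actually assigned to the glyphs
n_colors_mapped <- length(unique(layer_data(p_mapped)$colour))  # one color per cyl level
colors_set      <- unique(layer_data(p_set)$colour)             # a single fixed color
```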

scale

  • A mapping that translates data values into aesthetics.
  • example: male -> blue; female -> pink
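
A minimal sketch of a scale in ggplot2, using a made-up table (`Toy` and its values are hypothetical). `scale_color_manual()` spells out the mapping from data values to colors, and ggplot2 draws the matching legend (the guide) automatically.

```r
library(ggplot2)

# Hypothetical toy data
Toy <- data.frame(
  height = c(70, 64, 68, 62),
  weight = c(180, 130, 160, 120),
  sex    = c("male", "female", "male", "female")
)

p_scale <-
  ggplot(Toy, aes(x = height, y = weight, color = sex)) +
  geom_point() +
  scale_color_manual(values = c(male = "blue", female = "pink"))

# Colors assigned to each glyph, in row order
glyph_colors <- layer_data(p_scale)$colour
glyph_colors
```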

frame

  • The position scale describing how data are mapped to x and y

guide

  • An indication for the human viewer of the scale. This allows the viewer to translate aesthetics back into data values.
  • Examples: x- and y-axes, various sorts of legends

Facets

# facet by SAT participation rate
g + 
  facet_wrap( ~ SAT_rate)

Layers

Medicare costs in Pennsylvania among other states

require(mdsr)
data("MedicareCharges", package = "mdsr")   # from mdsr, not DataComputing
# Pennsylvania medicare charges
ChargesPA <- 
  MedicareCharges %>%
  filter(stateProvider == "PA")
# Plot Pennsylvania data
p <- 
  ChargesPA %>%
  ggplot(aes(x = reorder(drg, mean_charge), y = mean_charge)) + 
  geom_bar(fill = "gray", stat = "identity") +   # stat = "identity" ==> value dictates bar height
  ylab("Statewide Average Charges ($)") + 
  xlab("Medical Procedure (DRG)") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

# add layer to show other states for reference
p + 
  geom_point(data = MedicareCharges, size = 1, alpha = 0.3)

FiveThirtyEight reproduction (MDSR p. 50)

MDSR Reproduction

require(mdsr)
require(Hmisc)      # wtd.quantile()
require(ggthemes)   # theme_fivethirtyeight()
BabynamesDist <- make_babynames_dist()  # from mdsr
Joseph <- BabynamesDist %>%
  filter(name == "Joseph" & sex == "M")
# constructing the base plot
name_plot <- 
  Joseph %>%
  ggplot(aes(x = year)) + 
  geom_bar(stat = "identity", aes(y = count_thousands * alive_prob), 
           fill = "#b2d7e9", color = "white") + 
  geom_line(aes(y = count_thousands), size = 2) + 
  ylab("Number of People (thousands)") + 
  xlab(NULL)
# base plot
name_plot

median_yob <- wtd.quantile(x = Joseph$year, weights = Joseph$est_alive_today, probs = 0.5)
median_yob
#>  50% 
#> 1975 
name_plot <- 
  name_plot + 
  geom_bar(stat = "identity", color = "white", fill = "#008fd5", 
           aes(y = ifelse(year == median_yob, est_alive_today / 1000, 0))) 
# Figure 3.22: Josephs
name_plot + 
  ggtitle(label = "Age Distribution of American Boys Named Joseph", subtitle = "By year of birth") + 
  geom_text(x = 1935, y = 40, size = 3.5, family = "mono", label = "Number of Josephs \n born each year") + 
  geom_text(x = 1915, y = 13, size = 3.5, family = "mono", color = "#008fd5", 
            label = "Number of Josephs \n born each year \n estimated to be alive \n on Jan. 1, 2014") + 
  geom_text(x = 2003, y = 40, size = 3.5, family = "sans", color = "darkgray",
            label = "The median\nliving Joseph\nis 37 years old.") + 
  geom_curve(x = 1995, xend = 1974, y = 40, yend = 24, 
             arrow = arrow(length = unit(0.3, "cm")), curvature = 0.5) + 
  ylim(0, 42) + 
  theme_fivethirtyeight()

Three Important Concepts

  1. Data can be usefully organized into tables with “cases” and “variables.”
    • In “tidy data” every case is the same sort of thing (e.g. a person, a car, a year, a country in a year)
    • We sometimes even modify data to change what the cases represent so that they better make a point.
  2. Data graphics and “glyph-ready” data
    • each case corresponds to a “glyph” (mark) on the graph
    • each variable corresponds to a graphical attribute of that glyph such as x- or y-position, color, size, length, shape, etc.
    • same is true for more technical tools (e.g., models)
  3. When data are not yet in glyph-ready form, you can transform (i.e. wrangle) them into glyph-ready form.
    • Such transformations are accomplished by performing one or more of a small set of basic operations on data tables
    • This is the work of data “verbs”

  1. Source: “Lego Color Bricks” by Alan Chia. Licensed under CC BY-SA 2.0 via Wikimedia Commons.

  2. Source: “Trafalgar Legoland 2003” by Kaihsu Tai. Licensed under CC BY-SA 3.0 via Wikimedia Commons.

