Small Group Discussion:
- What was the muddiest point from this chapter?
- Why do you think it’s valuable to hash things out by hand first before writing any computer commands?
Three Important Concepts
- As we’ve been discussing, data can be usefully organized into tables with “cases” and “variables.”
- In “tidy data,” every case is the same sort of thing, e.g. a person, a car, a year, a country in a year.
- We sometimes modify data to change what the cases represent so that the table better supports a point.
- Data graphics can be constructed easily when each case corresponds to a “glyph” (mark) on the graph, and each variable to a graphical attribute of that glyph such as x- or y-position, color, size, length, shape, etc. Such data are called “glyph-ready.” (The same is true for more technical presentations of data, e.g., models, predictions, etc.: once the data are set up with appropriate cases and variables, the presentation is straightforward.)
- When data are not yet in glyph-ready form, you can transform them into glyph-ready form.
- Such transformations are accomplished by performing one or more of a small set of basic operations on data tables.
- This is the work of data “verbs.”
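As a rough illustration of a data verb changing what a case represents, here is a minimal sketch. It assumes the dplyr package is installed, and the `orders` table and its column names are invented for the example.

```r
# Sketch only: assumes dplyr is installed; `orders` is invented toy data
library(dplyr)

# Raw table: each case is one order
orders <- data.frame(
  order_id = 1:6,
  drink    = c("latte", "tea", "latte", "mocha", "latte", "tea")
)

# group_by() + summarise() change what a case represents:
# in `tally`, each case is a drink, with one count per drink
tally <- orders %>%
  group_by(drink) %>%
  summarise(n_orders = n())

print(tally)
```

The summarized table has one row per drink, so each row could map directly to one bar in a bar chart.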
Today’s Agenda
- Introduce some software and commands that …
- make it easy to access data tables and see how they are structured
- For example: data(), View(), help()
- (more coming in Chapter 9)
- implement two of the data verbs: group_by(), summarise()
- Work in groups on HELPrct activity
Learning about the raw data
Let’s use the following commands to learn about our data:
- data(): if your data are part of a package, this loads the data set into your R environment
- View(): run in the console of RStudio to open a spreadsheet of the raw data set
- help(): if your data are part of a package, this opens a help window with details about the data
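A minimal sketch of these commands in use. The stand-in here is `mtcars`, which ships with base R, so nothing extra needs to be installed; the interactive calls are commented out because they only make sense in an RStudio session.

```r
# mtcars is part of base R's datasets package; used here only as a stand-in
data("mtcars")      # load the data set into the environment
head(mtcars, 3)     # print the first three cases

# help("mtcars")    # opens the documentation page (interactive use)
# View(mtcars)      # opens a spreadsheet tab (run in the RStudio console)
```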
Here are three (unrelated) data sets:
- HELPrct (from the mosaicData package)
- Minneapolis2013 (from the DataComputing package)
- CountryData (from the DataComputing package)
- What is the setting for the data? That is, what are they about?
- How many cases are there?
- How many variables are there? What are their names?
- Pick out three of the variables and, for each one, say
- whether the variable is quantitative or categorical
- if categorical, what some of its levels are
- if quantitative, what its units of measurement are.
- Describe, in everyday terms, what kind of thing cases represent in each of the data tables.
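Questions like these can often be answered with a few base R functions. The sketch below uses the built-in `iris` data as a stand-in; the same calls should work on any data frame, though units of measurement usually have to come from the help page rather than from the data.

```r
data("iris")   # built-in data set, used here only as a stand-in

nrow(iris)     # number of cases
ncol(iris)     # number of variables
names(iris)    # variable names
str(iris)      # type of each variable (num = quantitative, Factor = categorical)

levels(iris$Species)   # levels of a categorical variable
```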
Why we wrangle
Consider the Minneapolis 2013 election data. Here’s a bar chart that might be used to show the election results:

This graph reflects the following data table (only part of which is shown):
## # A tibble: 6 x 2
## First votes
## <chr> <int>
## 1 BETSY HODGES 28935
## 2 MARK ANDREW 19584
## 3 DON SAMUELS 8335
## 4 CAM WINTON 7511
## 5 JACKIE CHERRYHOMES 3524
## 6 BOB FINE 2094
Compare the Minneapolis2013 data table and the data table printed above.
- Do they have the same number of cases?
- Do the cases in the two tables represent the same sort of thing?
- Do the two tables have any variable(s) in common?
- Speculate on how the two tables are related to one another.
Activity: Data verbs for summarizing and grouping (HELPrct)
Instructions:
- Complete the three tasks below as a group
- Submit an HTML file with embedded .Rmd (i.e., use the class template) to “Activity: Data Verbs (HELPrct)” on Canvas
- Submit one for each group by Friday at 11:59pm
Set Up:
# The HELPrct data are available in the mosaicData package
library(mosaicData)
# Load the HELPrct data set into our RStudio environment
data("HELPrct")
# Also, use View(HELPrct) in the console to open a tab in RStudio and see the data set
Task 1:
summarise(): Find an expression involving summarise() and HELPrct that will produce the following.
- number of people (cases) in the HELPrct study
- total number of times people entered a detox program in the past 6 months (measured at baseline), summed over all the people in HELPrct (silly)
- mean time (in days) to first use of any substance post-detox for all the people in HELPrct
Task 2:
group_by(): Repeat Task 1 above, but calculate the results group-by-group and write a sentence or two about what you observe in the results for each of the following:
- males versus females
- homeless or not
- substance
- break down the homeless versus housed further, by sex
- break down the homeless versus housed further, by substance
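For the general syntax pattern, here is a sketch on a different data set so the tasks themselves are left to you. It assumes the dplyr package is installed and uses the built-in `mtcars` data; the output column names (`count`, `total_hp`, `mean_mpg`) are arbitrary choices for the example.

```r
# Sketch only: assumes dplyr is installed; mtcars stands in for the activity data
library(dplyr)
data("mtcars")

# Whole-table summary: one row comes out
overall <- mtcars %>%
  summarise(count = n(), total_hp = sum(hp), mean_mpg = mean(mpg))

# The same summary, computed group by group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(count = n(), total_hp = sum(hp), mean_mpg = mean(mpg))

# Grouping by two variables breaks each group down further
by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(count = n(), mean_mpg = mean(mpg), .groups = "drop")

print(overall)
print(by_cyl)
print(by_cyl_am)
```

If a variable contains missing values, sum() and mean() return NA unless you pass na.rm = TRUE, so check the result before interpreting it.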
Task 3:
Include one or more interesting plots of the data involving at least 3 variables per plot. Write a few sentences to explain the story that your plot tells about these data. You can use one of the relationships that you studied in Task 2, or you can explore a different group of variables in the HELPrct data that shows something interesting.
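One possible way to put three variables in a single plot is to map two of them to position and a third to color. The sketch below assumes the ggplot2 package is installed (the notes do not prescribe a plotting package) and uses the built-in `mtcars` data rather than HELPrct.

```r
# Sketch only: assumes ggplot2 is installed; mtcars is a stand-in data set
library(ggplot2)
data("mtcars")

# Three variables: weight (x), fuel economy (y), number of cylinders (color)
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

print(p)
```

Each point is one case (a car), which is the glyph-ready idea from earlier in these notes.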
Homework
- Submit Activity: Data Verbs (HELPrct) on Canvas by midnight Friday
- DC Ch 7 Exercises (upload HTML to Canvas): 7.3, 7.4 (a. & c.), 7.7 (skip d.), 7.10 (skip 3rd bullet), 7.11
- Note: Problem 7.3 should say: “… transformation function or a reduction function or neither.”
- DC chapter 9 reading quiz on Canvas