Chapter 16: Key Ideas

Some Exploits in the Land of Regex

Rejecting Footnotes with Regex

library(rvest)  # read_html(), html_nodes(), html_table()
library(dplyr)  # %>% and mutate()

page <- "https://en.wikipedia.org/wiki/Mile_run_world_record_progression"
XPATH <- '//*[@id="mw-content-text"]/table'

table_list <- page %>%
  read_html() %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)

IAAFmen <- table_list[[4]] 
head(IAAFmen, 3)
Time    Auto  Athlete          Nationality    Date               Venue
4:14.4        John Paul Jones  United States  31 May 1913[5]     Allston, Mass.
4:12.6        Norman Taber     United States  16 July 1915[5]    Allston, Mass.
4:10.4        Paavo Nurmi      Finland        23 August 1923[5]  Stockholm

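Now we can use mutate() and gsub() to strip the footnote markers from Date. The pattern "\\[.\\]$" matches a literal "[", any single character, and a literal "]" at the very end of the string: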

IAAFmen %>%
  mutate(Date = gsub("\\[.\\]$", "", Date)) %>%
  head(3)
Time    Auto  Athlete          Nationality    Date            Venue
4:14.4        John Paul Jones  United States  31 May 1913     Allston, Mass.
4:12.6        Norman Taber     United States  16 July 1915    Allston, Mass.
4:10.4        Paavo Nurmi      Finland        23 August 1923  Stockholm
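
The pattern above allows exactly one character between the brackets, so it would leave a two-digit footnote such as [12] in place. The sketch below compares it with a slightly more general pattern on a few standalone strings. Only the first string comes from the table above. The other two are made up for illustration, and the second pattern is a suggested variant, not part of the original example.

```r
# First value is from the table above; the other two are made-up test cases.
dates <- c("31 May 1913[5]", "1 January 2000[12]", "2 January 2000")

# Original pattern: "[" + exactly one character + "]" at the end of the string.
single <- gsub("\\[.\\]$", "", dates)

# Variant (an assumption, not from the notes): one or more digits in the brackets.
multi <- gsub("\\[[0-9]+\\]$", "", dates)

print(single)
print(multi)
```

Strings without a trailing footnote pass through both patterns unchanged, so either call is safe to apply to the whole Date column.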

How to Survive in the Land of Regex

Homework

Activity: Street or Road?

Grading

The assignment is worth a total of 10 points.

  • [2 points] Turn in HTML with embedded .Rmd file (e.g. “DataComputing simple” template)
  • [2 points] Work through the “Solved Example” section showing progress with each step
  • Back to the Streets:
    • [2 points] Your Turn #1: explain each line of code in English (either narrative or commented code)
    • [1 point] Your Turn #2: expand to match several more patterns (at least 12 total)
    • [1 point] Your Turn #2: provide a table in descending order of popularity for the street name identifiers you found
    • [2 points] Your Turn #2: use ggplot to construct a bar chart in descending order of popularity for the street name identifiers you found.
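
As a rough illustration of what the last two items ask for, the sketch below counts street name identifiers in a handful of made-up addresses, sorts the counts in descending order, and draws the bar chart in the same order. The addresses, the three-identifier pattern, and the variable names are all hypothetical. The actual activity uses the provided data sets and at least 12 patterns.

```r
# Made-up addresses for illustration only (not from the course data sets).
addresses <- c("12 OAK ST", "401 MAIN ST", "7 ELM RD", "99 PINE AVE",
               "3 BIRCH RD", "58 MAPLE ST", "PO BOX 123")

# Match a whole word ST, RD, or AVE at the end of the address.
pattern <- "\\b(ST|RD|AVE)$"

# regmatches() keeps only the addresses that matched, returning the matched text.
hits <- regmatches(addresses, regexpr(pattern, addresses))

# Tabulate the identifiers and sort from most to least common.
counts <- as.data.frame(table(identifier = hits), responseName = "n",
                        stringsAsFactors = FALSE)
counts <- counts[order(-counts$n), ]
print(counts)

# Bar chart in the same descending order (skipped if ggplot2 is not installed).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(x = reorder(identifier, -n), y = n)) +
    geom_col() +
    labs(x = "Street name identifier", y = "Number of addresses")
  print(p)
}
```

The word boundary "\\b" keeps the pattern from matching the end of a longer word such as WEST, and reorder(identifier, -n) is what puts the bars in descending order instead of alphabetical order.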

Assignment Remarks:

Two data sets are provided. One includes 15,000 street addresses of registered voters in Wake County, North Carolina. The other includes over 900,000 street addresses of Medicare Service Providers. You can use either data set (or both!) for the activity.

Note: There’s nothing to do in the “For the professional…” section at the very end except to be impressed.

