Regular Expressions

Data Computing

November 9, 2016

Agenda

Announcements

Here’s How Much Time is Left…

library(lubridate)  # provides today(), mdy(), ymd()

# Date the script was last refreshed
today()
## [1] "2016-11-09"
# Data Set Review
mdy("11-13-2016") - ymd(today())
## Time difference of 4 days
# Submit Draft for Peer Review
mdy("11-25-2016") - ymd(today())
## Time difference of 16 days
# Total Remaining
mdy("12-09-2016") - ymd(today())
## Time difference of 30 days

Final Projects - Class Feedback

The offer…

Any Takers?

Chapter 16: Key Ideas

Some Exploits in the Land of Regex

Rejecting Footnotes with Regex

library(rvest)  # read_html(), html_nodes(), html_table()
library(dplyr)  # %>% and mutate()

page <- "https://en.wikipedia.org/wiki/Mile_run_world_record_progression"
XPATH <- '//*[@id="mw-content-text"]/table'

table_list <- page %>%
  read_html() %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)

IAAFmen <- table_list[[4]] 
head(IAAFmen, 3)
##     Time Auto         Athlete    Nationality              Date
## 1 4:14.4      John Paul Jones  United States    31 May 1913[5]
## 2 4:12.6         Norman Taber  United States   16 July 1915[5]
## 3 4:10.4          Paavo Nurmi        Finland 23 August 1923[5]
##            Venue
## 1 Allston, Mass.
## 2 Allston, Mass.
## 3      Stockholm

Now we can use mutate() and gsub() to strip the trailing footnote markers (such as [5]) from Date:

IAAFmen %>%
  mutate(Date = gsub("\\[[0-9]+\\]$", "", Date)) %>%
  head(3)
##     Time Auto         Athlete    Nationality           Date          Venue
## 1 4:14.4      John Paul Jones  United States    31 May 1913 Allston, Mass.
## 2 4:12.6         Norman Taber  United States   16 July 1915 Allston, Mass.
## 3 4:10.4          Paavo Nurmi        Finland 23 August 1923      Stockholm
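
To see what the footnote pattern is doing, it may help to try it on a plain character vector first. The sketch below uses only base R. The vector `dates` is made up for illustration (the `[12]` marker and the unmarked entry are not taken from the table), and the pattern uses `[0-9]+` between the escaped brackets so that multi-digit markers are matched as well.

```r
# Illustrative input only: the [12] marker and the last entry are invented
dates <- c("31 May 1913[5]", "16 July 1915[5]", "6 May 1954[12]", "7 July 1999")

# \\[ and \\]  literal square brackets (a bare [ would start a character class)
# [0-9]+      one or more digits
# $           anchor the match to the end of the string
footnote <- "\\[[0-9]+\\]$"

# Which entries carry a trailing footnote marker?
has_footnote <- grepl(footnote, dates)

# Replace the marker with the empty string; entries without one are unchanged
cleaned <- gsub(footnote, "", dates)

print(data.frame(dates, has_footnote, cleaned))
```

Because of the `$` anchor, a bracketed number in the middle of a string would be left alone; only a marker at the very end is removed.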

How to Survive in the Land of Regex

Homework

Activity: Street or Road?

Grading

The assignment is worth a total of 10 points.

Assignment Remarks:

Two data sets are provided. One includes 15,000 street addresses of registered voters in Wake County, North Carolina. The other includes over 900,000 street addresses of Medicare Service Providers. You can use either data set (or both!) for the activity.

Note: There’s nothing to do in the “For the professional…” section at the very end except to be impressed.
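
As a warm-up for the activity, here is a minimal sketch of matching street-type endings with a regular expression. It is not the assigned solution: the `addresses` vector is invented and is not drawn from either data set, and the list of endings in `pattern` is an assumption that the real data will almost certainly outgrow.

```r
# Made-up addresses, for illustration only
addresses <- c("123 Main ST", "45 Oak RD", "PO BOX 1287",
               "9 Elm Street", "77 Hill Road")

# A space, then one of a few street-type words, then end of string.
# The leading space keeps words such as "WEST" from matching "ST".
pattern <- " (ST|STREET|RD|ROAD)$"

# ignore.case = TRUE lets "Street" and "STREET" both match
is_street_like <- grepl(pattern, addresses, ignore.case = TRUE)

# Keep only the final word of each address: the greedy ".* " removes
# everything up to and including the last space
ending <- toupper(sub(".* ", "", addresses))

# Tally the endings among the addresses that matched
print(table(ending[is_street_like]))
```

Looking at the addresses that do not match (here, the PO box) is often the quickest way to decide which endings to add to the pattern next.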