Announcements

Chapters 12 & 15 Fly-by

Chapter 12

  • rank() is pretty useful
  • row_number() is too

Chapter 15

  • There are a ton of ways to get data into R

    • CSV (comma-separated-values) is a really common format
      • several functions read CSV files into R, but they are not all created equal… (DC likes read.file())
      • file.choose() It’s interactive so you can’t use it in .Rmd, but it’s a great way to get file paths
    • Lots of software export to CSV, but R can handle lots of proprietary formats too (see foreign package)
    • R can query relational databases like MS Access, Oracle, SAP, mySQL, etc (see rodbc package and others)
    • Scraping the web (we’ll come back to that)
  • A word about some object classes…

    • data.frame
    • list
    • variable types and coercion
      • factors
      • numeric
      • character
      • POSIXct (see lubridate package)

Scraping Mile Run Records from Wikipedia

Steps

  1. Locate webpage

  2. Identify data table(s) to scrape

  3. Right click on the table you want, choose “Inspect Element”

  4. Roll cursor over the HTML code (even if you don’t understand it) until you see the whole table that you want appear highlighted. Click on the row that highlights the whole table.

  5. Right click the highlighted row >> Copy >> XPath

  6. Edit the R code chunk below to paste the XPath with SINGLE quotes around it, and URL with quotes around it as shown.

  7. Execute the code chunk!

library("rvest")

page_url <- "https://en.wikipedia.org/wiki/Mile_run_world_record_progression"
XPATH <- '//*[@id="mw-content-text"]/table'

table_list <- 
  page_url %>%
  read_html() %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)

XPATH help: Instructions on getting the xpath to an element on a web page (in Chrome).

Scraping Mile Run Records from Wikipedia

Let’s say we want to scrape Mile Run Records from Wikipedia…

Here’s the page: https://en.wikipedia.org/wiki/Mile_run_world_record_progression

Scraping Mile Run Records from Wikipedia

Using our handy template, we replace the page_url & XPATH

page <- "https://en.wikipedia.org/wiki/Mile_run_world_record_progression"
XPATH <- '//*[@id="mw-content-text"]/table'

table_list <- page %>%
  read_html() %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)

str(table_list)
## List of 6
##  $ :'data.frame':    11 obs. of  5 variables:
##   ..$ Time       : chr [1:11] "4:28" "4:28" "4:23" "4:22¼" ...
##   ..$ Athlete    : chr [1:11] "Charles Westhall" "Thomas Horspool" "Thomas Horspool" "Siah Albison" ...
##   ..$ Nationality: chr [1:11] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:11] "26 July 1855" "28 September 1857" "12 July 1858" "27 October 1860" ...
##   ..$ Venue      : chr [1:11] "London" "Manchester" "Manchester" "Manchester" ...
##  $ :'data.frame':    16 obs. of  5 variables:
##   ..$ Time       : chr [1:16] "4:55" "4:49" "4:46" "4:33" ...
##   ..$ Athlete    : chr [1:16] "J. Heaviside" "J. Heaviside" "Matthew Greene" "George Farran" ...
##   ..$ Nationality: chr [1:16] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:16] "1 April 1861" "27 May 1861" "27 May 1861" "23 May 1862" ...
##   ..$ Venue      : chr [1:16] "Dublin" "Dublin" "Dublin" "Dublin" ...
##  $ :'data.frame':    5 obs. of  5 variables:
##   ..$ Time       : chr [1:5] "4:52" "4:45" "4:45" "4:40" ...
##   ..$ Athlete    : chr [1:5] "Cadet Marshall" "Thomas Finch" "St. Vincent Hammick" "Gerald Surman" ...
##   ..$ Nationality: chr [1:5] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:5] "2 September 1852" "3 November 1858" "15 November 1858" "24 November 1859" ...
##   ..$ Venue      : chr [1:5] "Addiscome" "Oxford" "Oxford" "Oxford" ...
##  $ :'data.frame':    32 obs. of  6 variables:
##   ..$ Time       : chr [1:32] "4:14.4" "4:12.6" "4:10.4" "4:09.2" ...
##   ..$ Auto       : chr [1:32] "" "" "" "" ...
##   ..$ Athlete    : chr [1:32] "John Paul Jones" "Norman Taber" "Paavo Nurmi" "Jules Ladoumègue" ...
##   ..$ Nationality: chr [1:32] " United States" " United States" " Finland" " France" ...
##   ..$ Date       : chr [1:32] "31 May 1913[5]" "16 July 1915[5]" "23 August 1923[5]" "4 October 1931[5]" ...
##   ..$ Venue      : chr [1:32] "Allston, Mass." "Allston, Mass." "Stockholm" "Paris" ...
##  $ :'data.frame':    18 obs. of  5 variables:
##   ..$ Time       : chr [1:18] "6:13.2" "5:27.5" "5:24.0" "5:23.0" ...
##   ..$ Athlete    : chr [1:18] "Elizabeth Atkinson" "Ruth Christmas" "Gladys Lunn" "Gladys Lunn" ...
##   ..$ Nationality: chr [1:18] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:18] "24 June 1921" "20 August 1932" "1 June 1936" "18 July 1936" ...
##   ..$ Venue      : chr [1:18] "Manchester" "London" "Brentwood" "London" ...
##  $ :'data.frame':    13 obs. of  6 variables:
##   ..$ Time       : chr [1:13] "4:37.0" "4:36.8" "4:35.3" "4:29.5" ...
##   ..$ Auto       : chr [1:13] "" "" "" "" ...
##   ..$ Athlete    : chr [1:13] "Anne Smith" "Maria Gommers" "Ellen Tittel" "Paola Pigni" ...
##   ..$ Nationality: chr [1:13] " United Kingdom" " Netherlands" " West Germany" " Italy" ...
##   ..$ Date       : chr [1:13] "3 June 1967[6]" "14 June 1969[6]" "20 August 1971[6]" "8 August 1973[6]" ...
##   ..$ Venue      : chr [1:13] "London" "Leicester" "Sittard" "Viareggio" ...

Scraping Mile Run Records from Wikipedia

# Look at the structure (look for how many tables are in the list; verify they are "data.frame" format)
str(table_list)
## List of 6
##  $ :'data.frame':    11 obs. of  5 variables:
##   ..$ Time       : chr [1:11] "4:28" "4:28" "4:23" "4:22¼" ...
##   ..$ Athlete    : chr [1:11] "Charles Westhall" "Thomas Horspool" "Thomas Horspool" "Siah Albison" ...
##   ..$ Nationality: chr [1:11] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:11] "26 July 1855" "28 September 1857" "12 July 1858" "27 October 1860" ...
##   ..$ Venue      : chr [1:11] "London" "Manchester" "Manchester" "Manchester" ...
##  $ :'data.frame':    16 obs. of  5 variables:
##   ..$ Time       : chr [1:16] "4:55" "4:49" "4:46" "4:33" ...
##   ..$ Athlete    : chr [1:16] "J. Heaviside" "J. Heaviside" "Matthew Greene" "George Farran" ...
##   ..$ Nationality: chr [1:16] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:16] "1 April 1861" "27 May 1861" "27 May 1861" "23 May 1862" ...
##   ..$ Venue      : chr [1:16] "Dublin" "Dublin" "Dublin" "Dublin" ...
##  $ :'data.frame':    5 obs. of  5 variables:
##   ..$ Time       : chr [1:5] "4:52" "4:45" "4:45" "4:40" ...
##   ..$ Athlete    : chr [1:5] "Cadet Marshall" "Thomas Finch" "St. Vincent Hammick" "Gerald Surman" ...
##   ..$ Nationality: chr [1:5] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:5] "2 September 1852" "3 November 1858" "15 November 1858" "24 November 1859" ...
##   ..$ Venue      : chr [1:5] "Addiscome" "Oxford" "Oxford" "Oxford" ...
##  $ :'data.frame':    32 obs. of  6 variables:
##   ..$ Time       : chr [1:32] "4:14.4" "4:12.6" "4:10.4" "4:09.2" ...
##   ..$ Auto       : chr [1:32] "" "" "" "" ...
##   ..$ Athlete    : chr [1:32] "John Paul Jones" "Norman Taber" "Paavo Nurmi" "Jules Ladoumègue" ...
##   ..$ Nationality: chr [1:32] " United States" " United States" " Finland" " France" ...
##   ..$ Date       : chr [1:32] "31 May 1913[5]" "16 July 1915[5]" "23 August 1923[5]" "4 October 1931[5]" ...
##   ..$ Venue      : chr [1:32] "Allston, Mass." "Allston, Mass." "Stockholm" "Paris" ...
##  $ :'data.frame':    18 obs. of  5 variables:
##   ..$ Time       : chr [1:18] "6:13.2" "5:27.5" "5:24.0" "5:23.0" ...
##   ..$ Athlete    : chr [1:18] "Elizabeth Atkinson" "Ruth Christmas" "Gladys Lunn" "Gladys Lunn" ...
##   ..$ Nationality: chr [1:18] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:18] "24 June 1921" "20 August 1932" "1 June 1936" "18 July 1936" ...
##   ..$ Venue      : chr [1:18] "Manchester" "London" "Brentwood" "London" ...
##  $ :'data.frame':    13 obs. of  6 variables:
##   ..$ Time       : chr [1:13] "4:37.0" "4:36.8" "4:35.3" "4:29.5" ...
##   ..$ Auto       : chr [1:13] "" "" "" "" ...
##   ..$ Athlete    : chr [1:13] "Anne Smith" "Maria Gommers" "Ellen Tittel" "Paola Pigni" ...
##   ..$ Nationality: chr [1:13] " United Kingdom" " Netherlands" " West Germany" " Italy" ...
##   ..$ Date       : chr [1:13] "3 June 1967[6]" "14 June 1969[6]" "20 August 1971[6]" "8 August 1973[6]" ...
##   ..$ Venue      : chr [1:13] "London" "Leicester" "Sittard" "Viareggio" ...
# Inspect the first table in the list (IAAF Men from the Wikipedia Page)
IAAFtimes <- table_list[[4]]
tail(IAAFtimes)
Time Auto Athlete Nationality Date Venue
27 3:48.53 Sebastian Coe United Kingdom 19 August 1981[5] Zürich
28 3:48.40 Steve Ovett United Kingdom 26 August 1981[5] Koblenz
29 3:47.33 Sebastian Coe United Kingdom 28 August 1981[5] Brussels
30 3:46.32 Steve Cram United Kingdom 27 July 1985[5] Oslo
31 3:44.39 Noureddine Morceli Algeria 5 September 1993[5] Rieti
32 3:43.13 Hicham El Guerrouj Morocco 7 July 1999[5] Rome
source
## function (file, local = FALSE, echo = verbose, print.eval = echo, 
##     verbose = getOption("verbose"), prompt.echo = getOption("prompt"), 
##     max.deparse.length = 150, chdir = FALSE, encoding = getOption("encoding"), 
##     continue.echo = getOption("continue"), skip.echo = 0, keep.source = getOption("keep.source")) 
## {
##     envir <- if (isTRUE(local)) {
##         parent.frame()
##     }
##     else if (identical(local, FALSE)) {
##         .GlobalEnv
##     }
##     else if (is.environment(local)) {
##         local
##     }
##     else stop("'local' must be TRUE, FALSE or an environment")
##     have_encoding <- !missing(encoding) && encoding != "unknown"
##     if (!missing(echo)) {
##         if (!is.logical(echo)) 
##             stop("'echo' must be logical")
##         if (!echo && verbose) {
##             warning("'verbose' is TRUE, 'echo' not; ... coercing 'echo <- TRUE'")
##             echo <- TRUE
##         }
##     }
##     if (verbose) {
##         cat("'envir' chosen:")
##         print(envir)
##     }
##     ofile <- file
##     from_file <- FALSE
##     srcfile <- NULL
##     if (is.character(file)) {
##         if (identical(encoding, "unknown")) {
##             enc <- utils::localeToCharset()
##             encoding <- enc[length(enc)]
##         }
##         else enc <- encoding
##         if (length(enc) > 1L) {
##             encoding <- NA
##             owarn <- options(warn = 2)
##             for (e in enc) {
##                 if (is.na(e)) 
##                   next
##                 zz <- file(file, encoding = e)
##                 res <- tryCatch(readLines(zz, warn = FALSE), 
##                   error = identity)
##                 close(zz)
##                 if (!inherits(res, "error")) {
##                   encoding <- e
##                   break
##                 }
##             }
##             options(owarn)
##         }
##         if (is.na(encoding)) 
##             stop("unable to find a plausible encoding")
##         if (verbose) 
##             cat(gettextf("encoding = \"%s\" chosen", encoding), 
##                 "\n", sep = "")
##         if (file == "") {
##             file <- stdin()
##             srcfile <- "<stdin>"
##         }
##         else {
##             filename <- file
##             file <- file(filename, "r", encoding = encoding)
##             on.exit(close(file))
##             if (isTRUE(keep.source)) {
##                 lines <- readLines(file, warn = FALSE)
##                 on.exit()
##                 close(file)
##                 srcfile <- srcfilecopy(filename, lines, file.mtime(filename)[1], 
##                   isFile = TRUE)
##             }
##             else {
##                 from_file <- TRUE
##                 srcfile <- filename
##             }
##             loc <- utils::localeToCharset()[1L]
##             encoding <- if (have_encoding) 
##                 switch(loc, `UTF-8` = "UTF-8", `ISO8859-1` = "latin1", 
##                   "unknown")
##             else "unknown"
##         }
##     }
##     else {
##         lines <- readLines(file, warn = FALSE)
##         if (isTRUE(keep.source)) 
##             srcfile <- srcfilecopy(deparse(substitute(file)), 
##                 lines)
##         else srcfile <- deparse(substitute(file))
##     }
##     exprs <- if (!from_file) {
##         if (length(lines)) 
##             .Internal(parse(stdin(), n = -1, lines, "?", srcfile, 
##                 encoding))
##         else expression()
##     }
##     else .Internal(parse(file, n = -1, NULL, "?", srcfile, encoding))
##     on.exit()
##     if (from_file) 
##         close(file)
##     Ne <- length(exprs)
##     if (verbose) 
##         cat("--> parsed", Ne, "expressions; now eval(.)ing them:\n")
##     if (chdir) {
##         if (is.character(ofile)) {
##             if (grepl("^(ftp|http|file)://", ofile)) 
##                 warning("'chdir = TRUE' makes no sense for a URL")
##             else if ((path <- dirname(ofile)) != ".") {
##                 owd <- getwd()
##                 if (is.null(owd)) 
##                   stop("cannot 'chdir' as current directory is unknown")
##                 on.exit(setwd(owd), add = TRUE)
##                 setwd(path)
##             }
##         }
##         else {
##             warning("'chdir = TRUE' makes no sense for a connection")
##         }
##     }
##     if (echo) {
##         sd <- "\""
##         nos <- "[^\"]*"
##         oddsd <- paste0("^", nos, sd, "(", nos, sd, nos, sd, 
##             ")*", nos, "$")
##         trySrcLines <- function(srcfile, showfrom, showto) {
##             lines <- tryCatch(suppressWarnings(getSrcLines(srcfile, 
##                 showfrom, showto)), error = function(e) e)
##             if (inherits(lines, "error")) 
##                 character()
##             else lines
##         }
##     }
##     yy <- NULL
##     lastshown <- 0
##     srcrefs <- attr(exprs, "srcref")
##     for (i in seq_len(Ne + echo)) {
##         tail <- i > Ne
##         if (!tail) {
##             if (verbose) 
##                 cat("\n>>>> eval(expression_nr.", i, ")\n\t\t =================\n")
##             ei <- exprs[i]
##         }
##         if (echo) {
##             nd <- 0
##             srcref <- if (tail) 
##                 attr(exprs, "wholeSrcref")
##             else if (i <= length(srcrefs)) 
##                 srcrefs[[i]]
##             if (!is.null(srcref)) {
##                 if (i == 1) 
##                   lastshown <- min(skip.echo, srcref[3L] - 1)
##                 if (lastshown < srcref[3L]) {
##                   srcfile <- attr(srcref, "srcfile")
##                   dep <- trySrcLines(srcfile, lastshown + 1, 
##                     srcref[3L])
##                   if (length(dep)) {
##                     leading <- if (tail) 
##                       length(dep)
##                     else srcref[1L] - lastshown
##                     lastshown <- srcref[3L]
##                     while (length(dep) && grepl("^[[:blank:]]*$", 
##                       dep[1L])) {
##                       dep <- dep[-1L]
##                       leading <- leading - 1L
##                     }
##                     dep <- paste0(rep.int(c(prompt.echo, continue.echo), 
##                       c(leading, length(dep) - leading)), dep, 
##                       collapse = "\n")
##                     nd <- nchar(dep, "c")
##                   }
##                   else srcref <- NULL
##                 }
##             }
##             if (is.null(srcref)) {
##                 if (!tail) {
##                   dep <- substr(paste(deparse(ei, control = "showAttributes"), 
##                     collapse = "\n"), 12L, 1000000L)
##                   dep <- paste0(prompt.echo, gsub("\n", paste0("\n", 
##                     continue.echo), dep))
##                   nd <- nchar(dep, "c") - 1L
##                 }
##             }
##             if (nd) {
##                 do.trunc <- nd > max.deparse.length
##                 dep <- substr(dep, 1L, if (do.trunc) 
##                   max.deparse.length
##                 else nd)
##                 cat("\n", dep, if (do.trunc) 
##                   paste(if (grepl(sd, dep) && grepl(oddsd, dep)) 
##                     " ...\" ..."
##                   else " ....", "[TRUNCATED] "), "\n", sep = "")
##             }
##         }
##         if (!tail) {
##             yy <- withVisible(eval(ei, envir))
##             i.symbol <- mode(ei[[1L]]) == "name"
##             if (!i.symbol) {
##                 curr.fun <- ei[[1L]][[1L]]
##                 if (verbose) {
##                   cat("curr.fun:")
##                   utils::str(curr.fun)
##                 }
##             }
##             if (verbose >= 2) {
##                 cat(".... mode(ei[[1L]])=", mode(ei[[1L]]), "; paste(curr.fun)=")
##                 utils::str(paste(curr.fun))
##             }
##             if (print.eval && yy$visible) {
##                 if (isS4(yy$value)) 
##                   methods::show(yy$value)
##                 else print(yy$value)
##             }
##             if (verbose) 
##                 cat(" .. after ", sQuote(deparse(ei, control = c("showAttributes", 
##                   "useSource"))), "\n", sep = "")
##         }
##     }
##     invisible(yy)
## }
## <bytecode: 0x7f8c89773590>
## <environment: namespace:base>

Penn State Football Receiving Statistics

  1. Google Penn State Football Statistics: http://bfy.tw/88gl

  2. Identify a data table to scrape (for example, “receiving statistics”)

  3. Right click on the table you want, choose “Inspect Element”

  4. Roll cursor over the HTML code (even if you don’t understand it) until you see the whole table that you want appear highlighted. Click on the row that highlights the whole table.

  5. Right click the highlighted row >> Copy >> XPath

  6. Edit the R code chunk below to paste the XPath with SINGLE quotes around it, and URL with quotes around it as shown.

  7. Execute the code chunk!

library("rvest")
url <- "http://www.espn.com/college-football/team/stats/_/id/213/penn-state-nittany-lions"
XPATH <- '//*[@id="my-players-table"]/div[4]/div/table'

Table1 <- url %>%
  html() %>%
  html_nodes(xpath='XPATH') %>%
  html_table()
Table1[[1]]

Penn State Football Receiving Statistics

url <- "http://www.espn.com/college-football/team/stats/_/id/213/penn-state-nittany-lions"
XPATH <- '//*[@id="my-players-table"]/div[4]/div/table'

Table2 <- url %>%
  read_html(header = TRUE) %>%
  html_nodes(xpath=XPATH) %>%
  html_table()
# R stores the result as a "list" object, so the double square brackets select the first 
#    element of the list, and we store it at a data table called FootballStatsRaw

FootballStatsRaw <- Table2[[1]]

# Inspect the Data Table
FootballStatsRaw
X1 X2 X3 X4 X5 X6
Receiving Statistics Receiving Statistics Receiving Statistics Receiving Statistics Receiving Statistics Receiving Statistics
NAME REC YDS AVG LONG TD
Chris Godwin 25 364 14.6 52 (TD) 3
DeAndre Thompkins 18 328 18.2 70 (TD) 1
Mike Gesicki 27 323 12.0 53 2
DaeSean Hamilton 19 238 12.5 45 1
Saquon Barkley 11 143 13.0 40 (TD) 1
Irvin Charles 1 80 80.0 80 (TD) 1
Saeed Blacknall 3 59 19.7 35 0
Juwan Johnson 1 27 27.0 27 0
Brandon Polk 2 18 9.0 14 0
Mark Allen 2 5 2.5 4 0
Miles Sanders 1 3 3.0 3 0
Andre Robinson 1 2 2.0 2 0
Totals 111 1590 14.3 80 9
# Tidy up the data table & rename variables
FootballStatsClean <- 
  FootballStatsRaw %>%  
  rename(name = X1, receptions = X2, total_yds = X3, avg_yds = X4, longest = X5, touchdowns = X6) %>%
  filter(row_number() > 2, name != "Totals")   
  
# Inspect FootballStatsClean
FootballStatsClean
name receptions total_yds avg_yds longest touchdowns
Chris Godwin 25 364 14.6 52 (TD) 3
DeAndre Thompkins 18 328 18.2 70 (TD) 1
Mike Gesicki 27 323 12.0 53 2
DaeSean Hamilton 19 238 12.5 45 1
Saquon Barkley 11 143 13.0 40 (TD) 1
Irvin Charles 1 80 80.0 80 (TD) 1
Saeed Blacknall 3 59 19.7 35 0
Juwan Johnson 1 27 27.0 27 0
Brandon Polk 2 18 9.0 14 0
Mark Allen 2 5 2.5 4 0
Miles Sanders 1 3 3.0 3 0
Andre Robinson 1 2 2.0 2 0

Homework

Activity: Scraping Nuclear Reactors

Grading

Assignment is worth a total of 10 points.

  • [2 points] Turn in HTML with embedded .Rmd file (e.g. “DataComputing simple” template)
  • [1 points] Successfully scrape raw data for Japan Reactors from Wikipedia
  • [2 points] Your Turn #1: Tidy Data & Data Cleaning (Japan)
  • [1 points] Your Turn #2: Plot Net Generation Capacity vs Construction Date
  • [2 points] Your Turn #3: Scrape & Clean China Data (then Merge with Japan)
  • [2 points] Your Turn #5: Reconstruct Info Graphic of Japan Reactors (or other country of interest)

Assignment Sections & Hints:

  • Scrape Japan Data from Wikipedia
    • Hint: if you’re having trouble with “subscript out of bounds”, just get the xpath for the Japan table directly, rather than reading all tables at once
  • Your Turn #1: Tidy Data & Data Cleaning
  • Your Turn #2: Plot Net Generation Capacity vs Construction Date
  • [SKIP] Your Turn #3: Scrape & Clean China Data (then Merge with Japan)
    • Hint: use mutate() to create the new variable “country”
    • Hint: use either cbind() or rbind() to combine the data tables together (be careful to match column order)
  • [SKIP] Your Turn #4: Scrape USA Reactors
  • Your Turn #5: Reconstruct Info Graphic of Japan Reactors
    • Hint: use paste() function to combine the name and reactor variables to create an informative unique case ID

teaching | stat 184 home | syllabus | piazza | canvas