Data Scraping

Data Computing

Oct 26, 2016

Announcements

Chapters 12 & 15 Fly-by

Chapter 12

Chapter 15

Scraping Mile Run Records from Wikipedia

Steps

  1. Locate webpage

  2. Identify data table(s) to scrape

  3. Right click on the table you want, choose “Inspect Element”

  4. Roll cursor over the HTML code (even if you don’t understand it) until you see the whole table that you want appear highlighted. Click on the row that highlights the whole table.

  5. Right click the highlighted row >> Copy >> XPath

  6. Edit the R code chunk below: paste the XPath with SINGLE quotes around it (the XPath itself contains double quotes), and put the URL in quotes as shown.

  7. Execute the code chunk!

library("rvest")

page_url <- "https://en.wikipedia.org/wiki/Mile_run_world_record_progression"
XPATH <- '//*[@id="mw-content-text"]/table'

table_list <- 
  page_url %>%
  read_html() %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)

XPATH help: Instructions on getting the xpath to an element on a web page (in Chrome).
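
The template can be tried offline before pointing it at a live page. The sketch below is not from the course materials: it runs the same read_html(), html_nodes(), html_table() pipeline on a small, made-up HTML string (the "records" id and the table contents are hypothetical), so each step can be checked without depending on a website's current layout.

```r
library("rvest")

# Hypothetical page: a tiny HTML document held in a string, not a real site
sample_html <- '<html><body>
  <div id="records">
    <table>
      <tr><th>Athlete</th><th>Venue</th></tr>
      <tr><td>Steve Cram</td><td>Oslo</td></tr>
      <tr><td>Hicham El Guerrouj</td><td>Rome</td></tr>
    </table>
  </div>
</body></html>'

# The XPath contains double quotes, so it is wrapped in single quotes
XPATH <- '//*[@id="records"]/table'

sample_tables <-
  sample_html %>%
  read_html() %>%                 # parse the HTML
  html_nodes(xpath = XPATH) %>%   # keep only the nodes matching the XPath
  html_table()                    # convert each matched table to a data frame

# html_table() returns a list with one element per matched table
length(sample_tables)
sample_tables[[1]]
```

If the list comes back with length zero, the XPath probably did not match anything, which is the first thing to check when a scrape returns nothing.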

Scraping Mile Run Records from Wikipedia

Let’s say we want to scrape Mile Run Records from Wikipedia…

Here’s the page: https://en.wikipedia.org/wiki/Mile_run_world_record_progression

Scraping Mile Run Records from Wikipedia

Using our handy template, we replace the page_url & XPATH

page_url <- "https://en.wikipedia.org/wiki/Mile_run_world_record_progression"
XPATH <- '//*[@id="mw-content-text"]/table'

table_list <- 
  page_url %>%
  read_html() %>%
  html_nodes(xpath = XPATH) %>%
  html_table(fill = TRUE)
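
When an XPath matches several tables, a quick summary of each table's size can be easier to scan than a full structure printout. This is a minimal base-R sketch, assuming a list of data frames like the one html_table() produces; demo_list is a made-up stand-in so the code runs without a network connection.

```r
# Hypothetical stand-in for a scraped list of tables
demo_list <- list(
  data.frame(Time = c("4:28", "4:23"), Athlete = c("A", "B")),
  data.frame(Time = "4:14.4", Auto = "", Athlete = "C")
)

# Number of tables found
length(demo_list)

# Rows and columns of each table, one row of the summary per table
table_dims <- t(sapply(demo_list, dim))
colnames(table_dims) <- c("rows", "cols")
table_dims

# Which tables have a column named "Auto"?
has_auto <- sapply(demo_list, function(tbl) "Auto" %in% names(tbl))
which(has_auto)
```

Checking for a distinguishing column name is one way to pick out the table you want without counting positions by eye.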

Scraping Mile Run Records from Wikipedia

# Look at the structure (look for how many tables are in the list; verify they are "data.frame" format)
str(table_list)
## List of 6
##  $ :'data.frame':    11 obs. of  5 variables:
##   ..$ Time       : chr [1:11] "4:28" "4:28" "4:23" "4:22¼" ...
##   ..$ Athlete    : chr [1:11] "Charles Westhall" "Thomas Horspool" "Thomas Horspool" "Siah Albison" ...
##   ..$ Nationality: chr [1:11] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:11] "26 July 1855" "28 September 1857" "12 July 1858" "27 October 1860" ...
##   ..$ Venue      : chr [1:11] "London" "Manchester" "Manchester" "Manchester" ...
##  $ :'data.frame':    16 obs. of  5 variables:
##   ..$ Time       : chr [1:16] "4:55" "4:49" "4:46" "4:33" ...
##   ..$ Athlete    : chr [1:16] "J. Heaviside" "J. Heaviside" "Matthew Greene" "George Farran" ...
##   ..$ Nationality: chr [1:16] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:16] "1 April 1861" "27 May 1861" "27 May 1861" "23 May 1862" ...
##   ..$ Venue      : chr [1:16] "Dublin" "Dublin" "Dublin" "Dublin" ...
##  $ :'data.frame':    5 obs. of  5 variables:
##   ..$ Time       : chr [1:5] "4:52" "4:45" "4:45" "4:40" ...
##   ..$ Athlete    : chr [1:5] "Cadet Marshall" "Thomas Finch" "St. Vincent Hammick" "Gerald Surman" ...
##   ..$ Nationality: chr [1:5] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:5] "2 September 1852" "3 November 1858" "15 November 1858" "24 November 1859" ...
##   ..$ Venue      : chr [1:5] "Addiscome" "Oxford" "Oxford" "Oxford" ...
##  $ :'data.frame':    32 obs. of  6 variables:
##   ..$ Time       : chr [1:32] "4:14.4" "4:12.6" "4:10.4" "4:09.2" ...
##   ..$ Auto       : chr [1:32] "" "" "" "" ...
##   ..$ Athlete    : chr [1:32] "John Paul Jones" "Norman Taber" "Paavo Nurmi" "Jules Ladoumègue" ...
##   ..$ Nationality: chr [1:32] " United States" " United States" " Finland" " France" ...
##   ..$ Date       : chr [1:32] "31 May 1913[5]" "16 July 1915[5]" "23 August 1923[5]" "4 October 1931[5]" ...
##   ..$ Venue      : chr [1:32] "Allston, Mass." "Allston, Mass." "Stockholm" "Paris" ...
##  $ :'data.frame':    18 obs. of  5 variables:
##   ..$ Time       : chr [1:18] "6:13.2" "5:27.5" "5:24.0" "5:23.0" ...
##   ..$ Athlete    : chr [1:18] "Elizabeth Atkinson" "Ruth Christmas" "Gladys Lunn" "Gladys Lunn" ...
##   ..$ Nationality: chr [1:18] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
##   ..$ Date       : chr [1:18] "24 June 1921" "20 August 1932" "1 June 1936" "18 July 1936" ...
##   ..$ Venue      : chr [1:18] "Manchester" "London" "Brentwood" "London" ...
##  $ :'data.frame':    13 obs. of  6 variables:
##   ..$ Time       : chr [1:13] "4:37.0" "4:36.8" "4:35.3" "4:29.5" ...
##   ..$ Auto       : chr [1:13] "" "" "" "" ...
##   ..$ Athlete    : chr [1:13] "Anne Smith" "Maria Gommers" "Ellen Tittel" "Paola Pigni" ...
##   ..$ Nationality: chr [1:13] " United Kingdom" " Netherlands" " West Germany" " Italy" ...
##   ..$ Date       : chr [1:13] "3 June 1967[6]" "14 June 1969[6]" "20 August 1971[6]" "8 August 1973[6]" ...
##   ..$ Venue      : chr [1:13] "London" "Leicester" "Sittard" "Viareggio" ...
# Inspect the fourth table in the list (IAAF Men from the Wikipedia Page)
IAAFtimes <- table_list[[4]]
tail(IAAFtimes)
##       Time Auto            Athlete     Nationality                Date
## 27 3:48.53           Sebastian Coe  United Kingdom   19 August 1981[5]
## 28 3:48.40             Steve Ovett  United Kingdom   26 August 1981[5]
## 29 3:47.33           Sebastian Coe  United Kingdom   28 August 1981[5]
## 30 3:46.32              Steve Cram  United Kingdom     27 July 1985[5]
## 31 3:44.39      Noureddine Morceli         Algeria 5 September 1993[5]
## 32 3:43.13      Hicham El Guerrouj         Morocco      7 July 1999[5]
##       Venue
## 27   Zürich
## 28  Koblenz
## 29 Brussels
## 30     Oslo
## 31    Rieti
## 32     Rome
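
The scraped columns are all character strings, so the times cannot be plotted or compared numerically yet, and the dates carry footnote markers such as "[5]". The following is a sketch of one possible clean-up in base R, not part of the original slides. IAAFsample is a small hand-typed copy of three rows from the output above, and time_to_seconds() is a hypothetical helper that assumes times in "m:ss.ss" form (it would not handle older entries such as "4:22¼").

```r
# Hand-typed sample copied from the tail of the scraped IAAF table
IAAFsample <- data.frame(
  Time    = c("3:46.32", "3:44.39", "3:43.13"),
  Athlete = c("Steve Cram", "Noureddine Morceli", "Hicham El Guerrouj"),
  Date    = c("27 July 1985[5]", "5 September 1993[5]", "7 July 1999[5]"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "m:ss.ss" strings to total seconds
time_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts,
         function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

IAAFsample$Seconds <- time_to_seconds(IAAFsample$Time)

# Remove footnote markers such as "[5]" from the dates
IAAFsample$Date <- gsub("\\[[0-9]+\\]", "", IAAFsample$Date)

# Keep the year (the text after the last space) as an integer
IAAFsample$Year <- as.integer(sub(".* ", "", IAAFsample$Date))

IAAFsample
```

With Seconds and Year as numbers, the record progression can be plotted or summarised directly.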

Penn State Football Receiving Statistics

  1. Google Penn State Football Statistics: http://bfy.tw/88gl

  2. Identify a data table to scrape (for example, “receiving statistics”)

  3. Right click on the table you want, choose “Inspect Element”

  4. Roll cursor over the HTML code (even if you don’t understand it) until you see the whole table that you want appear highlighted. Click on the row that highlights the whole table.

  5. Right click the highlighted row >> Copy >> XPath

  6. Edit the R code chunk below: paste the XPath with SINGLE quotes around it (the XPath itself contains double quotes), and put the URL in quotes as shown.

  7. Execute the code chunk!

library("rvest")
url <- "http://www.espn.com/college-football/team/stats/_/id/213/penn-state-nittany-lions"
XPATH <- '//*[@id="my-players-table"]/div[4]/div/table'

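Table1 <- url %>%
  read_html() %>%                 # read_html() replaces the deprecated html()
  html_nodes(xpath = XPATH) %>%   # pass the XPATH variable, not the string 'XPATH'
  html_table()
Table1[[1]]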
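
A common slip when adapting this template is to put quotes around the variable name, as in xpath = 'XPATH'. R then searches for the literal text XPATH instead of the stored expression. The sketch below illustrates the difference on a made-up one-table page (the "stats" id and "Player A" row are hypothetical), so it runs without contacting a real site.

```r
library("rvest")

# Hypothetical one-table page held in a string
demo_page <- read_html('<html><body><table id="stats">
  <tr><th>NAME</th><th>REC</th></tr>
  <tr><td>Player A</td><td>5</td></tr>
</table></body></html>')

XPATH <- '//*[@id="stats"]'

# Correct: pass the variable, so its contents are used as the XPath
good <- demo_page %>% html_nodes(xpath = XPATH) %>% html_table()
length(good)

# Slip: 'XPATH' in quotes is the literal text XPATH, which matches no nodes
bad <- demo_page %>% html_nodes(xpath = 'XPATH') %>% html_table()
length(bad)
```

When the result is an empty list, a later [[1]] stops with a "subscript out of bounds" error, so that error message is a hint to re-check the xpath argument.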

Penn State Football Receiving Statistics

url <- "http://www.espn.com/college-football/team/stats/_/id/213/penn-state-nittany-lions"
XPATH <- '//*[@id="my-players-table"]/div[4]/div/table'

Table2 <- url %>%
  read_html() %>%
  html_nodes(xpath = XPATH) %>%
  html_table()

# html_table() returns a "list" object, so the double square brackets select the first
#    element of the list, and we store it as a data table called FootballStatsRaw
FootballStatsRaw <- Table2[[1]]

# Inspect the Data Table
FootballStatsRaw
##                      X1                   X2                   X3
## 1  Receiving Statistics Receiving Statistics Receiving Statistics
## 2                  NAME                  REC                  YDS
## 3          Chris Godwin                   25                  364
## 4     DeAndre Thompkins                   18                  328
## 5          Mike Gesicki                   27                  323
## 6      DaeSean Hamilton                   19                  238
## 7        Saquon Barkley                   11                  143
## 8         Irvin Charles                    1                   80
## 9       Saeed Blacknall                    3                   59
## 10        Juwan Johnson                    1                   27
## 11         Brandon Polk                    2                   18
## 12           Mark Allen                    2                    5
## 13        Miles Sanders                    1                    3
## 14       Andre Robinson                    1                    2
## 15               Totals                  111                 1590
##                      X4                   X5                   X6
## 1  Receiving Statistics Receiving Statistics Receiving Statistics
## 2                   AVG                 LONG                   TD
## 3                  14.6              52 (TD)                    3
## 4                  18.2              70 (TD)                    1
## 5                  12.0                   53                    2
## 6                  12.5                   45                    1
## 7                  13.0              40 (TD)                    1
## 8                  80.0              80 (TD)                    1
## 9                  19.7                   35                    0
## 10                 27.0                   27                    0
## 11                  9.0                   14                    0
## 12                  2.5                    4                    0
## 13                  3.0                    3                    0
## 14                  2.0                    2                    0
## 15                 14.3                   80                    9
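library("dplyr")   # provides rename(), filter(), and row_number()

# Tidy up the data table & rename variables
# (the first two rows are header rows that were read in as data)
FootballStatsClean <- 
  FootballStatsRaw %>%  
  rename(name = X1, receptions = X2, total_yds = X3, avg_yds = X4, longest = X5, touchdowns = X6) %>%
  filter(row_number() > 2, name != "Totals")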
  
# Inspect FootballStatsClean
FootballStatsClean
##                 name receptions total_yds avg_yds longest touchdowns
## 1       Chris Godwin         25       364    14.6 52 (TD)          3
## 2  DeAndre Thompkins         18       328    18.2 70 (TD)          1
## 3       Mike Gesicki         27       323    12.0      53          2
## 4   DaeSean Hamilton         19       238    12.5      45          1
## 5     Saquon Barkley         11       143    13.0 40 (TD)          1
## 6      Irvin Charles          1        80    80.0 80 (TD)          1
## 7    Saeed Blacknall          3        59    19.7      35          0
## 8      Juwan Johnson          1        27    27.0      27          0
## 9       Brandon Polk          2        18     9.0      14          0
## 10        Mark Allen          2         5     2.5       4          0
## 11     Miles Sanders          1         3     3.0       3          0
## 12    Andre Robinson          1         2     2.0       2          0
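
Because the header rows were read in as data, every column in the cleaned table is still stored as text. The sketch below shows one way to finish the job in base R; it is an illustration, not part of the original slides. The data frame stats is a hand-typed two-row sample with the same column names as FootballStatsClean, and longest_was_td is a hypothetical new column recording whether the longest reception ended in a touchdown.

```r
# Hand-typed sample with the same columns as FootballStatsClean
stats <- data.frame(
  name       = c("Chris Godwin", "Mike Gesicki"),
  receptions = c("25", "27"),
  total_yds  = c("364", "323"),
  avg_yds    = c("14.6", "12.0"),
  longest    = c("52 (TD)", "53"),
  touchdowns = c("3", "2"),
  stringsAsFactors = FALSE
)

# Convert the purely numeric columns from character to numeric
num_cols <- c("receptions", "total_yds", "avg_yds", "touchdowns")
stats[num_cols] <- lapply(stats[num_cols], as.numeric)

# Record the "(TD)" flag first, then strip it and convert the yardage
stats$longest_was_td <- grepl("(TD)", stats$longest, fixed = TRUE)
stats$longest <- as.numeric(sub(" \\(TD\\)", "", stats$longest))

str(stats)
```

Saving the flag before stripping it keeps the information that "(TD)" carried while still leaving longest usable as a number.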

Homework

Activity: Scraping Nuclear Reactors

Grading

The assignment is worth a total of 10 points.

Assignment Sections & Hints: