rank()
is pretty usefulrow_number()
is tooThere are a ton of ways to get data into R
read.file()
)file.choose()
It’s interactive so you can’t use it in .Rmd, but it’s a great way to get file pathsforeign
package)rodbc
package and others)A word about some object classes…
lubridate
package)Locate webpage
Identify data table(s) to scrape
Right click on the table you want, choose “Inspect Element”
Roll cursor over the HTML code (even if you don’t understand it) until you see the whole table that you want appear highlighted. Click on the row that highlights the whole table.
Right click the highlighted row >> Copy >> XPath
Edit the R code chunk below to paste the XPath with SINGLE quotes around it, and URL with quotes around it as shown.
Execute the code chunk!
library("rvest")
page_url <- "https://en.wikipedia.org/wiki/Mile_run_world_record_progression"
XPATH <- '//*[@id="mw-content-text"]/table'
table_list <-
page_url %>%
read_html() %>%
html_nodes(xpath = XPATH) %>%
html_table(fill = TRUE)
XPATH help: Instructions on getting the xpath to an element on a web page (in Chrome).
Let’s say we want to scrape Mile Run Records from Wikipedia…
Here’s the page: https://en.wikipedia.org/wiki/Mile_run_world_record_progression
Using our handy template, we replace the page_url & XPATH
page <- "https://en.wikipedia.org/wiki/Mile_run_world_record_progression"
XPATH <- '//*[@id="mw-content-text"]/table'
table_list <- page %>%
read_html() %>%
html_nodes(xpath = XPATH) %>%
html_table(fill = TRUE)
str(table_list)
## List of 6
## $ :'data.frame': 11 obs. of 5 variables:
## ..$ Time : chr [1:11] "4:28" "4:28" "4:23" "4:22¼" ...
## ..$ Athlete : chr [1:11] "Charles Westhall" "Thomas Horspool" "Thomas Horspool" "Siah Albison" ...
## ..$ Nationality: chr [1:11] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
## ..$ Date : chr [1:11] "26 July 1855" "28 September 1857" "12 July 1858" "27 October 1860" ...
## ..$ Venue : chr [1:11] "London" "Manchester" "Manchester" "Manchester" ...
## $ :'data.frame': 16 obs. of 5 variables:
## ..$ Time : chr [1:16] "4:55" "4:49" "4:46" "4:33" ...
## ..$ Athlete : chr [1:16] "J. Heaviside" "J. Heaviside" "Matthew Greene" "George Farran" ...
## ..$ Nationality: chr [1:16] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
## ..$ Date : chr [1:16] "1 April 1861" "27 May 1861" "27 May 1861" "23 May 1862" ...
## ..$ Venue : chr [1:16] "Dublin" "Dublin" "Dublin" "Dublin" ...
## $ :'data.frame': 5 obs. of 5 variables:
## ..$ Time : chr [1:5] "4:52" "4:45" "4:45" "4:40" ...
## ..$ Athlete : chr [1:5] "Cadet Marshall" "Thomas Finch" "St. Vincent Hammick" "Gerald Surman" ...
## ..$ Nationality: chr [1:5] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
## ..$ Date : chr [1:5] "2 September 1852" "3 November 1858" "15 November 1858" "24 November 1859" ...
## ..$ Venue : chr [1:5] "Addiscome" "Oxford" "Oxford" "Oxford" ...
## $ :'data.frame': 32 obs. of 6 variables:
## ..$ Time : chr [1:32] "4:14.4" "4:12.6" "4:10.4" "4:09.2" ...
## ..$ Auto : chr [1:32] "" "" "" "" ...
## ..$ Athlete : chr [1:32] "John Paul Jones" "Norman Taber" "Paavo Nurmi" "Jules Ladoumègue" ...
## ..$ Nationality: chr [1:32] " United States" " United States" " Finland" " France" ...
## ..$ Date : chr [1:32] "31 May 1913[5]" "16 July 1915[5]" "23 August 1923[5]" "4 October 1931[5]" ...
## ..$ Venue : chr [1:32] "Allston, Mass." "Allston, Mass." "Stockholm" "Paris" ...
## $ :'data.frame': 18 obs. of 5 variables:
## ..$ Time : chr [1:18] "6:13.2" "5:27.5" "5:24.0" "5:23.0" ...
## ..$ Athlete : chr [1:18] "Elizabeth Atkinson" "Ruth Christmas" "Gladys Lunn" "Gladys Lunn" ...
## ..$ Nationality: chr [1:18] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
## ..$ Date : chr [1:18] "24 June 1921" "20 August 1932" "1 June 1936" "18 July 1936" ...
## ..$ Venue : chr [1:18] "Manchester" "London" "Brentwood" "London" ...
## $ :'data.frame': 13 obs. of 6 variables:
## ..$ Time : chr [1:13] "4:37.0" "4:36.8" "4:35.3" "4:29.5" ...
## ..$ Auto : chr [1:13] "" "" "" "" ...
## ..$ Athlete : chr [1:13] "Anne Smith" "Maria Gommers" "Ellen Tittel" "Paola Pigni" ...
## ..$ Nationality: chr [1:13] " United Kingdom" " Netherlands" " West Germany" " Italy" ...
## ..$ Date : chr [1:13] "3 June 1967[6]" "14 June 1969[6]" "20 August 1971[6]" "8 August 1973[6]" ...
## ..$ Venue : chr [1:13] "London" "Leicester" "Sittard" "Viareggio" ...
# Look at the structure (look for how many tables are in the list; verify they are "data.frame" format)
str(table_list)
## List of 6
## $ :'data.frame': 11 obs. of 5 variables:
## ..$ Time : chr [1:11] "4:28" "4:28" "4:23" "4:22¼" ...
## ..$ Athlete : chr [1:11] "Charles Westhall" "Thomas Horspool" "Thomas Horspool" "Siah Albison" ...
## ..$ Nationality: chr [1:11] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
## ..$ Date : chr [1:11] "26 July 1855" "28 September 1857" "12 July 1858" "27 October 1860" ...
## ..$ Venue : chr [1:11] "London" "Manchester" "Manchester" "Manchester" ...
## $ :'data.frame': 16 obs. of 5 variables:
## ..$ Time : chr [1:16] "4:55" "4:49" "4:46" "4:33" ...
## ..$ Athlete : chr [1:16] "J. Heaviside" "J. Heaviside" "Matthew Greene" "George Farran" ...
## ..$ Nationality: chr [1:16] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
## ..$ Date : chr [1:16] "1 April 1861" "27 May 1861" "27 May 1861" "23 May 1862" ...
## ..$ Venue : chr [1:16] "Dublin" "Dublin" "Dublin" "Dublin" ...
## $ :'data.frame': 5 obs. of 5 variables:
## ..$ Time : chr [1:5] "4:52" "4:45" "4:45" "4:40" ...
## ..$ Athlete : chr [1:5] "Cadet Marshall" "Thomas Finch" "St. Vincent Hammick" "Gerald Surman" ...
## ..$ Nationality: chr [1:5] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
## ..$ Date : chr [1:5] "2 September 1852" "3 November 1858" "15 November 1858" "24 November 1859" ...
## ..$ Venue : chr [1:5] "Addiscome" "Oxford" "Oxford" "Oxford" ...
## $ :'data.frame': 32 obs. of 6 variables:
## ..$ Time : chr [1:32] "4:14.4" "4:12.6" "4:10.4" "4:09.2" ...
## ..$ Auto : chr [1:32] "" "" "" "" ...
## ..$ Athlete : chr [1:32] "John Paul Jones" "Norman Taber" "Paavo Nurmi" "Jules Ladoumègue" ...
## ..$ Nationality: chr [1:32] " United States" " United States" " Finland" " France" ...
## ..$ Date : chr [1:32] "31 May 1913[5]" "16 July 1915[5]" "23 August 1923[5]" "4 October 1931[5]" ...
## ..$ Venue : chr [1:32] "Allston, Mass." "Allston, Mass." "Stockholm" "Paris" ...
## $ :'data.frame': 18 obs. of 5 variables:
## ..$ Time : chr [1:18] "6:13.2" "5:27.5" "5:24.0" "5:23.0" ...
## ..$ Athlete : chr [1:18] "Elizabeth Atkinson" "Ruth Christmas" "Gladys Lunn" "Gladys Lunn" ...
## ..$ Nationality: chr [1:18] " United Kingdom" " United Kingdom" " United Kingdom" " United Kingdom" ...
## ..$ Date : chr [1:18] "24 June 1921" "20 August 1932" "1 June 1936" "18 July 1936" ...
## ..$ Venue : chr [1:18] "Manchester" "London" "Brentwood" "London" ...
## $ :'data.frame': 13 obs. of 6 variables:
## ..$ Time : chr [1:13] "4:37.0" "4:36.8" "4:35.3" "4:29.5" ...
## ..$ Auto : chr [1:13] "" "" "" "" ...
## ..$ Athlete : chr [1:13] "Anne Smith" "Maria Gommers" "Ellen Tittel" "Paola Pigni" ...
## ..$ Nationality: chr [1:13] " United Kingdom" " Netherlands" " West Germany" " Italy" ...
## ..$ Date : chr [1:13] "3 June 1967[6]" "14 June 1969[6]" "20 August 1971[6]" "8 August 1973[6]" ...
## ..$ Venue : chr [1:13] "London" "Leicester" "Sittard" "Viareggio" ...
# Inspect the first table in the list (IAAF Men from the Wikipedia Page)
IAAFtimes <- table_list[[4]]
tail(IAAFtimes)
Time | Auto | Athlete | Nationality | Date | Venue | |
---|---|---|---|---|---|---|
27 | 3:48.53 | Sebastian Coe | United Kingdom | 19 August 1981[5] | Zürich | |
28 | 3:48.40 | Steve Ovett | United Kingdom | 26 August 1981[5] | Koblenz | |
29 | 3:47.33 | Sebastian Coe | United Kingdom | 28 August 1981[5] | Brussels | |
30 | 3:46.32 | Steve Cram | United Kingdom | 27 July 1985[5] | Oslo | |
31 | 3:44.39 | Noureddine Morceli | Algeria | 5 September 1993[5] | Rieti | |
32 | 3:43.13 | Hicham El Guerrouj | Morocco | 7 July 1999[5] | Rome |
source
## function (file, local = FALSE, echo = verbose, print.eval = echo,
## verbose = getOption("verbose"), prompt.echo = getOption("prompt"),
## max.deparse.length = 150, chdir = FALSE, encoding = getOption("encoding"),
## continue.echo = getOption("continue"), skip.echo = 0, keep.source = getOption("keep.source"))
## {
## envir <- if (isTRUE(local)) {
## parent.frame()
## }
## else if (identical(local, FALSE)) {
## .GlobalEnv
## }
## else if (is.environment(local)) {
## local
## }
## else stop("'local' must be TRUE, FALSE or an environment")
## have_encoding <- !missing(encoding) && encoding != "unknown"
## if (!missing(echo)) {
## if (!is.logical(echo))
## stop("'echo' must be logical")
## if (!echo && verbose) {
## warning("'verbose' is TRUE, 'echo' not; ... coercing 'echo <- TRUE'")
## echo <- TRUE
## }
## }
## if (verbose) {
## cat("'envir' chosen:")
## print(envir)
## }
## ofile <- file
## from_file <- FALSE
## srcfile <- NULL
## if (is.character(file)) {
## if (identical(encoding, "unknown")) {
## enc <- utils::localeToCharset()
## encoding <- enc[length(enc)]
## }
## else enc <- encoding
## if (length(enc) > 1L) {
## encoding <- NA
## owarn <- options(warn = 2)
## for (e in enc) {
## if (is.na(e))
## next
## zz <- file(file, encoding = e)
## res <- tryCatch(readLines(zz, warn = FALSE),
## error = identity)
## close(zz)
## if (!inherits(res, "error")) {
## encoding <- e
## break
## }
## }
## options(owarn)
## }
## if (is.na(encoding))
## stop("unable to find a plausible encoding")
## if (verbose)
## cat(gettextf("encoding = \"%s\" chosen", encoding),
## "\n", sep = "")
## if (file == "") {
## file <- stdin()
## srcfile <- "<stdin>"
## }
## else {
## filename <- file
## file <- file(filename, "r", encoding = encoding)
## on.exit(close(file))
## if (isTRUE(keep.source)) {
## lines <- readLines(file, warn = FALSE)
## on.exit()
## close(file)
## srcfile <- srcfilecopy(filename, lines, file.mtime(filename)[1],
## isFile = TRUE)
## }
## else {
## from_file <- TRUE
## srcfile <- filename
## }
## loc <- utils::localeToCharset()[1L]
## encoding <- if (have_encoding)
## switch(loc, `UTF-8` = "UTF-8", `ISO8859-1` = "latin1",
## "unknown")
## else "unknown"
## }
## }
## else {
## lines <- readLines(file, warn = FALSE)
## if (isTRUE(keep.source))
## srcfile <- srcfilecopy(deparse(substitute(file)),
## lines)
## else srcfile <- deparse(substitute(file))
## }
## exprs <- if (!from_file) {
## if (length(lines))
## .Internal(parse(stdin(), n = -1, lines, "?", srcfile,
## encoding))
## else expression()
## }
## else .Internal(parse(file, n = -1, NULL, "?", srcfile, encoding))
## on.exit()
## if (from_file)
## close(file)
## Ne <- length(exprs)
## if (verbose)
## cat("--> parsed", Ne, "expressions; now eval(.)ing them:\n")
## if (chdir) {
## if (is.character(ofile)) {
## if (grepl("^(ftp|http|file)://", ofile))
## warning("'chdir = TRUE' makes no sense for a URL")
## else if ((path <- dirname(ofile)) != ".") {
## owd <- getwd()
## if (is.null(owd))
## stop("cannot 'chdir' as current directory is unknown")
## on.exit(setwd(owd), add = TRUE)
## setwd(path)
## }
## }
## else {
## warning("'chdir = TRUE' makes no sense for a connection")
## }
## }
## if (echo) {
## sd <- "\""
## nos <- "[^\"]*"
## oddsd <- paste0("^", nos, sd, "(", nos, sd, nos, sd,
## ")*", nos, "$")
## trySrcLines <- function(srcfile, showfrom, showto) {
## lines <- tryCatch(suppressWarnings(getSrcLines(srcfile,
## showfrom, showto)), error = function(e) e)
## if (inherits(lines, "error"))
## character()
## else lines
## }
## }
## yy <- NULL
## lastshown <- 0
## srcrefs <- attr(exprs, "srcref")
## for (i in seq_len(Ne + echo)) {
## tail <- i > Ne
## if (!tail) {
## if (verbose)
## cat("\n>>>> eval(expression_nr.", i, ")\n\t\t =================\n")
## ei <- exprs[i]
## }
## if (echo) {
## nd <- 0
## srcref <- if (tail)
## attr(exprs, "wholeSrcref")
## else if (i <= length(srcrefs))
## srcrefs[[i]]
## if (!is.null(srcref)) {
## if (i == 1)
## lastshown <- min(skip.echo, srcref[3L] - 1)
## if (lastshown < srcref[3L]) {
## srcfile <- attr(srcref, "srcfile")
## dep <- trySrcLines(srcfile, lastshown + 1,
## srcref[3L])
## if (length(dep)) {
## leading <- if (tail)
## length(dep)
## else srcref[1L] - lastshown
## lastshown <- srcref[3L]
## while (length(dep) && grepl("^[[:blank:]]*$",
## dep[1L])) {
## dep <- dep[-1L]
## leading <- leading - 1L
## }
## dep <- paste0(rep.int(c(prompt.echo, continue.echo),
## c(leading, length(dep) - leading)), dep,
## collapse = "\n")
## nd <- nchar(dep, "c")
## }
## else srcref <- NULL
## }
## }
## if (is.null(srcref)) {
## if (!tail) {
## dep <- substr(paste(deparse(ei, control = "showAttributes"),
## collapse = "\n"), 12L, 1000000L)
## dep <- paste0(prompt.echo, gsub("\n", paste0("\n",
## continue.echo), dep))
## nd <- nchar(dep, "c") - 1L
## }
## }
## if (nd) {
## do.trunc <- nd > max.deparse.length
## dep <- substr(dep, 1L, if (do.trunc)
## max.deparse.length
## else nd)
## cat("\n", dep, if (do.trunc)
## paste(if (grepl(sd, dep) && grepl(oddsd, dep))
## " ...\" ..."
## else " ....", "[TRUNCATED] "), "\n", sep = "")
## }
## }
## if (!tail) {
## yy <- withVisible(eval(ei, envir))
## i.symbol <- mode(ei[[1L]]) == "name"
## if (!i.symbol) {
## curr.fun <- ei[[1L]][[1L]]
## if (verbose) {
## cat("curr.fun:")
## utils::str(curr.fun)
## }
## }
## if (verbose >= 2) {
## cat(".... mode(ei[[1L]])=", mode(ei[[1L]]), "; paste(curr.fun)=")
## utils::str(paste(curr.fun))
## }
## if (print.eval && yy$visible) {
## if (isS4(yy$value))
## methods::show(yy$value)
## else print(yy$value)
## }
## if (verbose)
## cat(" .. after ", sQuote(deparse(ei, control = c("showAttributes",
## "useSource"))), "\n", sep = "")
## }
## }
## invisible(yy)
## }
## <bytecode: 0x7f8c89773590>
## <environment: namespace:base>
Google Penn State Football Statistics: http://bfy.tw/88gl
Identify a data table to scrape (for example, “receiving statistics”)
Right click on the table you want, choose “Inspect Element”
Roll cursor over the HTML code (even if you don’t understand it) until you see the whole table that you want appear highlighted. Click on the row that highlights the whole table.
Right click the highlighted row >> Copy >> XPath
Edit the R code chunk below to paste the XPath with SINGLE quotes around it, and URL with quotes around it as shown.
Execute the code chunk!
library("rvest")
url <- "http://www.espn.com/college-football/team/stats/_/id/213/penn-state-nittany-lions"
XPATH <- '//*[@id="my-players-table"]/div[4]/div/table'
Table1 <- url %>%
html() %>%
html_nodes(xpath='XPATH') %>%
html_table()
Table1[[1]]
url <- "http://www.espn.com/college-football/team/stats/_/id/213/penn-state-nittany-lions"
XPATH <- '//*[@id="my-players-table"]/div[4]/div/table'
Table2 <- url %>%
read_html(header = TRUE) %>%
html_nodes(xpath=XPATH) %>%
html_table()
# R stores the result as a "list" object, so the double square brackets select the first
# element of the list, and we store it at a data table called FootballStatsRaw
FootballStatsRaw <- Table2[[1]]
# Inspect the Data Table
FootballStatsRaw
X1 | X2 | X3 | X4 | X5 | X6 |
---|---|---|---|---|---|
Receiving Statistics | Receiving Statistics | Receiving Statistics | Receiving Statistics | Receiving Statistics | Receiving Statistics |
NAME | REC | YDS | AVG | LONG | TD |
Chris Godwin | 25 | 364 | 14.6 | 52 (TD) | 3 |
DeAndre Thompkins | 18 | 328 | 18.2 | 70 (TD) | 1 |
Mike Gesicki | 27 | 323 | 12.0 | 53 | 2 |
DaeSean Hamilton | 19 | 238 | 12.5 | 45 | 1 |
Saquon Barkley | 11 | 143 | 13.0 | 40 (TD) | 1 |
Irvin Charles | 1 | 80 | 80.0 | 80 (TD) | 1 |
Saeed Blacknall | 3 | 59 | 19.7 | 35 | 0 |
Juwan Johnson | 1 | 27 | 27.0 | 27 | 0 |
Brandon Polk | 2 | 18 | 9.0 | 14 | 0 |
Mark Allen | 2 | 5 | 2.5 | 4 | 0 |
Miles Sanders | 1 | 3 | 3.0 | 3 | 0 |
Andre Robinson | 1 | 2 | 2.0 | 2 | 0 |
Totals | 111 | 1590 | 14.3 | 80 | 9 |
# Tidy up the data table & rename variables
FootballStatsClean <-
FootballStatsRaw %>%
rename(name = X1, receptions = X2, total_yds = X3, avg_yds = X4, longest = X5, touchdowns = X6) %>%
filter(row_number() > 2, name != "Totals")
# Inspect FootballStatsClean
FootballStatsClean
name | receptions | total_yds | avg_yds | longest | touchdowns |
---|---|---|---|---|---|
Chris Godwin | 25 | 364 | 14.6 | 52 (TD) | 3 |
DeAndre Thompkins | 18 | 328 | 18.2 | 70 (TD) | 1 |
Mike Gesicki | 27 | 323 | 12.0 | 53 | 2 |
DaeSean Hamilton | 19 | 238 | 12.5 | 45 | 1 |
Saquon Barkley | 11 | 143 | 13.0 | 40 (TD) | 1 |
Irvin Charles | 1 | 80 | 80.0 | 80 (TD) | 1 |
Saeed Blacknall | 3 | 59 | 19.7 | 35 | 0 |
Juwan Johnson | 1 | 27 | 27.0 | 27 | 0 |
Brandon Polk | 2 | 18 | 9.0 | 14 | 0 |
Mark Allen | 2 | 5 | 2.5 | 4 | 0 |
Miles Sanders | 1 | 3 | 3.0 | 3 | 0 |
Andre Robinson | 1 | 2 | 2.0 | 2 | 0 |
Assignment is worth a total of 10 points.
mutate()
to create the new variable “country”cbind()
or rbind()
to combine the data tables together (be careful to match column order)paste()
function to combine the name
and reactor
variables to create an informative unique case IDteaching | stat 184 home | syllabus | piazza | canvas