Simple Webscraping

Now that we’ve spent a significant amount of time on descriptive statistics, it’s time we took a look at some data we don’t know anything about to see what we can figure out.

Analyzing information about sports, particularly baseball, is a proud tradition in statistics. Unfortunately I don’t know much about baseball. We’re going to use information about the heights and weights of a bunch of baseball players. In order to do so we have to get that data.

The data is currently stored in a table here:
http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights

There are a few ways to get at this data.

First, you can copy and paste the table into a program like Excel and then load that data into R. I’ve done this and, if the site moves or is taken down, you can get a .csv file with the data in a Google Doc I made.

For simple data this is probably the easiest way. It takes just a few seconds. Copy – Paste – Save As.
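
If you go the copy-and-paste route, loading the saved file is a single call to read.csv(). Here is a minimal sketch. The two rows are made-up placeholders written to a temporary file so the example runs on its own; they are not the real data, and the players.csv name is just for illustration.

```r
# Stand-in for the file you would save from Excel.
# The rows below are made-up placeholders, not real players.
csv.path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,team,position,height,weight,age",
  "Player_A,AAA,Catcher,74,180,23.5",
  "Player_B,BBB,Shortstop,72,190,30.1"
), csv.path)

# read.csv() guesses the column types; keep text columns as character.
players.csv <- read.csv(csv.path, stringsAsFactors = FALSE)
str(players.csv)
```

With your own file you would pass its path in place of csv.path.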

There are times, however, when an alternative process known as webscraping is more effective. Let’s take the chance to webscrape this simple dataset. The XML package by Duncan Temple Lang provides a bunch of useful tools.

First we load the XML package and store the address of the site as an object in R, then we use the htmlParse() function to get all of the HTML code from the page.

library(XML)  # provides htmlParse(), getNodeSet() and readHTMLTable()
site <- "http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights"
site.doc <- htmlParse(site)

It happens that there are three tables on that page and trying to read the whole thing directly produces some crazy results. Fortunately, we can extract each table separately since the HTML code specifies where each one begins and ends.

tableNodes <- getNodeSet(site.doc, "//table")
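
To see what getNodeSet() hands back without depending on the website, here is a small sketch on a made-up page held in a string. The page, the demo.* names and the placeholder players are all assumptions for illustration. Counting the rows in each table is just one way to work out which table holds the data.

```r
library(XML)

# A made-up page with two tables: a one-row navigation table and a data table.
demo.page <- "<html><body>
<table><tr><td>navigation</td></tr></table>
<table>
<tr><th>Name</th><th>Height(inches)</th></tr>
<tr><td>Player_A</td><td>74</td></tr>
<tr><td>Player_B</td><td>72</td></tr>
</table>
</body></html>"

# asText = TRUE tells htmlParse() the input is HTML itself, not a file or URL.
demo.doc <- htmlParse(demo.page, asText = TRUE)

# One node per <table> element, in document order.
demo.tables <- getNodeSet(demo.doc, "//table")
length(demo.tables)

# Count the <tr> rows inside each table to spot the one with the data.
demo.rows <- sapply(demo.tables, function(node) length(getNodeSet(node, ".//tr")))
demo.rows
```

The table with the most rows is usually the one you want, which is how you might land on tableNodes[[2]] for the real page.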

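The final steps are to read the data and clean it up. With the readHTMLTable() function from the XML package we can convert the HTML code into a dataframe. The player data is in the second of the three tables, so that is the node we pass in. In that command we also use the colClasses argument to tell it the correct class for each column, in order.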

tab <- readHTMLTable(tableNodes[[2]], 
	colClasses = c("character","factor","factor",
		"integer","integer","numeric"),
	stringsAsFactors=FALSE)
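
The colClasses vector is matched to columns by position, so its order has to follow the table. Here is a minimal sketch on a made-up two-column table (the HTML, the demo.* names and the placeholder players are assumptions, not the real page) showing the text in each cell being converted to the class we asked for.

```r
library(XML)

# A made-up table with a header row and two placeholder players.
demo.html <- "<table>
<tr><th>Name</th><th>Height(inches)</th></tr>
<tr><td>Player_A</td><td>74</td></tr>
<tr><td>Player_B</td><td>72</td></tr>
</table>"

demo.node <- getNodeSet(htmlParse(demo.html, asText = TRUE), "//table")[[1]]

# First column stays text, second column becomes integer.
demo.tab <- readHTMLTable(demo.node,
	colClasses = c("character", "integer"),
	stringsAsFactors = FALSE)

str(demo.tab)
```

Without colClasses every column would come back as text and you would have to convert the numbers yourself.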

After we have the dataframe we need to rename all of the variables. Why? Because tables aren’t designed like dataframes. We can use the names() function to see just how bad it is. We’ve got spaces in the names of all of them, height and weight have parentheses that will cause confusion, and the age variable actually contains a newline character (\n) in its name.

Madness!

Easily rectified madness!

names(tab)
[1] " Name "           " Team "           " Position "       " Height(inches) "
[5] " Weight(pounds) " " Age\n"
names(tab) <- c('name','team','position','height','weight','age')
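
Typing six names by hand is quick, but for wider tables you may prefer to clean the names with code. Here is a minimal sketch in base R that starts from the messy names printed above. The raw.names and clean.names objects are just illustrative, and the pattern assumes the unit always sits in parentheses at the end of the name.

```r
# The column names exactly as names(tab) reported them.
raw.names <- c(" Name ", " Team ", " Position ", " Height(inches) ",
	" Weight(pounds) ", " Age\n")

# trimws() drops leading and trailing spaces and newlines; tolower() lowercases.
clean.names <- tolower(trimws(raw.names))

# Remove a trailing "(...)" unit such as "(inches)".
clean.names <- sub("\\(.*\\)$", "", clean.names)

clean.names
```

You could then assign the result with names(tab) <- clean.names instead of typing the vector out.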

str(tab)
'data.frame':   1034 obs. of  6 variables:
 $ name    : chr  "Adam_Donachie" "Paul_Bako" "Ramon_Hernandez" "Kevin_Millar" ...
 $ team    : Factor w/ 30 levels "ANA","ARZ","ATL",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ position: Factor w/ 9 levels "Catcher","Designated_Hitter",..: 1 1 1 3 3 6 7 9 9 4 ...
 $ height  : int  74 74 72 72 73 69 69 71 76 71 ...
 $ weight  : int  180 215 210 210 188 176 209 200 231 180 ...
 $ age     : num  23 34.7 30.8 35.4 35.7 ...

And now we have a beautiful dataframe that behaves just like any other we’ve worked with.

Next time we’ll get around to checking out what the data tells us about the players.
