There are many more descriptive statistics than the measures of central tendency and the measures of spread we’ve discussed so far, but the most important and most common techniques have been covered. In the last post we acquired some data from the internet that I know very little about. Let’s use what we’ve learned to get an understanding of it.
The best place to start is with the str() and summary() functions. We used str() last time to get an idea of what data we have. It describes 1034 baseball players by name, team, position, height, weight, and age.
summary(tab)
     name                team                  position       height
 Length:1034        NYM    : 38   Relief_Pitcher  :315   Min.   :67.0
 Class :character   ATL    : 37   Starting_Pitcher:221   1st Qu.:72.0
 Mode  :character   DET    : 37   Outfielder      :194   Median :74.0
                    OAK    : 37   Catcher         : 76   Mean   :73.7
                    BOS    : 36   Second_Baseman  : 58   3rd Qu.:75.0
                    CHC    : 36   First_Baseman   : 55   Max.   :83.0
                    (Other):813   (Other)         :115
     weight           age
 Min.   :150.0   Min.   :20.90
 1st Qu.:187.0   1st Qu.:25.44
 Median :200.0   Median :27.93
 Mean   :201.7   Mean   :28.74
 3rd Qu.:215.0   3rd Qu.:31.23
 Max.   :290.0   Max.   :48.52
 NA's   :1
The most important things we see here are that:
a) The teams are roughly equally represented
b) The positions are not equally represented
c) Height, weight, and age have their mean and median close together, which is usually a good sign: it suggests the distributions are not heavily skewed.
d) There is one missing value for weight!
That last one is especially important since R will return NA, rather than a number, if it encounters an NA while doing math. We can tell R to remove NAs first with na.rm=TRUE if we want.
sd(tab$weight)
NA
sd(tab$weight, na.rm=TRUE)
20.99149
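A small sketch of how missing values behave in these calculations. The vector below is made up for illustration and only stands in for tab$weight:

```r
# Toy vector standing in for tab$weight (values are invented)
w <- c(180, 200, 215, NA, 190)

# With an NA present, the result of the calculation is itself NA
mean_with_na <- mean(w)
sd_with_na   <- sd(w)

# na.rm = TRUE drops the NA before computing
mean_clean <- mean(w, na.rm = TRUE)   # 196.25

# Counting the missing values tells us how much data is affected
n_missing <- sum(is.na(w))            # 1
```

Counting the NAs first is a cheap way to decide whether to carry na.rm=TRUE around or to drop the affected rows.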
Since there is only a single missing value, we should probably just drop that row rather than include na.rm=TRUE over and over again. The is.na() function will find where our missing data is. We want the subset of tab where is.na() is FALSE, that is, the rows where weight is not missing.
tab <- tab[!is.na(tab$weight), ]
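To see the row filter on something small, here is a sketch with a hypothetical four-row data frame (the names and numbers are invented, not taken from the real data). It also shows complete.cases(), which checks every column at once:

```r
# Hypothetical miniature of tab; names and numbers are invented
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  weight = c(180, NA, 215, 190),
  stringsAsFactors = FALSE
)

# is.na() is TRUE where weight is missing; keep the rows where it is FALSE
keep      <- !is.na(toy$weight)
toy_clean <- toy[keep, ]

# complete.cases() flags rows with no NA in any column
toy_clean2 <- toy[complete.cases(toy), ]

nrow(toy)        # 4
nrow(toy_clean)  # 3
```

With only one column containing missing data, both approaches keep the same rows.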
The summary() function already showed us the mean, median, and quartiles of the players’ weights. Let’s look at the MAD and the SD.
sd(tab$weight, na.rm=TRUE)
20.99149
mad(tab$weight, na.rm=TRUE)
22.239
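It can help to see what mad() is doing. By default R multiplies the raw median absolute deviation by 1.4826 so that it is comparable to the SD for normally distributed data. The sketch below uses made-up weights, chosen so that the raw MAD is 15, which scaled by 1.4826 gives the same 22.239 seen above:

```r
# Made-up weights, not the real data
w <- c(150, 185, 190, 200, 205, 215, 290)

med     <- median(w)              # 200
raw_mad <- median(abs(w - med))   # median of 0, 5, 10, 15, 15, 50, 90 is 15
scaled  <- 1.4826 * raw_mad       # 22.239

# mad() applies the same scaling constant by default
same <- isTRUE(all.equal(scaled, mad(w)))
```

Note that the two extreme values (150 and 290) barely affect the MAD, whereas they would pull the SD upward.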
How would we, in colloquial terms, describe the weights of the players? Probably something like “According to this sample, baseball players tend to weigh around 200 pounds, give or take twenty pounds”. We lose a lot of precision here but it’s much easier to understand.
If our goal is to report the data, however, we need to do the best job we can. A traditional option would be along the lines of “This sample of baseball players (N=1033, one missing value excluded) had a mean weight of 201.7 pounds (SD = 21)”. A better option, if we have the space for it, is a visualization. Since weight is continuous, a density plot is appropriate.
How about if we divide things up by position? A boxplot or violin plot is a good way to look at the information that way. A boxplot gives us the most information: the median, the IQR, and any outliers.
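The traditional summary sentence can be assembled from the data rather than typed by hand, which avoids transcription slips. This is only a sketch on invented values; the wording of the sentence and the variable names are my own choices:

```r
# Invented values with one missing entry, standing in for tab$weight
w <- c(180, 200, 216, NA, 190)

n_missing <- sum(is.na(w))
w_obs     <- w[!is.na(w)]

# Build the report line from the computed statistics
report <- sprintf(
  "N=%d, %d missing value(s) excluded; mean weight %.1f pounds (SD = %.0f)",
  length(w_obs), n_missing, mean(w_obs), sd(w_obs)
)
report
```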
There’s just one problem with using a boxplot (or even a violin plot) here. The positions don’t all have the same number of players.
with(tab,tapply(position,position,length))
          Catcher Designated_Hitter     First_Baseman        Outfielder
               76                18                55               194
   Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher
              315                58                52               221
    Third_Baseman
               45
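The same kind of count can also be produced with table(), and sorting it makes the imbalance easier to spot. A sketch on a tiny invented factor, not the real position column:

```r
# Invented positions, standing in for tab$position
pos <- factor(c("Catcher", "Outfielder", "Outfielder",
                "Relief_Pitcher", "Relief_Pitcher", "Relief_Pitcher"))

# table() counts how many times each level occurs
counts <- table(pos)

# tapply() with length gives the same numbers
via_tapply <- tapply(pos, pos, length)

# Sorting puts the largest group first
sorted <- sort(counts, decreasing = TRUE)
sorted
```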
There were only 18 Designated Hitters but 315 Relief Pitchers. Most plots will disguise that fact. A dotplot won’t, although it lacks much of the information that other methods offer.
Ah, now we see something truly strange about this data. There are striations. Look at how a few values have far more data points than the ones around them. If you look even more closely, it’s visible that these striations happen on multiples of five.
What’s going on? That’s not random! Multiples of five are not important to nature. This is obviously an effect created by human beings. Yet we do have data points in between.
If you track down where this data came from you’ll find that it’s compiled from more than one source. Evidently some of these sources recorded weights only as multiples of five and others as any unit value.
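The striations can also be checked numerically. If weights were recorded to the nearest pound with no preference for round numbers, we would expect roughly one in five to be a multiple of five. The sketch below uses invented weights to show the check; running it on the real weight column is left as an exercise:

```r
# Invented weights in which multiples of five are over-represented
w <- c(180, 185, 190, 192, 200, 200, 205, 207, 210, 215)

# TRUE where the weight is an exact multiple of five
on_five <- w %% 5 == 0

# Share of multiples of five; about 0.2 would be expected without rounding
share <- mean(on_five)        # 0.8

# Tabulating the last digit shows the pile-up on 0 and 5
last_digit <- table(w %% 10)
last_digit
```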
Next time we’ll take a look at some of the other data we have in the dataframe and examine the data across multiple variables.
Here is the code used to make the images in this post with Hadley Wickham’s ggplot2 package.
library(ggplot2)

p <- ggplot(tab)

# Density of player weights
# (ggplot2 3.4.0 and later prefer linewidth= over size= for lines)
p + geom_density(aes(x=weight), size=1) +
  theme(panel.grid.major = element_line(size = .8),
        panel.grid.minor = element_line(size = .8))

# Boxplot of weight by position, with angled axis labels
p + geom_boxplot(aes(x=position,y=weight)) +
  theme(panel.grid.major = element_line(size = .8),
        panel.grid.minor = element_line(size = .8),
        axis.text.x = element_text(angle=45,vjust=.6))

# Dotplot of every weight, stacked around a single x position
p + geom_dotplot(aes(x='identity',y=weight),binaxis="y",stackdir="center",binwidth=1) +
  theme(panel.grid.major = element_line(size = 1),legend.position="none") +
  scale_y_continuous(minor_breaks=NULL)