Summary – Descriptive Statistics

There are many more descriptive statistics than the measures of central tendency and spread we’ve discussed so far, but the most important and most common techniques have been covered. In the last post we acquired some data from the internet that I know very little about. Let’s use what we’ve learned to get an understanding of it.

The best place to start is with the str() and summary() functions. We used str() last time to get an idea of what data we have. It describes 1034 baseball players by name, team, position, height, weight, and age.

summary(tab)
     name                team                 position       height    
 Length:1034        NYM    : 38   Relief_Pitcher  :315   Min.   :67.0  
 Class :character   ATL    : 37   Starting_Pitcher:221   1st Qu.:72.0  
 Mode  :character   DET    : 37   Outfielder      :194   Median :74.0  
                    OAK    : 37   Catcher         : 76   Mean   :73.7  
                    BOS    : 36   Second_Baseman  : 58   3rd Qu.:75.0  
                    CHC    : 36   First_Baseman   : 55   Max.   :83.0  
                    (Other):813   (Other)         :115                 
     weight           age       
 Min.   :150.0   Min.   :20.90  
 1st Qu.:187.0   1st Qu.:25.44  
 Median :200.0   Median :27.93  
 Mean   :201.7   Mean   :28.74  
 3rd Qu.:215.0   3rd Qu.:31.23  
 Max.   :290.0   Max.   :48.52  
 NA's   :1

The most important things we see here are that:
a) The teams are roughly equally represented
b) The positions are not equally represented
c) Height, weight, and age have their mean and median close together, which suggests none of them is badly skewed.
d) There is one missing value for weight!

That last one is especially important since R returns NA, rather than a number, whenever it encounters a missing value while doing math. We can tell R to remove NAs if we want.

sd(tab$weight)
NA
sd(tab$weight, na.rm=TRUE)
20.99149
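To see the same behavior without the baseball data, here is a minimal sketch on a made-up vector (the values in w are invented for illustration and are not from tab). It shows the NA propagating through mean() and how na.rm=TRUE changes that.

```r
# Toy vector standing in for tab$weight; the values are made up
w <- c(180, 200, NA, 215)

plain    <- mean(w)                # NA: the missing value propagates
cleaned  <- mean(w, na.rm = TRUE)  # mean of the three observed values
n_absent <- sum(is.na(w))          # how many values are missing

print(plain)
print(cleaned)
print(n_absent)
```

Counting the NAs first, as in the last line, is a cheap way to decide whether na.rm=TRUE is hiding one missing value or several hundred.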

Since there is only a single missing value we should probably just drop that line rather than include na.rm=TRUE over and over again. The is.na() function will find where our missing data is. We want the subset of tab where it is FALSE that data is missing.

tab <- tab[is.na(tab$weight) == FALSE,]
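The same row-dropping idea can be tried on a small made-up data frame. This is only a sketch: toy and its columns are hypothetical stand-ins for tab. It also shows complete.cases(), which drops a row if any column is missing rather than just weight.

```r
# Hypothetical stand-in for tab, with one NA in weight and one in age
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  weight = c(180, NA, 200, 215),
  age    = c(25.1, 30.2, 27.5, NA)
)

# Keep rows where weight is present; !is.na(x) selects the same rows as is.na(x) == FALSE
by_weight <- toy[!is.na(toy$weight), ]

# Keep rows with no missing value in any column
complete <- toy[complete.cases(toy), ]

print(nrow(by_weight))  # 3
print(nrow(complete))   # 2
```

Which one you want depends on the analysis: filtering on a single column keeps more data, while complete.cases() guarantees no NA is left anywhere.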

The summary() function already showed us the mean, median, and quartiles of the players’ weights. Let’s look at the MAD and the SD.

sd(tab$weight)
 20.99149

mad(tab$weight)
 22.239
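It may help to see what mad() is doing. By default R multiplies the raw median absolute deviation by a constant of 1.4826 so that it is comparable to the SD for normally distributed data. The sketch below checks this on a made-up vector (w is invented, not taken from tab).

```r
# Made-up weights for illustration
w <- c(185, 190, 200, 200, 210, 230)

# Raw median absolute deviation: median distance from the median
raw_mad <- median(abs(w - median(w)))

# R's default rescales the raw value by 1.4826
scaled_mad <- mad(w)

# Setting constant = 1 gives the raw value back
unscaled_mad <- mad(w, constant = 1)

print(raw_mad)
print(scaled_mad)
print(unscaled_mad)
```

Applied to the output above, 22.239 is 1.4826 × 15, so the raw median absolute deviation of the players’ weights is 15 pounds.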

How would we, in colloquial terms, describe the weights of the players? Probably something like “According to this sample, baseball players tend to weigh around 200 pounds, give or take twenty pounds”. We lose a lot of precision here, but it’s much easier to understand.

If our goal is to report the data, however, we need to do the best job we can. A traditional option would be along the lines of “This sample of baseball players (N=1033, one missing value excluded) had a mean weight of 201.7 pounds (SD = 21)”. A better option, if we have the space for it, is a visualization. Since weight is continuous, a density plot is appropriate.

[Figure: density plot of player weights]

How about if we divide things up by position? A boxplot or violin plot is a good way to look at the information that way. A boxplot gives us the most information: the median, the IQR, and any outliers.

[Figure: boxplots of player weight by position]
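The numbers behind a boxplot can be computed directly. Here is a rough sketch on made-up weights (w is invented for illustration), using the common rule that flags points more than 1.5 × IQR beyond the quartiles. Base R’s boxplot() computes its hinges slightly differently, so its results can differ a little on small samples.

```r
# Made-up weights for illustration, with one low and one high value
w <- c(150, 185, 190, 200, 200, 210, 215, 290)

q     <- unname(quantile(w, c(0.25, 0.5, 0.75)))  # quartiles and median
iqr   <- IQR(w)                                   # interquartile range
lower <- q[1] - 1.5 * iqr                         # lower fence
upper <- q[3] + 1.5 * iqr                         # upper fence

# Points outside the fences are the ones drawn as individual outliers
outliers <- w[w < lower | w > upper]

print(q)
print(iqr)
print(outliers)
```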

There’s just one problem with using a boxplot (or even a violin plot) here. The positions don’t all have the same number of players.

with(tab,tapply(position,position,length))
          Catcher Designated_Hitter     First_Baseman        Outfielder 
               76                18                55               194 
   Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher 
              315                58                52               221 
    Third_Baseman 
               45 
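The same tally can also be produced with table(), which is the more usual way to count categories in R. A small sketch on a made-up vector (position here is invented, not the column from tab):

```r
# Made-up positions for illustration
position <- c("Catcher", "Outfielder", "Catcher",
              "Relief_Pitcher", "Outfielder", "Outfielder")

# Count how often each value occurs
counts <- table(position)

# The tapply() idiom gives the same numbers
counts_tapply <- tapply(position, position, length)

# Sorting makes the imbalance easier to read
print(sort(counts, decreasing = TRUE))
```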

There were only 18 Designated Hitters and there were 315 Relief Pitchers. Most plots will disguise that fact. A dotplot won’t, although it lacks much of the information that other methods offer.

[Figure: dotplot of player weights]

Ah, now we see something truly strange about this data. There are striations. Look at how a few values have far more data points than the ones around them. If you look even more closely, it’s visible that these striations fall on multiples of five.

What’s going on? That’s not random! Multiples of five are not important to nature. This is obviously an effect created by human beings. Yet we do have data points in between.

If you track down where this data came from you’ll find that it’s compiled from more than one source. Evidently some of these sources recorded weights only as multiples of five and others as any unit value.
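A quick way to check for this kind of rounding without a plot is to look at the remainder after dividing by five. The sketch below uses made-up weights (w is invented for illustration). If weights were recorded to the nearest pound, we would expect only about one in five to be an exact multiple of five.

```r
# Made-up weights for illustration; most are rounded to the nearest five
w <- c(180, 185, 187, 190, 200, 200, 203, 210, 215, 220)

# TRUE where the weight is an exact multiple of five
is_mult5 <- w %% 5 == 0

n_mult5     <- sum(is_mult5)   # how many multiples of five
share_mult5 <- mean(is_mult5)  # proportion of multiples of five

print(n_mult5)
print(share_mult5)

# Full breakdown of remainders 0 through 4
print(table(w %% 5))
```

A share far above 0.2, as in this toy vector, points to rounding in at least some of the records.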

Next time we’ll take a look at some of the other data we have in the dataframe and examine the data across multiple variables.

Here is the code to make the images in this post with Hadley Wickham’s ggplot2 package.

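library(ggplot2)

# tab is the data frame of players used throughout this post
p <- ggplot(tab)

# Density of weight
# (in ggplot2 3.4.0 and later, linewidth replaces size for lines; size still works with a warning)
p + geom_density(aes(x=weight), size=1) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8))

# Boxplot of weight by position, with angled axis labels
p + geom_boxplot(aes(x=position,y=weight)) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8),
		axis.text.x = element_text(angle=45,vjust=.6))

# Dotplot of weight in one-pound bins, stacked around the center
p + geom_dotplot(aes(x='identity',y=weight),binaxis="y",stackdir="center",binwidth=1) +
	theme(panel.grid.major = element_line(size = 1),legend.position="none") +
	scale_y_continuous(minor_breaks=NULL)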
