Summary – Descriptive Statistics Part 2

There is more information in the dataframe we have named tab than just the weights of the players, and we might want to look at the rest of it too. Let’s get a quick look at the relationships between the three numeric variables: height, weight, and age. If you try using plot() on the raw dataframe in order to get a scatterplot matrix, a bunch of errors will occur, because you can’t make a scatterplot of data that isn’t numeric.

Fortunately we can extract just the numeric variables.

plot(tab[,c(4,5,6)])
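If you don't want to count column positions by hand, the numeric columns can also be picked out by type. Here is a minimal sketch on a made-up dataframe (toy and its column names are hypothetical stand-ins, not the baseball data):

```r
# Hypothetical stand-in for tab: a mix of character and numeric columns
toy <- data.frame(
  name   = c("A", "B", "C"),
  team   = c("X", "Y", "X"),
  height = c(72, 74, 70),
  weight = c(180, 215, 190),
  age    = c(23.45, 30.07, 27.9),
  stringsAsFactors = FALSE
)

# Flag the columns whose contents are numeric, then keep only those
is_num <- vapply(toy, is.numeric, logical(1))
toy_num <- toy[, is_num]

# A dataframe of numeric columns passed to plot() gives a scatterplot matrix
plot(toy_num)
```

Selecting by type keeps working if the columns are reordered, which selecting by position does not.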

baseballmatrix

All we see here is a slight positive relationship between height and weight. Taller players tend to weigh more. One thing that scatterplots have trouble with is overplotting, which happens when two points are identical or very close together. Overplotting can make it difficult to see where the data clusters.

There are a few ways to deal with this:

One very aesthetically pleasing option is a carefully designed bubble plot that merges nearby points into larger ones.

More simply, we could “jitter” the points, moving each of them a small random distance. Points that are close or identical then spread out to fill a space.

In the same vein, we might make the points translucent, so that the darker areas show where many points overlap. This turns out to be computationally taxing and not very easy to read.

Finally, we could make a two-dimensional histogram. Since the bins can’t rise off the page the way the bars of an ordinary histogram do, we show the count in each bin with color instead.
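To make the jitter and translucency ideas concrete, here is a small base R sketch. The heights are made up, and the 0.3 offset and 0.2 alpha are arbitrary choices for illustration, not values used for the plots in this post:

```r
set.seed(1)  # only so the sketch is reproducible

# Made-up heights recorded in whole inches, so several values coincide
height <- c(72, 72, 72, 73, 73, 74)

# jitter() adds a uniform random offset, here at most 0.3 in either direction
height_j <- jitter(height, amount = 0.3)

# A translucent black: where points overlap, the colour builds up darker
translucent <- rgb(0, 0, 0, alpha = 0.2)

# Plot the jittered values with translucent, filled points
plot(height_j, pch = 16, col = translucent)
```

The jittered values stay within 0.3 of the originals, so the overall shape of the data is preserved while identical points are pulled apart.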

Let’s take a look at these options and then decide how we might want to describe the data. The ggplot2 code will appear at the end of the post.

A two-dimensional histogram is easy to make from just our basic data with the geom_hex() layer in ggplot2. It has to be tuned a bit, though. The bins argument tells the function how many hexes wide each axis is. I like to try square numbers to find a good number of bins. Here is how it looks with 9, 16, and 25 bins.

hexbins

I think 16 is about right.
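The counting behind a two-dimensional histogram can be sketched in base R. This version uses rectangular bins rather than the hexagons geom_hex() draws, and the data and bin edges are made up for illustration:

```r
# Made-up heights (inches) and weights (pounds)
height <- c(70, 71, 71, 72, 74, 75)
weight <- c(180, 185, 190, 200, 220, 230)

# Cut each axis into two bins, then cross-tabulate to count points per cell
hbin <- cut(height, breaks = c(69, 72, 75))
wbin <- cut(weight, breaks = c(170, 200, 230))
counts <- table(hbin, wbin)
counts

# image() maps each cell's count to a colour, the same job the hex fill does
image(unclass(counts))
```

More bins give finer detail but fewer points per cell, which is the trade-off being tuned when choosing between 9, 16, and 25.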

A jittered plot is also easy to make with ggplot2 since it has its own layer, geom_jitter().

jitter

The results are not very impressive. The striations in the height data, which has only whole-unit values in a narrow range, are removed, but the jittering ends up showing us the striations in the weight data! It’s a mess that isn’t worth trying to clean up here. A translucent plot is even worse.

The kind of bubble plot we want doesn’t come built into R. Thankfully the brilliant Nina Zumel has a function that does most of the work for us.

bplot_stats = function(xcol, ycol) {
  nrows = length(xcol)
  ymeans = aggregate(ycol, by=list(xcol), FUN=mean)
  xcounts = aggregate(numeric(nrows)+1, by=list(xcol), FUN=sum)
  data.frame(x = xcounts$Group.1, wt = sqrt(xcounts$x), y=ymeans$x)
}
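To see what this helper returns, here is a quick check on made-up data. The toy vectors are hypothetical, and the function is repeated only so the snippet runs on its own:

```r
# Same helper as above, repeated so this sketch is self-contained
bplot_stats <- function(xcol, ycol) {
  nrows <- length(xcol)
  ymeans <- aggregate(ycol, by = list(xcol), FUN = mean)
  xcounts <- aggregate(numeric(nrows) + 1, by = list(xcol), FUN = sum)
  data.frame(x = xcounts$Group.1, wt = sqrt(xcounts$x), y = ymeans$x)
}

# Made-up data: two points at x = 1 (y = 10 and 20), one point at x = 2 (y = 30)
toy <- bplot_stats(c(1, 1, 2), c(10, 20, 30))
toy
# Row for x = 1: y = 15 (the mean of 10 and 20), wt = sqrt(2)
# Row for x = 2: y = 30, wt = 1
```

Taking the square root of the count means that, if wt is drawn as a radius, bubble area grows in proportion to the number of points merged.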

What it does is find the mean value of y for each distinct x and compress all of the points at that x into a single bubble. Since height and weight have an obvious relationship, let’s compare age and weight instead. We will have to do a little work first. Age needs to be discretized for this to work, since it was recorded down to the hundredth of a year and few players would share exactly the same value. Let’s convert it to tenths of a year with the cut() function.

breaks <- seq(20,50,.1)
agedisc <- cut(tab$age, breaks=breaks, include.lowest=TRUE)

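The output of cut() is a factor by default, so we need to make it into a numeric. We relabel each level with the lower bound of its interval, then convert through character so that we get those labels rather than the factor’s internal integer codes.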

levels(agedisc) <- breaks
agedisc <- as.numeric(as.character(agedisc))
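The detour through as.character() matters. A tiny sketch with made-up labels shows what happens without it:

```r
# A factor stores integer codes plus a set of character labels
f <- factor(c("20.1", "35.4", "20.1"))

as.numeric(f)                # 1 2 1: the internal codes, not the ages
as.numeric(as.character(f))  # 20.1 35.4 20.1: the labels read as numbers
```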

Now we can feed it into the bplot_stats() function. Here’s what the resulting dataframe looks like.

ageweight <- bplot_stats(agedisc, tab$weight)

str(ageweight)
'data.frame':   187 obs. of  3 variables:
 $ x : num  20.8 21.4 21.5 21.7 21.8 22 22.1 22.2 22.3 22.4 ...
 $ wt: num  1 1 1.41 1 1.41 ...
 $ y : num  225 205 195 210 196 ...

Here x is the discretized age and y is the mean weight at that age, while wt, the square root of the number of players at that age, will be the size of the bubble. Here’s what it looks like.

bubble

The relationship between the age of players and their weight (namely, that there isn’t one) and how the players are distributed are much more obvious than they were on the ordinary scatterplot.

The ggplot2 code used for graphics:

library(ggplot2)  # geom_hex() also needs the hexbin package installed

p <- ggplot(tab)

# jitter plot
p + geom_jitter(aes(x=weight,y=height))

# three different hexbins
p + geom_hex(aes(x=weight,y=height),bins=9) + 
	theme(legend.position="none") + labs(title = "Nine Bins")
p + geom_hex(aes(x=weight,y=height),bins=16) + 
	theme(legend.position="none") + labs(title = "Sixteen Bins")
p + geom_hex(aes(x=weight,y=height),bins=25) + 
	theme(legend.position="none") + labs(title = "Twenty Five Bins")

# Making a bubbleplot
breaks <- seq(20,50,.1)
agedisc <- cut(tab$age, breaks=breaks, include.lowest=TRUE)
levels(agedisc) <- breaks
agedisc <- as.numeric(as.character(agedisc))
ageweight <- bplot_stats(agedisc, tab$weight)

q <- ggplot(ageweight, aes(x=x,y=y,size=wt))
q + geom_point() + labs(x="Age", y="Weight") + theme(legend.position="none")