Boxes and Violins

While making this post it occurred to me that it’s very inelegant to calculate the mean, standard error, and groups the way we’ve been doing it. A simple function will wrap up all of that work for us and add a bit more functionality.

The argument var will be the variable we want to know about, group will be the column of the data where the group names are, and int will be the size of the interval in standard errors (set to one standard error by default, but an interval of two standard errors is also popular). The output is a data frame holding the standard error (already multiplied by int), the mean, and the group names.

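mean.se <- function(var, group, int = 1) {
	# Standard error per group: the group's sd over the square root of the group's own n.
	n <- tapply(var, group, length)
	se <- tapply(var, group, sd) / sqrt(n)
	# Scale to the requested interval (e.g. int = 2 for two standard errors).
	se <- se * int
	m <- tapply(var, group, mean)
	# names(m) works whether group is a factor or a character vector.
	group <- names(m)

	data.frame(se, m, group)
}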

Much easier to reference later! We’ll get very good use out of it on Friday.
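
As a quick sanity check, the function can be run on the built-in PlantGrowth data and one group's result compared with a hand calculation. This is only a minimal sketch: it assumes the standard error is computed from each group's own sample size, and the definition is repeated so the snippet runs on its own. The names plants1, plants2 and ctrl.se are just for illustration.

```r
# mean.se() repeated here so the example is self-contained.
# Assumption: each group's sd is divided by the square root of that group's n.
mean.se <- function(var, group, int = 1) {
	n <- tapply(var, group, length)
	se <- tapply(var, group, sd) / sqrt(n) * int
	m <- tapply(var, group, mean)
	data.frame(se, m, group = names(m))
}

# One standard error (the default) and two standard errors.
plants1 <- with(PlantGrowth, mean.se(weight, group))
plants2 <- with(PlantGrowth, mean.se(weight, group, int = 2))

# Hand calculation for the control group, for comparison.
ctrl <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]
ctrl.se <- sd(ctrl) / sqrt(length(ctrl))

print(plants1)
print(plants2)
print(ctrl.se)
```

The se column of plants2 should be exactly twice that of plants1, and the ctrl row of plants1 should match the hand calculation.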

Last time we looked at barplots and dotplots as ways to graphically represent the frequently called for information of the mean and the standard error. It’s hard to represent this information more efficiently than either of those methods. As a result, the main alternatives aim to represent much more information.

Let's take a look at adaptations of a boxplot and a violin plot.

[Figure: boxandviolin]

In the violin plot we essentially just put the dotplot’s point and error bar over the density. In the boxplot we have shaded the range that covers one standard error on either side of the mean.

The biggest advantage of these methods, in terms of the standard error, is that they’re not arbitrarily magnified: both of them force the range of the plot to be approximately the range of the data. Nonetheless, they bring along the potential pitfalls of their base plots. In particular, a violin plot tends to imply that there is a relatively large amount of data, so it isn’t really appropriate with only ten data points per group like we have here.
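
To make the point about scale concrete, the sketch below compares the width of each group's one-standard-error interval with the full range of the PlantGrowth weights, which is roughly what the y-axis of these plots spans. It also prints the group sizes. The variable names are illustrative, and the standard error is assumed to use each group's own sample size.

```r
# Per-group n and standard error for the built-in PlantGrowth data.
n <- with(PlantGrowth, tapply(weight, group, length))
se <- with(PlantGrowth, tapply(weight, group, sd)) / sqrt(n)

# Width of the mean +/- one standard error interval for each group.
se.width <- 2 * se

# Width of the whole data range.
data.width <- diff(range(PlantGrowth$weight))

# Group sizes, then the share of the data range that each interval occupies.
print(n)
print(se.width / data.width)
```

Each interval takes up only a fraction of the axis, so it is shown at its true scale relative to the data rather than stretched to fill the plot.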

As advertised, they also show a lot more information than a simpler plot would. This is much more useful when the data happens to be skewed or strangely shaped. In that spirit, let’s check out some data made from a lognormal distribution.

dist <- c(rlnorm(30, .15 , .35), rlnorm(30, .3, .5))
group <- c(rep("a",30),rep("b",30))
dat <- data.frame(dist,group)

[Figure: boxandviolin2]

The data here is radically skewed so having the density or boxplot available is extremely useful compared to just presenting the mean and standard error.
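
One way to see the skew numerically is to compare each group's mean with its median, since a long right tail pulls the mean upward. The sketch below does this for a simulated sample like the one above and for the theoretical lognormal values. The seed and the names sample.stats, meanlog, sdlog and theory are arbitrary choices made for illustration.

```r
# Arbitrary seed so the simulated sample is reproducible.
set.seed(1)
dist <- c(rlnorm(30, .15, .35), rlnorm(30, .3, .5))
group <- factor(c(rep("a", 30), rep("b", 30)))

# Sample mean and median for each group.
sample.stats <- data.frame(
	mean = as.numeric(tapply(dist, group, mean)),
	median = as.numeric(tapply(dist, group, median)),
	row.names = levels(group)
)
print(sample.stats)

# Theoretical values for a lognormal with parameters meanlog and sdlog:
# mean = exp(meanlog + sdlog^2 / 2), median = exp(meanlog).
meanlog <- c(a = .15, b = .3)
sdlog <- c(a = .35, b = .5)
theory <- data.frame(
	mean = exp(meanlog + sdlog^2 / 2),
	median = exp(meanlog)
)
print(theory)
```

In theory the mean always sits above the median for a lognormal distribution. A sample of 30 will usually, though not necessarily, show the same ordering.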

The graphics code.

Thanks to our custom mean.se() function we can streamline the process very nicely. The important thing to keep in mind is that the violin plot and boxplot require sending all of the data to the graphics code. We have to feed it the mean and standard error separately.

For the PlantGrowth data:

# The plotting code below needs ggplot2.
library(ggplot2)

plants <- with(PlantGrowth,mean.se(weight,group))

# Create a base with all of the stylistic code we want. Black and white theme with larger lines.
base <- ggplot(PlantGrowth,aes(x=group, y=weight)) +
	theme_bw() +
	theme(panel.grid.major = element_line(size = 1))

# Add a violin layer to the base then a pointrange layer that brings in the new data.
base + geom_violin(size=1) +
	geom_pointrange(data=plants, aes(x=group, y=m, ymax=m+se, ymin=m-se),
		size=1)

# Add a boxplot layer to the base then a crossbar layer that brings in the new data.
base + geom_boxplot(size=1) + 
	geom_crossbar(data = plants, aes(x=group, y=m, ymin=m-se, ymax=m+se), 
	color=NA, fill="grey", alpha=.5, width=.75)

For the lognormal data:

# Make some random lognormal data with the rlnorm() function.
# We also define slightly different shapes for them.
dist <- c(rlnorm(30, .15 , .35), rlnorm(30, .3, .5))
group <- c(rep("a",30),rep("b",30))

# Bring the data together into a dataframe.
dat <- data.frame(dist,group)

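# Compute the group means and standard errors.
# factor() guarantees the grouping column has levels even when data.frame()
# keeps strings as character (the default since R 4.0).
datse <- mean.se(dat$dist, factor(dat$group))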

# The same basic code we used before.
base <- ggplot(dat, aes(x=group,y=dist)) +
	theme_bw() +
	theme(panel.grid.major = element_line(size = 1))

base +	geom_violin(size=1) +
	geom_pointrange(data=datse, aes(x=group, y=m, ymax=m+se, ymin=m-se),
		size=1)

base +	geom_boxplot(size=1) +
	geom_crossbar(data=datse, aes(x=group, y=m, ymax=m+se, ymin=m-se),
		fill='gray',alpha=.5,width=.75,color=NA)

On Friday we’ll pull this work together and learn more about making functions with R.
