Graphics – ggplot2

The ggplot2 package is one of Hadley Wickham’s many wildly successful R packages.

Installing and using a package is easy.

To install:

install.packages("ggplot2")

To use once installed:

library(ggplot2)

The citation that Wickham prefers is available by calling citation("ggplot2").

H. Wickham. ggplot2: elegant graphics for data analysis. Springer New
York, 2009.

The primary advantage of ggplot2 over base R graphics is Wickham’s decision to use natural language rather than highly specific abbreviations. Changing the size of a line is done the same way as changing the size of a point. The second advantage is that a ggplot object stores its data and settings inside itself, where they are easy to reference when making an image. This makes switching techniques effortless.

Let’s look at an example using the iris data.

This is how we create the most common kind of ggplot object. Our data is drawn from the iris dataframe and our two aesthetics are set. The x-axis will show the length of the sepals on the flowers and every data point will be colored based on its species.

p <- ggplot(iris, aes(x = Sepal.Length, fill=Species))

A simple visualization is a stacked histogram, which is what ggplot2 produces by default for this kind of data. Layers are added with the + operator that is usually reserved for addition.

p + geom_histogram()

[Figure stackedgg1: stacked histogram of sepal length, filled by species]
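Called with no arguments, geom_histogram() falls back on 30 bins and prints a message suggesting you pick a value yourself. Here is a minimal sketch of setting the bin width explicitly. The 0.2 is only my guess at a reasonable value for this data, and hist_plot is a name I made up for the example.

```r
library(ggplot2)

# Same object as above: sepal length on the x-axis, fill by species.
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species))

# binwidth is in the units of the x variable (centimetres for iris);
# 0.2 is an illustrative choice, not a tuned one.
hist_plot <- p + geom_histogram(binwidth = 0.2)
hist_plot
```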

That’s serviceable but we can do better. The first thing to do is change to a colorblind-friendly color palette. This has no impact on readability and opens the image up to more readers. Winston Chang provides a pre-packaged vector of hexadecimal color codes that does this for us.

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

The theme layer gives us very fine control over the image. We can save space by moving the legend inside of the plot. Since I make graphics for this site I also change the size of the grid lines so they’re easier to see when the image is small.

p + geom_histogram() +
	theme(legend.position = c(.9, .9)) +
	theme(panel.grid.major = element_line(size = 1)) +
	theme(panel.grid.minor = element_line(size = 1)) +
	scale_fill_manual(values=cbPalette)

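[Figure stackedgg2: stacked histogram with the colorblind-friendly palette, legend inside the plot, heavier grid lines]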
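A note for anyone running a recent ggplot2: as far as I know, version 3.4.0 started steering line widths toward a linewidth argument in place of size, and version 3.5.0 moved numeric legend positions to a separate legend.position.inside setting. As far as I can tell the code above still draws the plot there, just with deprecation warnings. Here is a sketch of the same figure in the newer spelling, assuming ggplot2 3.5.0 or later is installed (hist_new is a made-up name):

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, fill = Species))
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
	"#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Assumes ggplot2 >= 3.5.0; on older versions keep the code shown above.
hist_new <- p + geom_histogram() +
	theme(legend.position = "inside",
		legend.position.inside = c(.9, .9)) +
	theme(panel.grid.major = element_line(linewidth = 1)) +
	theme(panel.grid.minor = element_line(linewidth = 1)) +
	scale_fill_manual(values = cbPalette)
hist_new
```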

Unfortunately a stacked histogram doesn’t seem to be a good way of representing this information. We need another method to represent it. In base R this means going back to the drawing board and carefully calibrating the commands for another type of plot. With ggplot2, however, all the information we want has been stored inside of the object p that we made.
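To make that concrete, here is a small sketch of peeking inside the object. The $data and $mapping fields are how ggplot objects have exposed this in the versions I have used; treat the field names as an assumption if you are on a very new release.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, fill = Species))

# The data frame travels with the object...
head(p$data, 3)

# ...and so do the aesthetic mappings we set with aes().
names(p$mapping)
```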

A density plot might work better. The densities are going to overlap, though, so let’s adjust the transparency argument, called alpha, so we can see through them. We should also throw in the theme elements from before.

p + geom_density(alpha = .5) +
	theme(legend.position = c(.9, .9)) +
	theme(panel.grid.major = element_line(size = 1)) +
	theme(panel.grid.minor = element_line(size = 1)) +
	scale_fill_manual(values=cbPalette)

[Figure stackedgg3: overlapping semi-transparent density plots of sepal length by species]
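Retyping the same four styling lines for every plot gets old. One option, sketched here, is to collect them in a list and add the list like any other layer. site_style and dens_plot are names I made up for this example; the contents are exactly the layers used above.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, fill = Species))
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
	"#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# site_style is a hypothetical name for the styling layers used so far.
site_style <- list(
	theme(legend.position = c(.9, .9)),
	theme(panel.grid.major = element_line(size = 1)),
	theme(panel.grid.minor = element_line(size = 1)),
	scale_fill_manual(values = cbPalette)
)

# Adding a list to a plot adds each of its elements in turn.
dens_plot <- p + geom_density(alpha = .5) + site_style
dens_plot
```

The same site_style list can then be added to the histogram or any other plot built from p.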

Let’s compare that to the code we would use for the same thing in base R.

dens <- tapply(iris$Sepal.Length, iris$Species, density)
plot(dens$setosa$x, dens$setosa$y, 
	xlim=c(4,8.5), 
	type='l', lwd=3, col="#999999",
	xlab='Sepal Length', ylab='Density')
lines(dens$versicolor$x, dens$versicolor$y, 
	lwd=3, col="#E69F00")
lines(dens$virginica$x, dens$virginica$y, 
	lwd=3, col="#56B4E9")
grid()
text(8, c(1.2, 1.1 ,1), 
	c('Setosa', 'Versicolor', 'Virginica'), 
	col=c("#999999", "#E69F00", "#56B4E9"),
	cex=1.5)

[Figure stackedgg4: base R density lines for the three species, labeled with colored text]

I actually like this style of density plot better than the one we just made in ggplot2, though it’s fairly easy to mimic in ggplot2 as well. Nonetheless this is a more elaborate piece of code, a lesson in the R language all by itself. We use tapply() to break down the dataframe the way we want. Then some elaborate subsetting notation gets us the pieces of that new object. We adjust the limits of the x-axis so it fits everything and rewrite the axis labels. The lines() function adds the second and third densities. Since there’s no legend we have to make one up with some clever use of the text() function.

On the other hand ggplot2 saves us a lot of work and does so with much clearer code. That clarity is very helpful if you ever look back at old code.
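For comparison, here is roughly how I would mimic the line-only base R style in ggplot2: map species to the line colour instead of the fill, so only the outlines are drawn. This is a sketch rather than an exact reproduction. It keeps the default legend instead of the hand-placed text labels, and line_plot is a made-up name.

```r
library(ggplot2)

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
	"#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Mapping Species to colour (not fill) gives unfilled density outlines.
line_plot <- ggplot(iris, aes(x = Sepal.Length, colour = Species)) +
	geom_density() +
	scale_colour_manual(values = cbPalette) +
	labs(x = "Sepal Length", y = "Density")
line_plot
```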

In any event this still might not be the ideal way to represent these comparisons. A style of graphic called a violin plot is perfect for this situation. It isn’t available in base R but ggplot2 can create one easily. We’ll have to adjust some things first. Species now needs to go on the x-axis and the length of the sepals will go on the y-axis. Let’s keep the coloring by species but get rid of the legend since we won’t need it.

p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species))
p + geom_violin() +
	theme(panel.grid.major = element_line(size = 1)) +
	theme(panel.grid.minor = element_line(size = 1)) +
	theme(legend.position='none')

[Figure stackedgg5: violin plot of sepal length for each species]

I’d call that a significant improvement! We can see all three of the species without them interrupting each other. The colors are also less of an issue. In fact we can use this kind of plot in black and white if necessary, while the other versions degrade significantly in readability.
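Here is a sketch of what that black and white version might look like. The grey fill and theme_bw() are my own choices for illustration, and bw_plot is a made-up name; nothing above depends on them.

```r
library(ggplot2)

# A fixed grey fill replaces the per-species colours, and theme_bw()
# swaps the grey panel background for a white one.
bw_plot <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
	geom_violin(fill = "grey80") +
	theme_bw()
bw_plot
```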
