Measures of Variability – The Ranges

The measures of central tendency tell us where the center of the data is. The measures of variability tell us how far from that center the data tends to be. Obviously not all of the data sits exactly at the mean or median. Just for the sake of illustration let’s look at some sample data from the iris dataframe. This will also give us a chance to look at more ggplot2 code.

For clarity let’s start by defining a few objects we’ll be using later.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

versi holds all the information about the versicolor irises, and petal extracts just the lengths of the petals. y is just a sequence from 1 to 50 to help us keep track of individual observations. Finally we make versidf to hold just the petal lengths and the assigned numbers.

library(ggplot2)

p <- ggplot(versidf, aes(y = y, x = petal))
p + geom_point(size = 3) + 
	theme(panel.grid.major = element_line(size = 1)) +
	theme(panel.grid.minor = element_line(size = 1)) +
	ylab("") +
	scale_y_continuous(labels=NULL) +
	theme(axis.ticks.y = element_blank()) +
	geom_vline(xintercept = 4.26, size = 1.5) +
	scale_x_continuous(limits=c(3,5.5))

versi1

Most of the ggplot code there is just getting the plot to look exactly the way I wanted. The familiar theme elements make the grid lines a nicer size. Next we have several lines getting rid of everything that would normally be on the y-axis, since there’s no information on the y-axis; we’re just separating out the various data points so we can see them. Finally we use geom_vline() to make a vertical line at 4.26 (the mean of the petal lengths).
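
The 4.26 passed to geom_vline() does not have to be typed in by hand. A minimal sketch, assuming the built-in iris data frame; the name petal_mean is only illustrative:

```r
# Rebuild the petal lengths so this snippet runs on its own
versi <- iris[iris$Species == "versicolor", ]
petal <- versi$Petal.Length

# Compute the mean once and reuse it wherever the plot needs it
petal_mean <- mean(petal)
round(petal_mean, 2)
```

The stored value could then be supplied as xintercept = petal_mean, which keeps the line in the right place if the data changes.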

Clearly the data isn’t all in the same place.

When we discussed the mean we touched on the idea of the mean absolute error, which measures how far from the mean the data is, on average. This could be used as a measure of variability but isn’t common. We’ll come back to a cousin of it later in our discussion of variability.
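
As a rough sketch of that idea, the average absolute distance from the mean can be computed in two steps. The names abs_dev and mean_abs_dev are illustrative, not from any package:

```r
# Petal lengths for the versicolor irises
petal <- iris$Petal.Length[iris$Species == "versicolor"]

# Distance of each observation from the mean, ignoring direction
abs_dev <- abs(petal - mean(petal))

# Average of those distances
mean_abs_dev <- mean(abs_dev)
mean_abs_dev
```

Note that base R’s mad() function is not this statistic: it computes a scaled median absolute deviation from the median.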

The simplest measure of variation is the range, the distance between the lowest value and the highest value. With the range() function we can retrieve the min() and max() values of the data. The diff() function finds the difference between them.

petal <- versi$Petal.Length

range(petal)
3.0 5.1

diff(range(petal))
2.1

This isn’t a very good description of variability. It gives a vague notion of where the data is, since everything is within the range, but it doesn’t tell us where the data clusters within the range. One mutant flower with an unusually long petal, say 10 centimeters, would totally alter the range despite not reflecting where the majority of the data is.
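
To see how sensitive the range is, here is a small sketch that appends one made-up observation to the real data. The 10 centimeter value is purely hypothetical:

```r
petal <- iris$Petal.Length[iris$Species == "versicolor"]

# Add a single hypothetical mutant flower with a 10 cm petal
with_mutant <- c(petal, 10)

diff(range(petal))        # range of the real data
diff(range(with_mutant))  # range after adding one extreme value
```

One extra observation out of 51 is enough to more than triple the range.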

The main reason people look at the range is that in certain kinds of data, missing information is coded as an extreme value like 9999 in order to warn people that something has to be dealt with. This is done to avoid a program silently removing values that are coded as missing. An impossible minimum or maximum in the range makes such codes easy to spot.
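
A minimal sketch of that check, using made-up measurements in which 9999 stands for a missing value. The vector measured and its contents are hypothetical:

```r
# Hypothetical data: 9999 is a code for "missing", not a real measurement
measured <- c(4.7, 4.5, 9999, 4.0, 4.6, 9999, 3.3)

# The implausible maximum reveals the missing-value code
range(measured)

# Recode the sentinel as NA, then ask for the range of the real values
measured[measured == 9999] <- NA
clean_range <- range(measured, na.rm = TRUE)
clean_range
```

Recoding to NA makes the missingness explicit, so later functions have to be told (with na.rm = TRUE) to skip those values.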

A more effective way to measure variability is with the interquartile range (often IQR). The 1st and 3rd quartiles cut off the bottom 25% and top 25% of the data, respectively (the median is the 2nd quartile), so the middle half of the data is between them. It is somewhat better than the range but not enormously so.
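
The definition can be checked by hand with quantile(). This is only a sketch; the names q and iqr_by_hand are illustrative:

```r
petal <- iris$Petal.Length[iris$Species == "versicolor"]

# 1st and 3rd quartiles (the 25th and 75th percentiles)
q <- quantile(petal, probs = c(0.25, 0.75))
q

# The IQR is the distance between them
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand
```

With the default settings this agrees with the built-in IQR() function.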

The easiest way to get the IQR is with the summary() and IQR() functions. With summary() you can see the 1st and 3rd quartiles (along with the min, max, median, and mean).

summary(petal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00    4.00    4.35    4.26    4.60    5.10
 
IQR(petal)
0.6

The IQR has the advantage of cutting off extreme values, so a single mutant flower isn’t going to throw it off. In fact the IQR is so popular that it has been enshrined in the boxplot.
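
Here is a quick sketch of that robustness, again using a hypothetical 10 centimeter petal as the extreme value:

```r
petal <- iris$Petal.Length[iris$Species == "versicolor"]

# One made-up extreme observation added to the real data
with_mutant <- c(petal, 10)

iqr_before <- IQR(petal)
iqr_after  <- IQR(with_mutant)

# How much did the IQR move?
iqr_after - iqr_before
```

The extreme value lands in the top quarter of the data, which the IQR ignores, so the result barely moves.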

Let’s look at two ways to visualize the IQR. First we just highlight the data.

vedf <- with(versidf, versidf[petal >= 4 & petal <= 4.6,])

p <- ggplot(versidf, aes(y = y, x = petal))
p + geom_point(size = 3) + geom_vline(xintercept = 4.26, size = 1.5) +
	theme(panel.grid.major = element_line(size = 1)) +
	theme(panel.grid.minor = element_line(size = 1)) +
	ylab("") +
	scale_y_continuous(labels=NULL) +
	theme(axis.ticks.y = element_blank()) +
	geom_point(data = vedf, aes(y = y, x= petal),color='red',size=3)


Second, as a boxplot.

p <- ggplot(versidf, aes(y = petal, x = ''))
# x is mapped to a character value, so it needs a discrete scale
p + geom_boxplot() +
	scale_x_discrete(breaks=NULL)

versi2

versibox

The boxplot() function in base R and the geom_boxplot() layer in ggplot2 will automatically create a Tukey boxplot. The edges of the box are at the 1st and 3rd quartiles. The middle line is the median. Each whisker extends to the most extreme data point that is no more than 1.5 times the IQR beyond the 1st or 3rd quartile. Points beyond the whiskers are outliers and are shown as dots.
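
The whisker rule can be sketched by hand. The names lower_fence, upper_fence, outliers, and whisker_ends are illustrative. This version uses quantile() for the quartiles; base R’s boxplot() works from hinges, which can differ slightly on some data:

```r
petal <- iris$Petal.Length[iris$Species == "versicolor"]

q1 <- quantile(petal, 0.25, names = FALSE)
q3 <- quantile(petal, 0.75, names = FALSE)

# Tukey's fences: 1.5 times the IQR beyond each quartile
step <- 1.5 * (q3 - q1)
lower_fence <- q1 - step
upper_fence <- q3 + step

# Points outside the fences are drawn as dots
outliers <- petal[petal < lower_fence | petal > upper_fence]
outliers

# Whiskers end at the most extreme points still inside the fences
whisker_ends <- range(petal[petal >= lower_fence & petal <= upper_fence])
whisker_ends
```

For comparison, boxplot.stats(petal)$out reports the points that base R treats as outliers.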

Boxplots have both strengths and weaknesses. Compared to violin plots or densities they show us less information about the data. However, they are less subjective, and showing the IQR can be very useful. Identifying outliers via Tukey’s method is also a nice touch.