Measures of Variability – Highest Density Interval

All these previous measures of spread have one thing in common: They can easily be calculated by hand. Sure, it might be a tedious process for a large amount of data but that’s what graduate students are for. For essentially all of the history of mathematics the sole requirement of descriptive statistics was that a person be able to do the work with a pencil and paper.

This is no longer true, however. Computers can solve problems in seconds that would otherwise take months. We saw this before with probability density.

Just a reminder of our data from before.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

In fact using the density directly is what provides us with our final measure of spread, the highest density interval (or highest density region). The package hdrcde by Rob Hyndman provides us with this functionality.

install.packages("hdrcde")
library(hdrcde)

Let’s just try with the default settings.

hdr(petal)

$hdr
        [,1]     [,2]
99% 3.160326 5.253373
95% 3.377735 5.143393
50% 4.059244 4.700000

$mode
[1] 4.406846

$falpha
       1%        5%       50% 
0.1222823 0.1965873 0.6195948 

That’s a bunch of information. The values under $hdr are the edges of several different highest density intervals; the 99% interval goes from 3.16 to 5.25, for example. You should recognize the concept of the mode for this sort of data: it’s the point where the density is highest. The information recorded as falpha is the density at the edges of each interval.
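Since hdr() returns an ordinary list, each piece of that output is easy to grab on its own. A quick sketch:

h <- hdr(petal)
h$hdr["95%", ]   # edges of the 95% interval
h$mode           # the point of highest density
h$falpha         # density heights at the interval edges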

There are a few ways we can plot this information. The hdr.den() function automatically creates a plot using the hdrcde package, outputting both an image and the information we saw before.

hdr.den(petal)

[Figure hdrden: output of hdr.den(petal), the density with the 99%, 95%, and 50% HDRs marked.]

You might notice something about that image: the density isn’t quite the same shape as the one we’ve seen before! It is a bit smoother. This occurs because Hyndman’s method of calculation is slightly different from the one R normally uses. The results are similar, as a comparison will show, but his method tends to produce a smoother estimate than the density() function does.

What the heck is a highest density region anyway?

Good question. The highest density region, as Hyndman uses the term and as I prefer to use it, is “the interval that contains a certain percentage of the data and for which all points in the region have higher density than those outside the region”. So for the 50% HDR half of the density is within the interval. This doesn’t necessarily mean half the data, though, the way the interquartile range would.
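One way to see the distinction is to count how many observations actually land inside the 50% HDR; the proportion is usually close to half but needn’t be exactly half. A quick sketch, assuming the region is a single interval (as it is here):

h <- hdr(petal, prob = 50)
mean(petal >= h$hdr[1, 1] & petal <= h$hdr[1, 2])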

How do we decide which HDR to use?

Also a good question. In inferential statistics, particularly Bayesian inference, the preferred interval is 95% because it sets a high standard without being useless. For a description, however, this isn’t what we want. The two realistic options are the 50% and 63.2%, which are analogous to the IQR and the standard deviation.

Let’s check them out.

hdr.den(petal, prob=c(50,63.2))

[Figure hdr5063: output of hdr.den() showing the 50% and 63.2% HDRs.]

Let’s check out how these compare to our previous measures. It’s going to take a bit of work to make everything come out nice.

# Find the bandwidth that Hyndman's method uses
hdrbw(petal, HDRlevel=63.2)
0.2554508

# Calculate the density using that bandwidth
d <- density(petal, bw=.255)

# Get the HDR information
h <- hdr(petal, prob = c(50,63.2))
h$hdr
         [,1]     [,2]
63.2% 4.00000 4.752259
50%   4.07637 4.700000

# Make two dataframes with the names and limits
spread1 <- data.frame(vals = c(4.076,4.7,4,4.6), group = c("50","50","IQR","IQR"))
spread2 <- data.frame(vals = c(4,4.752,3.79,4.73), group = c("63.2","63.2","SD","SD"))

# Make a blank ggplot object to work with
p <- ggplot()

# Add layers to the first plot
p + geom_line(aes(x=d$x, y=d$y)) +
	geom_vline(data=spread1, 
		aes(xintercept=vals, 
		color = group),
		show.legend = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8)) +
	labs(x="Petal Length", y="Density", title="IQR and 50% HDR")

# Open a new window to plot in
dev.new()

# Add layers to the second plot
p + geom_line(aes(x=d$x, y=d$y)) +
	geom_vline(data=spread2, 
		aes(xintercept=vals, 
		color = group),
		show.legend = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8)) +
	labs(x="Petal Length", y="Density", title="SD and 63.2% HDR")

[Figure hdrcomp: the IQR against the 50% HDR and the SD against the 63.2% HDR, each drawn on the density.]

Measures of Variability – Standard Deviation

Just a reminder of our data from before.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

The variance is one of the most important descriptors of data in statistics; in fact it has the special title of second central moment. It is, in a sense, analogous to the mean, which is the first raw moment.

The variance is an oddball in the realm of descriptive statistics because, while it is a measure of dispersion, it is not a true descriptive statistic! Here’s the calculation that defines the variance along with the function that does it for us. With the var() function R will determine the corrected sample variance, which is very similar to the exact variance for a large sample and more reasonable for small samples.

mean((petal - mean(petal))^2)
0.2164

var(petal)
0.2208163
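If you’re curious what the correction actually does: var() simply scales the naive calculation by n/(n-1). We can confirm that by hand:

n <- length(petal)
mean((petal - mean(petal))^2) * n/(n - 1)
0.2208163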

Notice that exponent in there. Because of it, even though the length of the petals is measured in centimeters, the variance is in square centimeters. Clearly that is not a description of the data. It is meaningless to say that the length of the petals varies by 0.2164 square centimeters! True, higher variance means that the data is more spread out and lower variance that it’s more compact, but that’s about it.

So why does the variance exist at all?

Like many things the variance has filtered down from the realms of higher mathematics, where its properties are regularly exploited. The history of linear regression, for instance, owes a great deal to the analytical properties of the variance. When evaluating a linear model the variance and related measures provide a method to reduce the entire process to a pair of simple equations.
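To make that concrete, here is a minimal sketch of those equations for simple regression: the slope is the covariance divided by the variance, and the intercept follows from the means. (The petal widths are used here purely as an illustrative response variable.)

x <- versi$Petal.Length
w <- versi$Petal.Width
slope <- cov(x, w) / var(x)            # covariance over variance
intercept <- mean(w) - slope * mean(x)
c(intercept, slope)

coef(lm(w ~ x))  # lm() gives the same answer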

Naturally it is somewhat less useful as a simple description.

Notice also that, like the mean to which it is related, the variance tends to give a lot of weight to outliers.

The standard deviation (also known as sigma or σ) is, by a wide margin, the most popular way of describing how spread out data is. It owes this to its pedigree as a descendant of the variance and to a variety of equations in which it is needed. In fact it is so popular that we usually say that the variance is the square of the SD. However it is more sensible to think of it the other way around: the SD is the square root of the variance. By taking the square root of the variance we solve the issue of having the description in units that make no sense. Square centimeters become centimeters. Square years become years.

R will calculate the corrected sample standard deviation if you ask for it with sd(). Notice again that with a sample size of 50 the values are extremely close.

sqrt(0.2164)
0.4651881

sd(petal)
0.469911

How popular is the standard deviation, you ask? It’s so popular that the MAD, which we talked about in the last post, is usually scaled so that it approximates the SD in an “ideal” scenario. Specifically if the data follows what is known as a normal distribution then the adjusted MAD and the SD will be identical. We’ll talk about the normal distribution in a later post.

It’s not immediately clear why anyone would bother to change the MAD like that when the direct interpretation seems so clear. The reason is twofold. Firstly, it makes the MAD and SD comparable measures; if they are enormously different it will be because of a trait of the data, not because of how they are calculated. Secondly, there are certain equations which need the SD in order to work and, by scaling the MAD to approximate the SD, it can be used in those equations (with some degree of care).
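We can check that comparability against our own data. The two numbers aren’t identical, since the petal lengths aren’t perfectly normal, but they are clearly in the same ballpark:

mad(petal)
0.51891

sd(petal)
0.469911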

Finally let’s plot all of our measures on the density.

The plusminus() custom function we made last time is extremely convenient to use with ggplot, which will read the dataframe quite naturally. We base the position of the MAD on the median and the position of the SD on the mean.

spread <- plusminus(c(4.3, 4.35, 4.26), c(.30, .52, 0.47), c("IQR", "MAD", "SD"))

p <- ggplot(versidf, aes(x=petal))
p + 	geom_density(size=1) +
	geom_vline(data=spread,
		aes(xintercept=vals,
		linetype= group,
		color = group),
		show.legend = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8))

Let’s break down that code really quickly. First we put the versidf dataframe into a ggplot object with petal lengths on the x-axis aesthetic. Next we start adding layers. First we have geom_density() to show us how the data is distributed; the size is increased to 1 for ease of reading. Second we have geom_vline() to create vertical lines, with its data set to read from the spread dataframe we made. The aesthetics of the lines are that they are positioned at each “val” and their color and line type are defined by the “group” variable. Then we tell ggplot to show the legend for geom_vline() and make the lines size 1 so they’re easier to read. The scale_color_manual() layer references the colorblind palette developed by Masataka Okabe and Kei Ito and adapted for R by Winston Chang. Lastly, in the theme() layer we make the grid lines slightly larger.

[Figure iqrmadsd: the density of the petal lengths with the IQR, MAD, and SD marked.]

Measures of Variability – Absolute Deviation

The range, as we discussed last time, isn’t really used as a measure of variability these days and the IQR isn’t terribly informative.

Since I’m a big fan of checking the density of data as a descriptor let’s try that before we continue. Let’s grab those objects we made last time.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

Okay, let’s try it out with both ggplot2 and with the base R functions. Remember that you have to load ggplot2 with the library() function each time you start R!

Just for fun let’s compare the typical Gaussian density with the much more exotic rectangular density. It would be comforting to see that the densities are essentially in agreement even when we change something as fundamental as the kernel. After all, we know that no description of the data will be perfect, so knowing that the description isn’t going to randomly be misleading is nice.

plot(density(petal), col = "#E69F00")
lines(density(petal, kernel = "rectangular"), col = "#000000")

p <- ggplot(versidf, aes(x = petal))
p + 	geom_density(kernel = 'rectangular', size=1, color="#000000") +
	geom_density(size=1, color="#E69F00") +
	theme(panel.grid.major = element_line(size = 1),
		panel.grid.minor = element_line(size = .8))

[Figure rectgaus: Gaussian and rectangular kernel density estimates overlaid.]

Yep, those are both in agreement about where the mass of the data is. There are a lot of other kernels we could use, like the Epanechnikov, triangular, biweight, cosine, and optcosine. They all have particular strengths and weaknesses but they do broadly agree in most cases.
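If you’d like to see that agreement for yourself, here’s a quick sketch overlaying a few of those kernels in base R (the colors are arbitrary):

plot(density(petal), col = "#E69F00")
lines(density(petal, kernel = "epanechnikov"), col = "#56B4E9")
lines(density(petal, kernel = "triangular"), col = "#009E73")
lines(density(petal, kernel = "biweight"), col = "#CC79A7")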

Let’s go back to just using the Gaussian density for now since it’s much easier to read. We can put the measures we developed onto the plot, in both base R and ggplot2, in order to evaluate them. Here’s the IQR shown against the density.

plot(density(petal))
abline(v=c(4,4.6), col = '#33CC33')

p <- ggplot(versidf, aes(x = petal))
p + geom_density(size = 1) +
	geom_vline(xintercept = c(4, 4.6), size = 1, color = '#33CC33') +
	theme(panel.grid.major = element_line(size = 1),
		panel.grid.minor = element_line(size = .8))

[Figure iqrden: the density of the petal lengths with the IQR marked.]

So clearly the IQR is capturing a lot of the density (including the point of highest density) and it does represent a piece of information that is very easy to understand. Unfortunately it isn’t making use of much of the information in the data. Because of this the IQR is not very efficient; it doesn’t generalize very well without a very large sample.

The most popular measures of spread are ones that make use of all of the data points. Of these the most basic is the median absolute deviation. We saw the related concept of the mean absolute deviation in an earlier post when talking about the measures of central tendency. The MAD is the median distance from each data point to the median of the data. Like the median and the IQR the MAD is considered robust, not easily distorted by outliers. It is more efficient than the IQR.

It is somewhat rare to report the strict MAD, though, because the exact number is not of much interest. As a result it is usually scaled to mimic a more popular measure of variability called the standard deviation. R gives us the traditional scaled MAD by default, though we can get the strict version if we want.

mad(petal)
0.51891

mad(petal, constant = 1)
0.35

median(petal)
4.35
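Two things are worth verifying by hand here: the strict MAD really is just the median distance from the median, and the default scaling constant is 1.4826, which is 1/qnorm(3/4), chosen so the scaled MAD approximates the SD for normally distributed data.

median(abs(petal - median(petal)))
0.35

0.35 * 1.4826
0.51891

1/qnorm(0.75)
1.482602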

For convenience later we will now make a function that gives us p+q and p-q as a conveniently structured dataframe. These sorts of convenience functions are extremely useful if you expect to do something many times.

plusminus <- function(p, q, groups = seq(1, max(length(p), length(q)))) {
	a <- p - q                            # the "minus" values
	b <- p + q                            # the "plus" values
	c <- c(a, b)                          # all minus values then all plus values
	d <- data.frame(vals = c, group = groups)
	d[order(d$group),]                    # keep each group's rows together
	}

The code here is very simple; you just have to keep in mind how R treats vectors. There are three arguments: p and q are numeric objects (either scalars or vectors); groups is a vector the same size as the larger of p and q. In the function itself we subtract q from p, add q to p, and then make all of those results into a single vector. Notice that the vector “c” will contain all of the minus values followed by all of the plus values. Finally we put the values into a dataframe along with the groups. For the sake of neatness we then order the dataframe so that all of the groups are together.

plusminus(4.35,c(0.52,0.35))
  vals group
1 3.83     1
3 4.87     1
2 4.00     2
4 4.70     2

You can see immediately that, in this case, the strict MAD (group 2) is about the same as the IQR while the traditional MAD is obviously quite different. Let’s compare the traditional MAD with the IQR in an image. We saw the cbPalette vector of color codes a while back; I’ve deleted the first one so that the lines stand out a little better here.

spread <- plusminus(c(4.3,4.35), c(.3,.52), c('IQR','MAD'))
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

p <- ggplot(versidf, aes(x=petal))
p + 	geom_density(size=1) +
	geom_vline(data=spread,
		aes(xintercept=vals,
		color = group),
		show.legend = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = 1),
		panel.grid.minor = element_line(size = .8))

[Figure madiqr: the density of the petal lengths with the traditional MAD and the IQR marked.]

Measures of Variability – The Ranges

The measures of central tendency tell us how far from zero the data tends to be. The measures of variability tell us how far from the center the data tends to be. Obviously we know that all of the data isn’t exactly at the mean or median. Just for the sake of illustration let’s look at some sample data from the iris dataframe. This will also give us a chance to look at more ggplot2 code.

For clarity let’s start by defining a few objects we’ll be using later.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

versi holds all the information about the versicolor irises and petal extracts just the information about the length of the petals. y is just a sequence from 1 to 50 to help us keep track of individual observations. Finally we make versidf to hold just the petal lengths and the assigned numbers.

p <- ggplot(versidf, aes(y = y, x = petal))
p + geom_point(size = 3) + 
	theme(panel.grid.major = element_line(size = 1)) +
	theme(panel.grid.minor = element_line(size = 1)) +
	ylab("") +
	scale_y_continuous(labels=NULL) +
	theme(axis.ticks.y = element_blank()) +
	geom_vline(xintercept = 4.26, size = 1.5) +
	scale_x_continuous(limits=c(3,5.5))

[Figure versi1: dot plot of the petal lengths with a vertical line at the mean.]

Most of the ggplot code there is just getting the plot to look exactly the way I wanted. The familiar theme elements make the grid lines a nicer size. Next we have several lines getting rid of everything that would normally be on the y-axis, since there’s no information there; we’re just separating out the various data points so we can see them. Finally we use geom_vline() to make a vertical line at 4.26 (the mean of the petal lengths).

Clearly the data isn’t all in the same place.

When we discussed the mean we touched on the idea of the mean absolute error, which measures how far from the mean the data is, on average. This could be used as a measure of variability but isn’t common. We’ll come back to a cousin of it later in our discussion of variability.
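For the curious, it’s a one-liner in R, although you shouldn’t expect to see it reported often:

mean(abs(petal - mean(petal)))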

The simplest measure of variation is the range, the distance between the lowest value and the highest value. With the range() function we can retrieve the min() and max() values of the data. The diff() function finds the difference between them.

petal <- versi$Petal.Length

range(petal)
3.0 5.1

diff(range(petal))
2.1

This isn’t a very good description of variability. While it gives a vague notion of where the data is (everything is within the range) it doesn’t tell us where the data clusters within that range. A single mutant flower with an extremely long petal would totally alter the range despite not reflecting where the majority of the data is.

The main reason that people look at the range is that in certain kinds of data missing information will be coded as extreme values like 9999 in order to warn people that something has to be dealt with. This is done to avoid a program silently removing values that are coded as missing.
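It’s easy to see how a single miscoded value wrecks the range. A quick sketch with a hypothetical 9999 appended to our data:

bad <- c(petal, 9999)
range(bad)
3 9999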

A more effective way to measure variability is with the interquartile range (often just IQR). The 1st and 3rd quartiles cut off the bottom 25% and the top 25% of the data (the median is the 2nd quartile), so the middle half of the data is between them. It is somewhat better than the range but not enormously so.

The easiest way to get the IQR is with the summary() and IQR() functions. With summary() you can see the 1st and 3rd quartiles (along with the min, max, median, and mean).

summary(petal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00    4.00    4.35    4.26    4.60    5.10
 
IQR(petal)
0.6
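The same interval can be pulled out directly with quantile(), which confirms the edges at 4.0 and 4.6:

quantile(petal, c(0.25, 0.75))
25% 75% 
4.0 4.6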

The IQR has the advantage of cutting off extreme values so a single mutant flower isn’t going to throw it off. In fact the IQR is so popular that it has been enshrined in the boxplot.

Let’s look at two ways to visualize the IQR. First we just highlight the data.

vedf <- with(versidf, versidf[petal >= 4 & petal <= 4.6,])

p <- ggplot(versidf, aes(y = y, x = petal))
p + geom_point(size = 3) + geom_vline(xintercept = 4.26, size = 1.5) +
	theme(panel.grid.major = element_line(size = 1)) +
	theme(panel.grid.minor = element_line(size = 1)) +
	ylab("") +
	scale_y_continuous(labels=NULL) +
	theme(axis.ticks.y = element_blank()) +
	geom_point(data = vedf, aes(y = y, x= petal),color='red',size=3)

[Figure versi2: dot plot with the points inside the IQR highlighted in red.]

As a boxplot.

p <- ggplot(versidf, aes(y = petal, x = ''))
p + geom_boxplot() +
	scale_x_discrete(breaks = NULL)

[Figure versibox: Tukey boxplot of the petal lengths.]

The boxplot() function in base R and the geom_boxplot() layer in ggplot2 will automatically create a Tukey boxplot. The edges of the box are at the 1st and 3rd quartiles. The middle line is the median. The whiskers extend to the most extreme data points that are no more than 1.5 times the IQR from the 1st and 3rd quartiles. Outliers are shown as dots.
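As a sanity check we can compute Tukey’s fences for our data ourselves: 1.5 times the IQR of 0.6 is 0.9, so the fences sit at 4.0 - 0.9 = 3.1 and 4.6 + 0.9 = 5.5. Notice that the shortest petal, at 3.0, falls just below the lower fence, which is why it should appear as an outlier dot.

c(4.0 - 1.5 * 0.6, 4.6 + 1.5 * 0.6)
3.1 5.5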

Boxplots have both strengths and weaknesses. Compared to violin plots or densities they show us less information about the data, however they are less subjective and showing the IQR can be very useful. Identifying outliers via Tukey’s method is also a nice touch.