All these previous measures of spread have one thing in common: They can easily be calculated by hand. Sure, it might be a tedious process for a large amount of data but that’s what graduate students are for. For essentially all of the history of mathematics the sole requirement of descriptive statistics was that a person be able to do the work with a pencil and paper.
This is no longer true, however. Computers can solve problems in seconds that would otherwise take months. We saw this before with probability density.
Just a reminder of our data from before.
versi <- iris[iris$Species=='versicolor',] petal <- versi$Petal.Length y <- 1:50 versidf <- data.frame(y, petal)
In fact using the density directly is what provides us with our final measure of spread, the highest density interval (or highest density region). The package hdrcde by Rob Hyndman provides us with this functionality.
install.packages(hdrcde) library(hdrcde)
Let’s just try with the default settings.
hdr(petal) $hdr [,1] [,2] 99% 3.160326 5.253373 95% 3.377735 5.143393 50% 4.059244 4.700000 $mode [1] 4.406846 $falpha 1% 5% 50% 0.1222823 0.1965873 0.6195948
That’s a bunch of information. The hdrs are the edges of several different highest density intervals, the 99% interval goes from 3.16 to 5.25 for example. You should recognize the concept of the mode for this sort of data, though, it’s the point where the density is highest. The information recorded as falpha is the density at the edges of each interval interval.
There are a few ways we can plot this information. This command automatically creates a plot using the hdrcde package. The hdr.den() function outputs both an image and that information we saw before.
hdr.den(petal)
You might notice something about that image, the density isn’t quite the same shape as the one we’ve seen before! It is a bit smoother. This occurs because Hyndman’s method of calculation is slightly different from the one R normally uses. The results are similar, as a comparison will show, but tend to produce a smoother estimate than the densitry() function does.
What the heck is a highest density region anyway?
Good question. The highest density region, as Hyndman uses the term and as I prefer to use it, is “the interval that contains a certain percentage of the data and for which all points in the region have higher density than those outside the region”. So for the 50% HDR half of the density is within the interval, this doesn’t necessarily mean half the data, though, like the interquartile range would.
How do we decide which HDR to use?
Also a good question. In inferential statistics, particularly Bayesian inference, the preferred interval is 95% because it sets a high standard without being useless. For a description, however, this isn’t what we want. The two realistic options are the 50% and 63.2%, which are analagous to the IQR and the standard deviation.
Let’s check them out.
hdr.den(petal, prob=c(50,63.2))
Let’s check out how these compare to our previous measures. It’s going to take a bit of work to make everything come out nice.
# Find the bandwidth the Hyndman uses hdrbw(petal, HDRlevel=63.2) 0.2554508 # Caluclate the density using the bandwidth d <- density(petal, bw=.255) # Get the HDR infomration h <- hdr(petal, prob = c(50,63.2)) h$hdr [,1] [,2] 63.2% 4.00000 4.752259 50% 4.07637 4.700000 # Make two dataframes with the names and limits spread1 <- data.frame(vals = c(4.076,4.7,4,4.6), group = c("50","50","IQR","IQR")) spread2 <- data.frame(vals = c(4,4.752,3.79,4.73), group = c("63.2","63.2","SD","SD")) # Make a blank ggplot object to work with p <- ggplot() # Add layers to the first plot p + geom_line(aes(x=d$x, y=d$y)) + geom_vline(data=spread1, aes(xintercept=vals, color = group), show_guide = TRUE, size = 1) + scale_color_manual(values=cbPalette) + theme(panel.grid.major = element_line(size = .8), panel.grid.minor = element_line(size = .8)) + labs(x="Petal Length", y="Density", title="IQR and 50% HDR") # Open a new window to plot in dev.new() # Add layers to the second plot p + geom_line(aes(x=d$x, y=d$y)) + geom_vline(data=spread2, aes(xintercept=vals, color = group), show_guide = TRUE, size = 1) + scale_color_manual(values=cbPalette) + theme(panel.grid.major = element_line(size = .8), panel.grid.minor = element_line(size = .8)) + labs(x="Petal Length", y="Density", title="SD and 63.2% HDR")