June | 2014 | Intuitor

# Find the bandwidth the Hyndman uses hdrbw(petal, HDRlevel=63.2) 0.2554508 # Caluclate the density using the bandwidth d <- density(petal, bw=.255) # Get the HDR infomration h <- hdr(petal, prob = c(50,63.2)) h$hdr [,1] [,2] 63.2% 4.00000 4.752259 50% 4.07637 4.700000 # Make two dataframes with the names and limits spread1 <- data.frame(vals = c(4.076,4.7,4,4.6), group = c("50","50","IQR","IQR")) spread2 <- data.frame(vals = c(4,4.752,3.79,4.73), group = c("63.2","63.2","SD","SD")) # Make a blank ggplot object to work with p <- ggplot() # Add layers to the first plot p + geom_line(aes(x=d$x, y=d$y)) + geom_vline(data=spread1, aes(xintercept=vals, color = group), show_guide = TRUE, size = 1) + scale_color_manual(values=cbPalette) + theme(panel.grid.major = element_line(size = .8), panel.grid.minor = element_line(size = .8)) + labs(x="Petal Length", y="Density", title="IQR and 50% HDR") # Open a new window to plot in dev.new() # Add layers to the second plot p + geom_line(aes(x=d$x, y=d$y)) + geom_vline(data=spread2, aes(xintercept=vals, color = group), show_guide = TRUE, size = 1) + scale_color_manual(values=cbPalette) + theme(panel.grid.major = element_line(size = .8), panel.grid.minor = element_line(size = .8)) + labs(x="Petal Length", y="Density", title="SD and 63.2% HDR")

Just a reminder of our data from before.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

The variance is one of the most important descriptors of data in statistics, in fact it has the special title of second central moment. It is, in a sense, analogous to the mean which is the first raw moment.

The variance is an oddball in the realm of descriptive statistics because, while it is a measure of dispersion, it is not a true descriptive statistic! Here’s the equation that calculates the variance and the function the does it. With the var() function R will determine the corrected sample variance, which is very similar to the exact variance for a large sample and more reasonable for small samples.

mean((petal – mean(petal))^2)
0.2164

var(petal)
0.2208163

Notice that exponent sign in there. Because of that even though the length of the petals is measured in centimeters the variance is in square centimeters. Clearly that is not a description of the data. It is meaningless to say that the length of the petals varies by 0.2164 square centimeters! True, higher variance means that the data is more spread out and lower variance that it’s more compact but that’s about it.

So why does the variance exist at all?

Like many things the variance has filtered down from the realms of higher mathematics where its properties are regularly exploited. The history of linear regression, for instance, owes a great deal to the analytical properties of the variance. When evaluating a linear model the variance and related measures provides a method to reduce the entire process to a pair of simple equations.

Naturally it is somewhat less useful as a simple description.

Notice also that, like the mean to which it is related, the variance tends to give a lot of weight to outliers.

The standard deviation (also known as sigma or σ) is, by a wide margin, the most popular way of describing how spread out data is. It owes this to its pedigree as a descendant of the variance and to a variety of equations in which it is needed. In fact it is so popular that we usually say that the variance is the square of the SD. However it is more sensible to think of it the other way around, the SD is the square root of the variance. By taking the square root of the variance we solve the issue of having the description in units that make no sense. Square centimeters become centimeters. Square years become years.

If asked R will calculate the corrected sample standard deviation if you ask for it. Notice again that with a sample size of 50 the values are extremely close.

sqrt(0.2164)
0.4651881

sd(petal)
0.469911

How popular is the standard deviation, you ask? It’s so popular that the MAD, which we talked about in the last post, is usually scaled so that it approximates the SD in an “ideal” scenario. Specifically if the data follows what is known as a normal distribution then the adjusted MAD and the SD will be identical. We’ll talk about the normal distribution in a later post.

It’s not immediately clear why anyone would bother to change the MAD like that when the direct interpretation seems so clear. The reason is twofold, firstly it means that MAD and SD are now comparable measures, if there are enormously different it will be because of a trait of the data not because of how they are calculated. Secondly there certain are equations which need the SD in order to work and, by scaling the MAD to approximated the SD it can be used in the equations (with some degree of care).

Finally let’s plot all of our measures on the density.

The plusminus() custom function we made last time is extremely convenient to use with ggplot, which will read the dataframe quite naturally. We base the position of the MAD on the median and the position of the SD on the mean.

spread <- plusminus(c(4.3, 4.35, 4.26), c(.30, .52, 0.47), c("IQR", "MAD", "SD"))

p <- ggplot(versidf, aes(x=petal))
p + 	geom_density(size=1) +
	geom_vline(data=spread,
		aes(xintercept=vals,
		linetype= group,
		color = group),
		show_guide = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8))

Let’s break down that code really quickly. First we put the versidf dataframe into a ggplot object with petal lengths on the x-axis aesthetic. Next we start adding layers. First we have geom_density() to show us how the data is distributed, the size is increased to 1 for ease of reading. Second we have geom_vline() to create vertical lines, the data is set to read from the spread dataframe we made. The aesthetics of the lines are that they are positioned at each “val” and their color and type are defined by the “group” variable. Then we tell ggplot to show the legend for geom_vline() and made the lines size 1 so they’re easier to read. The scale_color_manual() layer references the colorblind palette developed by Masataka Okabe and Kei Ito and adapted for R by Winston Chang. Lastly in the theme() layer we make the grid lines slightly larger.

Intuitor

Getting a grip on statistics, science, and philosophy.

Monthly Archives: June 2014

Measures of Variability – Highest Density Interval

Measures of Variability – Standard Deviation