Standard Error

We've used the standard error several times on this blog without going into much detail about it. In the simplest possible terms, the SE tells us how far we expect a sample statistic (typically the mean) to fall from its true value. Specifically, it is the standard deviation of the distribution of sample means.

This requires a bit of abstract thought. If you take many samples from a population you will not always get the same mean value; after taking many samples you'll have a distribution of those means. We are attempting to describe the characteristics of something one step more abstract than the mean itself. For a more concrete example, refer to the post on efficient estimators.

When we say “the average weight of a baseball player is 200 pounds” we mean that in our sample the average weight is 200 pounds and that we believe this is close to the true average of the population. With the standard error, confidence interval, or credible interval we quantify how close we expect it to be.

Using our baseball player data from a while back, let’s look at how the standard error works in an ideal situation. We know the true mean of the population so let’s take a bunch of random samples and see how often we capture the true value of the mean.

For convenience make the weight its own object:

weight <- tab$weight

Here is the equation that defines the standard error, first in common notation and then as a custom R function.

\displaystyle se = \frac{s}{\sqrt{n}} = \sqrt{\frac{s^2}{n}}
. . . where s is the sample standard deviation (s² is the sample variance) and n is the sample size.

se <- function(x) {sd(x)/sqrt(length(x))}

Now we use replicate() and sample() to take 1000 samples of 100.

r <- 1000
size <- 100
samps <- replicate(r,sample(weight,size))
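A quick check on the shape of the result shows that each column is one sample of 100 weights:

dim(samps)
[1]  100 1000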

Using the apply() function we get the mean and standard error of each sample.

means <- apply(samps,2,mean)
ses <- apply(samps,2,se)

se.h <- means+ses
se.l <- means-ses

Add up all of the instances where the top of the interval is below the true mean or the bottom of the interval is above the true mean. These are the instances where an interval of one standard error on either side of the sample mean has missed the true value.

a <- sum(se.h < 201.6893) + sum(se.l > 201.6893)
1 - a/r
[1] 0.709

That's actually a bit better than we should expect: an interval of one standard error on either side of the sample mean should capture the true mean roughly 68% of the time. Keep in mind that this coverage relies on the sampling distribution of the mean being approximately normal.
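Where does that 68% figure come from? It is simply the probability that a normally distributed value lands within one standard deviation of its mean, which we can check directly in R:

# probability that a normal variable falls within one SD of its mean
pnorm(1) - pnorm(-1)
[1] 0.6826895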

So where does this equation come from?

As you saw before, we calculate the standard error as the standard deviation divided by the square root of the sample size or, equivalently, as the square root of the ratio of the variance to the sample size.

We do this because of the Bienaymé formula, which tells us that the variance of a sum of independent variables is the sum of their individual variances (i.e. if you draw 10 numbers from each of two populations and add together each pair, the variance of those sums is the sum of the two variances).

Consider the equation that defines the mean for some data X:
\displaystyle \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

Now imagine a sample of exactly one value. The mean is equal to that value (itself divided by 1) and the variance of those means is exactly the population variance. When we draw a larger sample we can treat it as many samples of one (which it is), so by the Bienaymé formula the variance of the sum of our sample is n times the variance of a single draw, which is approximately n times the population variance. Dividing the sum by n to get the mean divides its variance by n², which leaves the variance of the sample mean at roughly the population variance over n.
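Written out compactly, using the fact that multiplying a variable by a constant c multiplies its variance by c², the same argument is:

\displaystyle \mathrm{Var}(\bar{x}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(x_i) = \frac{\sigma^2}{n} \qquad se = \sqrt{\mathrm{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}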

This is not a rigorous mathematical proof, of course; the intention is to understand intuitively why the standard error takes the form it does. Personally, I find it comforting to know that the standard error is not produced out of thin air.

Normalization

Here’s a simple but common problem: How do you compare two groups that have different scales?

Imagine we are giving assessment tests to several school classes but all of the tests have been designed differently. Test A ranges from 0 to 100, in increments of a tenth of a point. Test B only offers scores between 0 and 12. Finally, Test C is scored between 200 and 800 for arcane reasons known only to its designers. Here is the data we collected; it can all be copied directly into R.

a <- c(69.6, 74.8, 62.7, 54.2, 66.8,
	48.7, 48.7, 67.8, 69.1, 85.1,
	51.2, 54.6, 53.9, 56.4, 51.5,
	73.5, 53.4, 76.1, 60.5, 62.7,
	54.9, 66.8, 54.3, 49.2, 57.3,
	61.3, 61.8, 70.4, 51.1, 53.4)
b <- c(3, 5, 5, 6, 7,
	4, 6, 8, 5, 4,
	5, 5, 6, 6, 5,
	4, 5, 7, 6, 6,
	2, 5, 5, 0, 6,
	5, 4, 5, 6, 4)
c <- c(560, 529, 593, 607, 627,
	597, 544, 531, 584, 510,
	420, 543, 577, 623, 531,
	598, 508, 555, 642, 490,
	596, 626, 597, 514, 602,
	569, 635, 551, 518, 617)

group <- c(rep('a',30),rep('b',30),rep('c',30))

It isn't easy to make comparisons between the three classes because the tests have such radically different scales. We can look at the descriptive statistics to see how unhelpful the raw data is for purposes of comparison.

mean(a); mean(b); mean(c)
[1] 60.72
[1] 5
[1] 566.46

sd(a); sd(b); sd(c)
[1] 9.46
[1] 1.53
[1] 51.03

Personally I think it’s much funnier to plot the distributions.

library(ggplot2)  # used for all of the plots in this post

dat <- data.frame(group = group, score = c(a,b,c))

p <- ggplot(dat,aes(x=score,color=group))
p + geom_density(size=2) + 
	xlab('score')

[Figure: density plot of the raw scores for the three tests]

Yeah, not helpful.

One method of normalization is simply to report where each score falls between the lowest and highest possible scores; this is called non-dimensional normalization since it makes no assumptions about the data. To do this we subtract the minimum possible score and divide by the range of possible scores (the maximum minus the minimum).

aa <- (a-0)/(100-0)
bb <- (b-0)/(12-0)
cc <- (c-200)/(800-200)
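A quick check confirms that all three sets of scores now fall between 0 and 1:

range(aa); range(bb); range(cc)
[1] 0.487 0.851
[1] 0.0000000 0.6666667
[1] 0.3666667 0.7366667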

dat <- data.frame(group = group, score = c(aa,bb,cc))

p <- ggplot(dat,aes(x=score,color=group))
p + geom_density(size=2) + 
	xlab('normalized score') + 
	scale_x_continuous(limits=c(0,1))

Now that they are all on a common scale we can visually compare them.

[Figure: density plot of the normalized scores, all on a common 0 to 1 scale]

Notice that the shape of the distributions has not changed (you can see this most clearly with group a). Now we can make some genuinely meaningful comparisons.

mean(aa); mean(bb); mean(cc)
[1] 0.607
[1] 0.416
[1] 0.611
sd(aa); sd(bb); sd(cc)
[1] 0.094
[1] 0.127
[1] 0.085

The most popular form of dimensional normalization is the z-score or standard score, which rescales the data so that it shares two characteristics of the standard normal distribution: a mean of 0 and a standard deviation of 1. This is actually pretty easy to do: for each data point we subtract the mean and divide by the standard deviation.

# We make a function of our own to do it quickly.
normalize <- function(x) {
	(x-mean(x))/sd(x)
}
aa <- normalize(a)
bb <- normalize(b)
cc <- normalize(c)

This isn't as immediately intuitive as calculating a percentage but it does have some advantages. One immediate advantage is that a z-score below 0 means the score is worse than average for the group, and a z-score above 0 means it is better than average.
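For example, suppose we want to compare a student who scored 70 on test A with one who scored 6 on test B (these two scores are hypothetical, picked just for illustration):

# roughly 0.98 standard deviations above the class average
(70 - mean(a))/sd(a)
# roughly 0.65 standard deviations above the class average
(6 - mean(b))/sd(b)

Despite the very different raw numbers, the two performances are directly comparable once they are expressed in standard deviations.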

dat <- data.frame(group = group, score = c(aa,bb,cc))

p <- ggplot(dat,aes(x=score,color=group))
p + geom_density(size=2) + 
	xlab('normalized score')

[Figure: density plot of the z-scores for the three tests]

Z-scores are used to compute p-values as part of the Z-test. We will use the Z-test as a lead-in to the t-test.

Normalization is also the term used for taking a distribution and ensuring that it integrates or sums to 1, thus becoming a probability distribution. In fact this process is so important that several probability distributions are named after the function that normalizes them, like the binomial distribution (the binomial coefficients), the gamma distribution (the gamma function), and the beta distribution (the beta function).
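As a tiny sketch of that idea, here is how we might turn an arbitrary vector of non-negative weights into a probability distribution in R:

# three unnormalized weights
w <- c(2, 5, 3)
# dividing by the total forces them to sum to 1
p <- w/sum(w)
sum(p)
[1] 1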

Measures of Variability – Standard Deviation

Just a reminder of our data from before.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

The variance is one of the most important descriptors of data in statistics; in fact it has the special title of second central moment. It is, in a sense, analogous to the mean, which is the first raw moment.

The variance is an oddball in the realm of descriptive statistics because, while it is a measure of dispersion, it is not a true descriptive statistic! Here is the equation that calculates the (uncorrected) variance, followed by R code that computes it directly and then with the built-in var() function. With var(), R computes the corrected sample variance (dividing by n-1 rather than n), which is very close to the uncorrected version for a large sample and more reasonable for small samples.

\displaystyle \mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2

mean((petal - mean(petal))^2)
0.2164

var(petal)
0.2208163
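The two differ only by a factor of n/(n-1), which we can verify directly:

n <- length(petal)
# scaling the corrected sample variance back down recovers the direct calculation
var(petal) * (n - 1)/n
0.2164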

Notice the exponent in that calculation. Because of it, even though the length of the petals is measured in centimeters, the variance is in square centimeters. Clearly that is not a description of the data. It is meaningless to say that the length of the petals varies by 0.2164 square centimeters! True, higher variance means that the data is more spread out and lower variance that it's more compact, but that's about it.

So why does the variance exist at all?

Like many things the variance has filtered down from the realms of higher mathematics where its properties are regularly exploited. The history of linear regression, for instance, owes a great deal to the analytical properties of the variance. When fitting a linear model, the variance and related measures provide a way to reduce the entire process to a pair of simple equations.
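For example (a standard result, stated here without derivation), the slope and intercept of a simple linear regression can be written entirely in terms of the covariance, the variance, and the means:

\displaystyle \hat{\beta} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}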

Naturally it is somewhat less useful as a simple description.

Notice also that, like the mean to which it is related, the variance tends to give a lot of weight to outliers.

The standard deviation (also known as sigma or σ) is, by a wide margin, the most popular way of describing how spread out data is. It owes this to its pedigree as a descendant of the variance and to the variety of equations in which it is needed. In fact it is so popular that we usually say that the variance is the square of the SD. However, it is more sensible to think of it the other way around: the SD is the square root of the variance. By taking the square root of the variance we solve the issue of having the description in units that make no sense. Square centimeters become centimeters. Square years become years.

R will calculate the corrected sample standard deviation if you ask for it with sd(). Notice again that with a sample size of 50 the two values are extremely close.

sqrt(0.2164)
0.4651881

sd(petal)
0.469911

How popular is the standard deviation, you ask? It's so popular that the MAD, which we talked about in the last post, is usually scaled so that it approximates the SD in an “ideal” scenario. Specifically, if the data follows what is known as a normal distribution, the adjusted MAD and the SD will converge on the same value. We'll talk about the normal distribution in a later post.

It's not immediately clear why anyone would bother to change the MAD like that when the direct interpretation seems so clear. The reason is twofold. Firstly, it means that the MAD and SD are now comparable measures; if they are enormously different it will be because of a trait of the data, not because of how they are calculated. Secondly, there are certain equations which need the SD in order to work, and by scaling the MAD to approximate the SD it can be used in those equations (with some degree of care).
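Conveniently, R's built-in mad() function applies this scaling (a constant of roughly 1.4826) by default, so its output can be compared with sd() directly:

# scaled so that it approximates the SD for normally distributed data
mad(petal)
# the raw, unscaled median absolute deviation
mad(petal, constant = 1)
sd(petal)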

Finally let’s plot all of our measures on the density.

The plusminus() custom function we made last time is extremely convenient to use with ggplot, which will read the dataframe quite naturally. We base the position of the MAD on the median and the position of the SD on the mean.

spread <- plusminus(c(4.3, 4.35, 4.26), c(.30, .52, 0.47), c("IQR", "MAD", "SD"))

p <- ggplot(versidf, aes(x=petal))
p + 	geom_density(size=1) +
	geom_vline(data=spread,
		aes(xintercept=vals,
		linetype= group,
		color = group),
		show_guide = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8))

Let's break down that code really quickly. First we put the versidf dataframe into a ggplot object with petal length on the x-axis aesthetic. Next we start adding layers. First we have geom_density() to show us how the data is distributed, with the size increased to 1 for ease of reading. Second we have geom_vline() to create vertical lines, with the data set to read from the spread dataframe we made. The aesthetics of the lines are that they are positioned at each “val” and their color and type are defined by the “group” variable. Then we tell ggplot to show the legend for geom_vline() and make the lines size 1 so they're easier to read. The scale_color_manual() layer references the colorblind palette developed by Masataka Okabe and Kei Ito and adapted for R by Winston Chang. Lastly, in the theme() layer we make the grid lines slightly larger.

[Figure: density of the petal lengths with vertical lines marking the IQR, MAD, and SD]