Standard Error

We’ve used the standard error several times on this blog without going into much detail about it. In simplest possible terms the SE is the range within which we expect a value (typically the mean) to be. Specifically it is the standard deviation of the sample means.

This requires a bit of abstract thought. If you take many samples from a population you will not always get the same mean value. After taking many samples you’ll have a distribution of those means. We are attempting to describe the characteristics of something one step more abstract than the mean. To see a more concrete example refer to the post of efficient estimators.

When we say “the average weight of a baseball player is 200 pounds” we mean that in our sample the average weight is 200 pounds and that we believe this is close to the true average of the population. With the standard error, confidence interval, or credible interval we quantify how close we expect it to be.

Using our baseball player data from a while back, let’s look at how the standard error works in an ideal situation. We know the true mean of the population so let’s take a bunch of random samples and see how often we capture the true value of the mean.

For convenience make the weight its own object:

weight <- tab$weight

Here is the equation that defines the standard error, first in common notation and then as a custom R function.

\displaystyle se = \frac{s}{\sqrt{n}} = \frac{s^2}{n}
. . . where s is the sample standard deviation (s2 is the sample variance) and n is the sample size.

se <- function(x) {sd(x)/sqrt(length(x))}

Now we use replicate() and sample() to take 1000 samples of 100.

r <- 1000
size <- 100
samps <- replicate(r,sample(weight,size))

Using the apply() function we get the mean and standard error of each sample.

means <- apply(samps,2,mean)
ses <- apply(samps,2,se)

se.h <- means+ses
se.l <- means-ses

Add up all of the instances where the top of the interval is less than the mean or the bottom of the interval is more than the mean. These are the instances where the standard error has missed the true value.

a <- sum(length(se.h[se.h < 201.6893]),length(se.l[se.l > 201.6893]))
1-a/r
 0.709

That’s actually better than we should expect. Keep in mind that the calculation of the standard error relies on us assuming a normal distribution for the population.

So where does the this equation come from?

As you saw before we calculate the standard error as the standard deviation divided by the square root of the sample size or, perhaps more intuitively, the variance divided by the sample size.

We do this because of the Bienayme formula which provides that the sum of the variance of several variables is the same as the variance of the sums of those variables (ie you draw 10 numbers from two populations and add together each pair).

Consider the equation that defines the mean for some data X:
\displaystyle \sum_{i=1}^n x_i/n

Now imagine a sample of exactly one value. The mean is equal to that value (itself divided by 1) and the variance of those means is exactly the population variance. When we draw a large number of values we can treat our sample as many samples of one (which it is), so then the variance of the sum is the same as the variance of the sample (which is approximately equal to the variance of the population).

This is not a rigorous mathematical proof, of course, the intention is to understand the intuition of why the standard error is the way it is. Personally, I find it comforting to know that the standard error is not produced out of thin air.

Reporting Mean and SE

Did you know you can put ggplot2 code inside of a function? Even though the grammar is different it is still R code

Before you try it out make a script with the mean.se() function we made last time and save it with the rest of your R code under the name “meanse.R” so we can reference it whenever we want.

Let’s make a function that creates the kind of graphics we’ve been looking at this week. A violin plot and a shaded boxplot are good. I think a dotplot is also worth including, we’ll tell ggplot2 to zoom out to the range of the data so we can see the variability in a reasonable way. As great as graphics are it always responsible to report the exact numbers as well so we want the function to output the summary stats that we’re using as well.

Since this post is all about the function we’ll go through it in pieces then gather it all together at the end.

Our first line defines the function. It will be named report() and show the use where to put the variable column, the grouping column, and to define the type of image. Finally the use can decide what interval size they want.

report <- function(variable, groups, type="violin", int=1) {

To save a lot of typing through the rest of the function I rename variable and group to a and b.

a <- variable
b <- groups

The require() function is like the library() function but it checks if ggplot2 is already loaded. Saves a bit of time when running the function. Meanwhile the source() function will read the script we saved elsewhere.

require(ggplot2)
source("meanse.R")

Now we make two dataframes for ggplot2 to use later on. First we load the variable and the groups into their own dataframe then we use mean.se to extract the mean and standard error for us.

dat <- data.frame(a, b)
datse <- mean.se(a, b, int = int)

This is just a polite line of code that reminds the user they’re showing something other than one SE.

if(int != 1) {
	warning(paste("Showing and reporting",int,"standard errors"), .call = FALSE)
}

Like we’ve done in previous posts we’ll build a ggplot object that contains the base information we want. This way we don’t have to rewrite thd information.

p <- ggplot(data = dat, aes(x=b, y=a)) + 
	theme_bw() + 
	theme(panel.grid.major = element_line(size = 1)) +
	labs(x="Group", y="Variable")

Here we have a very ugly block of code that creates, but doesn’t print, a violin plot. We only want the object created so we can extract the range of the plot. Later on this will be used to help make the dotplot look much neater.

q <- p + geom_violin()
ran <- ggplot_build(q)$panel$ranges[[1]]$y.range

These three if() statements will run depending on what the user sets the ‘type’ argument to. Both the violin plot and boxplot should be familiar. For the dot plot we use the datse dataframe we made to define errorbars and a large dot at the mean.

if(type=="violin") {
	p <- p + geom_violin(size=1) +
		geom_pointrange(data = datse,
			aes(x=group, y=m, ymax=m+se, ymin=m-se),
			size=1)
}
	if(type=="box") {
	p <- p + geom_boxplot(size=1) +
		geom_crossbar(data = datse, 
			aes(x=group, y=m, ymin=m-se, ymax=m+se), 
			color=NA, fill="grey", alpha=.5, width=.75)
}
if(type=="dot") {
	p <- p + geom_errorbar(data = datse,
			aes(x=group, y=m, ymax=m+se, ymin=m-se),
			size=1, width=.3) +
		geom_point(data = datse,
			aes(x=group, y=m),
			size=4) +
		coord_cartesian(ylim=range(ran))
	}

Finally we get to print our graphics. Hooray! We used print() once before and recall that it forces the function to output something without ending the function.

print(p)

The last coding line of he function will be returned as the official output.

datse

Don’t forget to close the function or everything will go wrong.

}

Here’s the whole thing:

report &lt;- function(variable, groups, type=&quot;dot&quot;, int=1) {
	a &lt;- variable
	b &lt;- groups

	require(ggplot2)
	source(&quot;meanse.R&quot;)

	dat &lt;- data.frame(a,b)
	datse &lt;- mean.se(a,b,int=int)

	if(int != 1) {
		warning(paste(&quot;Showing and reporting&quot;,int,&quot;standard errors&quot;), .call=FALSE)
	}
	
	p &lt;- ggplot(data = dat, aes(x=b,y=a)) + 
			theme_bw() + 
			theme(panel.grid.major = element_line(size = 1)) +
			labs(x=&quot;Group&quot;,y=&quot;Variable&quot;)

	q &lt;- p + geom_violin()
	ran &lt;- ggplot_build(q)$panel$ranges[[1]]$y.range

	if(type==&quot;violin&quot;) {
		p &lt;- p + 
			geom_violin(size=1) +
			geom_pointrange(data = datse,
				aes(x=group,y=m,ymax=m+se,ymin=m-se),
				size=1)
	}

	if(type==&quot;box&quot;) {
	p &lt;- p + 
		geom_boxplot(size=1) +
		geom_crossbar(data = datse, 
			aes(x=group,y=m,ymin=m-se,ymax=m+se), 
			color=NA, fill=&quot;grey&quot;, alpha=.5, width=.75)
	}

	if(type==&quot;dot&quot;) {
		p &lt;- p + 
			geom_errorbar(data = datse,
				aes(x=group,y=m,ymax=m+se,ymin=m-se),
				size=1, width=.3) +
			geom_point(data = datse,
				aes(x=group,y=m),
				size=2.5) +
			coord_cartesian(ylim=range(ran))
	}
print(p)
datse
}

You can save this as its own script so you can source() it later.

Let's try it out for ourselves with the <em>iris</em> data. Let's make boxplot of the Sepal Widths and show the 95% confidence interval of the mean (two standard errors).


with(iris,report(Sepal.Width,Species,type='box',int=2))

irisbox