Measures of Central Tendency – Efficient Estimator

Before we move on to some of the less common measures of central tendency, let’s compare the efficiency of the mean and median. We would like our description of the data to reflect not just the sample we happen to have but also the real world it was drawn from. This is important because there are few things we can truly take a census of; in practice we must rely upon samples.

How well a sample statistic reflects the population seems, in the abstract, like a difficult thing to determine, but techniques for it do exist.

A wonderfully simple R program by Ajay Shah lets us study a simple example.

http://www.mayin.org/ajayshah/KB/R/html/p1.html

Let’s go through a slightly modified version of the code line by line. All of it is collected at the end of the post for you to copy. The ggplot2 graphics package will be used to make a nice plot of the information.

This custom function does a specialized task very quickly. We make a sequence that goes from 0 to 1000, select thirty random numbers from it (without replacement), then calculate the mean and median of that sample. It is important to know that the mean and median of that sequence are both exactly 500. Since we’re taking a sample of only thirty, we’re probably not going to get a mean or median that is exactly correct.

one.simulation <- function(N=30) {
  s <- seq(0,1000,1)
  x <- sample(s,N)
  return(c(mean(x), median(x)))
}
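As a quick sanity check on the claim that the sequence has a mean and median of exactly 500, here is a minimal sketch that also draws a single sample of thirty. The seed value is an arbitrary choice for reproducibility and is not part of the original program.

```r
# The population: every whole number from 0 to 1000
s <- seq(0, 1000, 1)

# Both the mean and the median of the full sequence are exactly 500
c(mean(s), median(s))

# One sample of thirty values, drawn without replacement
set.seed(1)  # arbitrary seed, only so the draw can be repeated
x <- sample(s, 30)

# The sample mean and median will usually miss 500 by some amount
c(mean(x), median(x))
```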

The replicate function runs our function one hundred thousand times with different random samples. It creates a matrix with 2 rows and 100000 columns: the first row holds the means and the second row holds the medians.

results <- replicate(100000, one.simulation(30))
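The shape of the result can be checked directly. This sketch uses only 1000 replications so it runs instantly (the name `small` is just for illustration). replicate() binds each returned vector as a column, so each simulation is one column, with the mean in row 1 and the median in row 2.

```r
one.simulation <- function(N = 30) {
  s <- seq(0, 1000, 1)
  x <- sample(s, N)
  return(c(mean(x), median(x)))
}

# A smaller run than in the post, just to inspect the structure
small <- replicate(1000, one.simulation(30))

dim(small)       # 2 rows, 1000 columns
small[, 1:3]     # first three simulations: row 1 = means, row 2 = medians
```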

Now we use the density() function to estimate the distribution of the means and medians. The density plot should be very familiar by now. Hopefully the means and medians will tend to be close to 500.

k1 <- density(results[1,])
k2 <- density(results[2,])
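Before plotting, it can help to look at what density() hands back. The sketch below uses 10000 replications and an arbitrary seed (both are assumptions made to keep the example quick and repeatable), then finds where each estimated curve peaks.

```r
set.seed(3)  # arbitrary seed for reproducibility

one.simulation <- function(N = 30) {
  s <- seq(0, 1000, 1)
  x <- sample(s, N)
  return(c(mean(x), median(x)))
}

results <- replicate(10000, one.simulation(30))

k1 <- density(results[1, ])  # density of the sample means
k2 <- density(results[2, ])  # density of the sample medians

# Each density object holds a grid of x values and the estimated height y
length(k1$x)  # 512 grid points by default

# Location of the highest point of each curve; both should sit near 500
peak.mean   <- k1$x[which.max(k1$y)]
peak.median <- k2$x[which.max(k2$y)]
c(peak.mean, peak.median)

# A taller peak means the estimates are packed more tightly together
c(max(k1$y), max(k2$y))
```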

Ajay Shah then provides some code to plot the densities with base R. Here is the whole program; the actual functional part takes up a mere eight lines.

one.simulation <- function(N=30) {
  s <- seq(0,1000,1)
  x <- sample(s,N)
  return(c(mean(x), median(x)))
}

results <- replicate(100000, one.simulation(30))

k1 <- density(results[1,])
k2 <- density(results[2,])

xrange <- range(k1$x, k2$x)
plot(k1$x, k1$y, xlim=xrange, type="l", xlab="Estimated value", ylab="")
grid()
lines(k2$x, k2$y, col="red")
abline(v=500)  # true mean and median of the sequence
legend(x="topleft", bty="n",
       lty=c(1,1),
       col=c("black", "red"),
       legend=c("Mean", "Median"))

With the help of ggplot2 the visualization looks like this.

[Figure: density curves of the sample means and the sample medians]
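The ggplot2 code behind the figure is not included in the post. What follows is only a minimal sketch of how a similar plot could be built, not the code actually used. The data frame name, column names, seed, and the reduced number of replications are all assumptions made for illustration.

```r
set.seed(2)  # arbitrary seed for reproducibility

one.simulation <- function(N = 30) {
  s <- seq(0, 1000, 1)
  x <- sample(s, N)
  return(c(mean(x), median(x)))
}

n.rep <- 10000  # fewer replications than the post, to keep this quick
results <- replicate(n.rep, one.simulation(30))

# ggplot2 wants long-format data: one row per estimate, labelled by estimator
plot.data <- data.frame(
  estimate  = c(results[1, ], results[2, ]),
  estimator = rep(c("Mean", "Median"), each = n.rep)
)

# Only draw the plot if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(plot.data, aes(x = estimate, colour = estimator)) +
    geom_density() +
    geom_vline(xintercept = 500, linetype = "dashed") +  # the true value
    labs(x = "Estimated value", y = "Density", colour = NULL)
  print(p)
}
```

Unlike the base R version, ggplot2 computes the density curves itself inside geom_density(), so there is no need to call density() first.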

The sample means are much more tightly concentrated around 500 than the sample medians; that is what efficiency is. In this example, the mean of a sample is a better estimate of the mean of the population it comes from than the median of a sample is of the population median. This is another reason that the mean is sometimes preferred over the median despite criticisms.
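The concentration can also be put into numbers rather than read off a plot. As a rough sketch (with an arbitrary seed and 10000 replications, both assumptions), we can compare the spread of the two sets of estimates. The ratio of their variances is one common way to express relative efficiency.

```r
set.seed(42)  # arbitrary seed for reproducibility

one.simulation <- function(N = 30) {
  s <- seq(0, 1000, 1)
  x <- sample(s, N)
  return(c(mean(x), median(x)))
}

results <- replicate(10000, one.simulation(30))

# Spread of each estimator around the true value of 500
sd.mean   <- sd(results[1, ])
sd.median <- sd(results[2, ])
c(sd.mean, sd.median)

# A ratio above 1 says the median varies more from sample to sample
efficiency.ratio <- var(results[2, ]) / var(results[1, ])
efficiency.ratio
```

Keep in mind this comparison is specific to evenly spread data like our sequence; other distributions can give a different ratio.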
