Central Tendency – The Median and Mean

When people say “average” they are generally referring to what mathematicians call the arithmetic mean. They might also be talking about the mode, the median, a trimmed mean, the winsorized mean, the centroid or medoid, or (in the rare cases where they are relevant) the geometric mean or harmonic mean.

In these posts we will use R as a teaching tool so that we can work with the various kinds of average hands-on. We’ve already taken a look at the mode, which is the most common number in the data. Today we will take a look at the mean and median, which are by far the most popular measures of central tendency.

What do the mean and median do? Oddly, few people bother to address this when discussing the measures of central tendency. It’s important, because it defines exactly what they really are.

The purpose of the arithmetic mean is to “balance” the data.

To find the mean we add up all of the values of the data and divide by the number of data points. It is the familiar average you probably learned in grade school.

Finding this point means determining where the errors (the differences between each data point and the mean) add up to zero. The total error on either side of the mean is the same.

a <- c(13,1,8,4,10,0,5,3,10)
mean(a)
6

a-6
7 -5  2 -2  4 -6 -1 -3  4

Those numbers do indeed sum to zero but how should we interpret that fact?

Let’s say that we’re working in construction and the project will last one year. Our weekly progress ought to be judged by the mean amount of work done each week. If the mean falls below about 1.9% a week (100% spread over 52 weeks) we are not on track to finish. Some weeks might be slow and some weeks might be fast, but the mean (and no other number) is what tells us if we’re keeping on schedule.
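The scheduling arithmetic can be checked in a few lines. This is only a sketch: the weekly progress figures below are made up for illustration and are not real project data.

```r
# A one-year project has to average 100/52 percent of the work per week
required <- 100 / 52
required

# Hypothetical progress for the first eight weeks, in percent of the project
weekly <- c(2.5, 1.0, 3.0, 1.5, 2.0, 0.5, 2.8, 2.2)

# Individual weeks fall above and below the target; only the mean matters
mean(weekly)
mean(weekly) >= required   # TRUE means we are keeping on schedule
```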

The mean is also popular because it minimizes the root mean squared error (RMSE). The RMSE is important for a variety of reasons in more advanced statistics, but it isn’t particularly relevant to the description of data.
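For the curious, the claim about the RMSE can be checked numerically. The sketch below uses the same small data as the rest of the post; the helper name rmse is our own, and optimize() is base R's one-dimensional minimizer, so its answer is approximate rather than exact.

```r
# Same illustrative data used throughout the post
a <- c(13, 1, 8, 4, 10, 0, 5, 3, 10)

# Hypothetical helper: RMSE of the data around a candidate value m
rmse <- function(m, x) sqrt(mean((x - m)^2))

# Search the range of the data for the value with the smallest RMSE
best <- optimize(rmse, interval = range(a), x = a)

best$minimum   # numerically very close to the mean
mean(a)
```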

The purpose of the median is to find the number for which half of the data is greater and half of the data is smaller. It is the “middle” number.

a <- c(13,1,8,4,10,0,5,3,10)
sort(a)
0  1  3  4  5  8 10 10 13  #We can see 5 right there in the center once we sort the data.

median(a)
5

Somewhat coincidentally, but perhaps more importantly, the median minimizes the mean absolute error, a relative of the RMSE that actually does matter when describing data.

The absolute error is the distance between a given value and a single data point, ignoring direction. The mean absolute error, then, is the mean of all the absolute errors. The number that minimizes the MAE is the number that is, on average, closest to all of the numbers in the data! That’s why the median is so effective at showing the typical value of the data.

Let’s quickly work through this with our simple data.

a <- c(13,1,8,4,10,0,5,3,10)
median(a)
5

abs(a-5)
8 4 3 1 5 5 0 2 5

The median is eight less than thirteen, four more than one, and so on. Those are the absolute errors. Consequently the MAE is as follows.

mean(abs(a-5))
3.666667

If we plot the MAE for every candidate value across the range of the data, it is easy to see that the median minimizes it.

[Figure: the mean absolute error plotted against candidate values, with its minimum at the median]

The MAE clearly shrinks as we approach the median and rises as we move away from it.
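A curve like the one described can be drawn with base R. This is a minimal sketch rather than the code behind the original figure: the helper name mae and the grid spacing of 0.1 are our own choices.

```r
a <- c(13, 1, 8, 4, 10, 0, 5, 3, 10)

# Hypothetical helper: mean absolute error of the data around a candidate value m
mae <- function(m, x) mean(abs(x - m))

# Evaluate the MAE on a grid of candidate values spanning the data
candidates <- seq(min(a), max(a), by = 0.1)
mae_values <- sapply(candidates, function(m) mae(m, a))

plot(candidates, mae_values, type = "l",
     xlab = "candidate value", ylab = "mean absolute error")
abline(v = median(a), col = "blue")   # mark the median on the curve

# The grid point with the smallest MAE
candidates[which.min(mae_values)]
```

The last line picks out the lowest point of the curve, which can be compared directly with median(a).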

You will often see the assertion that the median is a better description of the typical value in a dataset than the mean is. While this is often true (it certainly tracks better with how we might want to describe typicality), it is far from a panacea. Indeed, believing that the median is simply better than the mean is a dangerous trap, because attempting to describe data in terms of a single number rarely ends well. For thousands of years we had no choice. Thanks to computers, however, calculating a better description is just as easy as calculating the mean and median.

We introduced the idea of probability density in the last post. Where the density is higher the data is more tightly clustered.

Let’s take a look at the duration of eruptions in the faithful data that comes with R. The figure shown here was drawn with the ggplot2 package to mimic the output of the base R code provided.

plot(density(faithful$eruptions))
abline(v=mean(faithful$eruptions),col='red')
abline(v=median(faithful$eruptions),col='blue')

[Figure: density of faithful$eruptions with the mean marked in red and the median in blue]

Would you say that the mean or the median is a better description of this data? Obviously we could talk about optimization as much as we want and go in circles. The faithful data is clustered around two values (in technical terms we call it bimodal because there are two peaks). The mean and median are both terrible descriptions of this data!
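The two peaks can also be located numerically. The following is a rough sketch that assumes R's default bandwidth for density(); a different bandwidth could merge or split peaks, so treat the result as a description of this particular estimate rather than of the data itself.

```r
# Kernel density estimate with R's default bandwidth
d <- density(faithful$eruptions)

# Local maxima: points where the slope of the estimated density
# changes from rising to falling
is_peak <- diff(sign(diff(d$y))) == -2
peaks <- d$x[which(is_peak) + 1]

peaks                        # locations of the peaks
mean(faithful$eruptions)     # compare: where does the mean fall?
median(faithful$eruptions)   # and the median?
```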

Reporting the density or histogram of some data is very simple, contains a great deal of information, can be supplemented by whatever summary statistics we like, and has the added benefit of being more visually engaging than numbers. It is a window into the data both for the public and researchers.
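As a sketch of what such a report might look like in base R, the code below draws a histogram with the density curve on top and then adds the mean and median as supplementary information. The number of breaks, the title, and the axis label are arbitrary choices of ours.

```r
x <- faithful$eruptions

# Histogram on the density scale so a density curve can be overlaid
hist(x, breaks = 20, freq = FALSE,
     main = "Old Faithful eruption durations", xlab = "duration (minutes)")
lines(density(x))

# Supplement the picture with whatever summary statistics we like
abline(v = mean(x), col = "red")
abline(v = median(x), col = "blue")
summary(x)
```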

In any event, the takeaway from this post should be that the mean and median are very different things with correspondingly different interpretations. The mean misses the numbers above it by the same total amount as the numbers below it, while the median is the middle of the data, with half of the values on either side. Neither is a complete description of the data.

In the next post we will take a look at the notion of efficiency and why it matters.
