Central Tendency – Other Measures

Although there is a laundry list of measures of central tendency, most of them are fairly specialized. The trimmed mean and winsorized mean are meant to be better descriptors of the data than the mean and more efficient than the median.

The geometric and harmonic means are specialized to describe certain kinds of data.

The barycenter, centroid, geometric median, and medoid are used for describing the center point for data that has several dimensions to it.

The rivers data gives the lengths, in miles, of 141 major rivers in North America.

str(rivers)
plot(density(rivers))

Yep, the density is unimodal.

mean(rivers)
591.1844

median(rivers)
425

with(density(rivers),x[y==max(y)]) # Code we used to find the mode.
334.6666

That isn’t a good sign! All three of our measures of central tendency are giving us wildly different values.

The trimmed mean and winsorized mean are simple concepts: the idea is to remove the most extreme values from the data. With a trimmed mean we drop the largest and smallest values entirely, while with a winsorized mean we replace those values with whatever our two cutoff points were.

The interquartile mean is a popular form of trimmed or winsorized mean: we remove the top 25% and the bottom 25% of the data. Finding the quartiles is easy with the summary() function.

summary(rivers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  135.0   310.0   425.0   591.2   680.0  3710.0

rivTrim <- rivers[rivers > 310 & rivers < 680]

Try using the sort() function on the rivers data and the rivTrim data so you can see what has been trimmed from the edges. For example, the 3710 mile Mississippi river is no longer there. You can see the same thing by checking the density.
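For example, something along these lines will show the difference (and the density of the trimmed data):

head(sort(rivers)); tail(sort(rivers))      # the shortest and longest rivers are still here
head(sort(rivTrim)); tail(sort(rivTrim))    # the extremes have been dropped
plot(density(rivTrim))                      # the long right tail is much reduced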

Making the data for the winsorized mean is also easy: we take the 1st quartile and the 3rd quartile (the cutoffs we just used) and use them to fill in for the values we removed.

rivWins <- c(rivTrim, rep(310,35), rep(680,35))

Now let’s take a look at those two measures.

mean(rivTrim)
448.6087
mean(rivWins)
471.9712
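Incidentally, mean() has a built-in trim argument that does the trimming for us in one step. The result may differ slightly from mean(rivTrim) because of how the cutoff observations themselves are handled, but it is a handy shortcut.

mean(rivers, trim = 0.25)   # drop the lowest and highest 25% before averaging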

plot(density(rivers))
abline(v = c(448, 472), col = c('green','purple'))

They agree pretty well with each other, and both sit much closer to the median than the untrimmed mean does.

While the measures of central tendency for one-dimensional data have a few well agreed-upon standards, this is less true of multidimensional data. The centroid and barycenter are the only two widely agreed-upon measures, because they are clear analogues of the mean. The centroid is easy to find: it is the point where the means of the individual variables intersect.

plot(faithful)
abline(v = mean(faithful$eruptions), h = mean(faithful$waiting))
points(x = mean(faithful$eruptions), y = mean(faithful$waiting), cex=1.5, pch=19, col='red')

[Figure: scatterplot of the faithful data with the centroid marked in red]

The barycenter is the weighted average of the data and corresponds to the center of mass of a physical system.
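If each observation carries a weight, the barycenter is simply the weighted mean of every coordinate. A minimal sketch, with made-up weights purely for illustration:

w <- runif(nrow(faithful))   # hypothetical weights, one per observation
c(eruptions = weighted.mean(faithful$eruptions, w),
  waiting   = weighted.mean(faithful$waiting, w))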

The medoid and geometric median, which are both analogous to the median in two or more dimensions, are unfortunately subject to many competing definitions, none of which are quite as trivial to calculate as the centroid. Minimizing the absolute errors requires deciding how you wish to measure distance, typically either the Euclidean distance (a straight line) or the Manhattan distance (the sum of the distances along each axis). The medoid is the point from the data itself that minimizes the total distance, while the geometric median is whichever point in the space does so, whether or not it appears in the data.
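Neither one is built into base R, but a brute-force medoid is easy to sketch. This assumes Euclidean distance and is only an illustration; dedicated packages exist for the geometric median.

d <- as.matrix(dist(faithful))      # pairwise Euclidean distances between data points
faithful[which.min(rowSums(d)), ]   # the medoid: the observation with the smallest total distance
# In practice you may want to scale() the variables first, since they are on very different scales.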

The geometric mean and harmonic mean are rather specialized concepts that aren’t used heavily as descriptive statistics. Their main use is to solve certain kinds of problems efficiently without taking the time to transform the data. In a later post we will look at how to use them, as well as the concept that connects them to the arithmetic mean: the generalized mean.

Next week we’ll be covering graphics! How to output nice images! How to use the ggplot2 package!

Central Tendency – Efficient Estimators

Before we move on to some of the less common measures of central tendency, let’s compare the efficiency of the mean and median. We would like our description of the data to reflect not just the sample but also the real world it comes from. This matters because there are few things we can truly take a census of; in practice we must rely upon samples.

This seems, in the abstract, like a difficult thing to determine but techniques for it do exist.

A wonderfully simple R program by Ajay Shah lets us study a simple example.

http://www.mayin.org/ajayshah/KB/R/html/p1.html

Let’s go through a slightly modified version of the code line by line; all of it is collected at the end of the post for you to copy. The ggplot2 graphics package will be used to make a nice plot of the information.

This custom function does a specialized task very quickly: we make a sequence that goes from 0 to 1000, select thirty random numbers from it, then calculate the mean and median of that sample. It is important to know that the mean and median of the full sequence are exactly 500. Since we’re taking a sample of thirty, we’re probably not going to get a mean or median that is exactly correct.

one.simulation <- function(N=30) {
  s <- seq(0, 1000, 1)
  x <- sample(s, N)
  return(c(mean(x), median(x)))
}

The replicate function runs our function one hundred thousand times with different random samples. It creates a matrix with 2 rows and 100,000 columns; the first row holds the means and the second the medians.

results <- replicate(100000, one.simulation(30))

Now we use the density() function to estimate the distribution of the means and of the medians. The density should be very familiar by now. Hopefully the means and medians will tend to be close to 500.

k1 <- density(results[1,])
k2 <- density(results[2,])

Ajay Shah then provides some code to plot the densities with base R. Here is the whole program with the actual functional part taking up a mere eight lines.

one.simulation <- function(N=30) {
  s <- seq(0, 1000, 1)
  x <- sample(s, N)
  return(c(mean(x), median(x)))
}

results <- replicate(100000, one.simulation(30))

k1 <- density(results[1,])
k2 <- density(results[2,])

xrange <- range(k1$x, k2$x)
plot(k1$x, k1$y, xlim=xrange, type="l", xlab="Estimated value", ylab="")
grid()
lines(k2$x, k2$y, col="red")
abline(v=500)
legend(x="topleft", bty="n",
       lty=c(1,1),
       col=c("black", "red"),
       legend=c("Mean", "Median"))

With the help of ggplot2 the visualization looks like this.

[Figure: densities of the sample means and sample medians]
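If you would rather build a similar picture with ggplot2, a rough sketch (not the exact code behind the figure above) might look like this:

library(ggplot2)

# Reshape the simulation results into a long data frame for ggplot2.
df <- data.frame(estimate  = c(results[1, ], results[2, ]),
                 estimator = rep(c("Mean", "Median"), each = ncol(results)))

ggplot(df, aes(x = estimate, colour = estimator)) +
  geom_density() +
  geom_vline(xintercept = 500) +
  labs(x = "Estimated value", y = "Density")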

The distribution of the sample means is much more concentrated than that of the sample medians; that is what efficiency means. The mean of a sample is a better estimate of the population mean than the median of a sample is of the population median. This is another reason the mean is sometimes preferred over the median despite criticisms.
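A quick numeric check says the same thing: the standard deviation of each row of results measures how spread out each sampling distribution is, and the row of means should come out noticeably smaller.

apply(results, 1, sd)   # first value: sd of the sample means; second: sd of the sample medians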

Central Tendency – The Median and Mean

When people say “average” they are generally referring to what mathematicians call the arithmetic mean. They might also be talking about the mode, the median, the trimmed mean, the winsorized mean, the centroid or medoid, or (in rare cases where they are relevant) the geometric mean or harmonic mean.

In these posts we will use R as a teaching tool so that we can work with the various kinds of average hands on. We’ve already taken a look at the mode, which is the most common number in the data. Today we will take a look at the mean and median which are, by far, the most popular measures of central tendency.

What do the mean and median actually do? Oddly, few people bother to address this when discussing measures of central tendency, yet it defines exactly what these numbers really are.

The purpose of the arithmetic mean is to “balance” the data.

To find the mean we add up all of the values in the data and divide by the number of data points; it is the familiar average you probably learned in grade school.

Finding this balance point means determining where the signed errors add up to zero: the total error on either side of the mean is the same.

a <- c(13,1,8,4,10,0,5,3,10)
mean(a)
6

a-6
7 -5  2 -2  4 -6 -1 -3  4

Those numbers do indeed sum to zero but how should we interpret that fact?

Let’s say that we’re working in construction and the project will last one year. Our weekly progress ought to be judged against the mean amount of work done each week: to finish in 52 weeks we need to average about 1.9% of the job per week (100%/52), so if the mean falls below that we are not on track to finish. Some weeks might be slow and some weeks might be fast, but the mean (and no other number) is what tells us whether we’re keeping on schedule.

The mean is also popular because it minimizes the root mean squared error (RMSE). The RMSE is important for a variety of reasons in more advanced statistics but isn’t particularly relevant to describing data.
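If you want to see that claim in action on a small vector, a quick check along these lines works; the exact numbers don’t matter, only which one is smaller.

a <- c(13,1,8,4,10,0,5,3,10)
mean((a - mean(a))^2)     # mean squared error around the mean
mean((a - median(a))^2)   # mean squared error around the median (larger)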

The purpose of the median is to find the number for which half of the data is greater and half of the data is smaller. It is the “middle” number.

a <- c(13,1,8,4,10,0,5,3,10)
sort(a)
0  1  3  4  5  8 10 10 13  #We can see 5 right there in the center once we sort the data.

median(a)
5

Somewhat coincidentally, but perhaps more importantly, the median minimizes the mean absolute error, a relative of the RMSE that actually does matter when describing data.

The absolute error is the distance from a value to a given data point. The mean absolute error, then, is the mean of all the absolute errors. The number that minimizes the MAE is the number that is, on average, closest to all of the numbers in the data! That’s why the median is so effective at showing the typical value of the data.

Let’s quickly work through this with our simple data.

a <- c(13,1,8,4,10,0,5,3,10)
median(a)
5

abs(a-5)
8 4 3 1 5 5 0 2 5

The median is eight less than thirteen, four more than one, and so on. Those are the absolute errors. Consequently the MAE is as follows.

mean(abs(a-5))
3.666667

We can visualize the MAE as a function of the candidate value to see that the median is where it is minimized.
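A minimal sketch of how such a picture can be made (not necessarily the exact code behind the figure below):

a   <- c(13,1,8,4,10,0,5,3,10)
m   <- seq(0, 13, by = 0.1)                        # candidate centre values
mae <- sapply(m, function(v) mean(abs(a - v)))     # MAE at each candidate
plot(m, mae, type = "l", xlab = "Candidate value", ylab = "Mean absolute error")
abline(v = median(a), col = "blue")                # the median sits at the minimum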

[Figure: mean absolute error plotted against candidate values]

The MAE clearly shrinks as we approach the median and rises as we move away from it.

You will often see the assertion that the median is a better description of the typical value in a dataset than the mean is. While this is often true, and it certainly tracks better with how we might want to describe typicality, it is far from a panacea. Indeed, believing that the median is simply better than the mean is a dangerous trap, because attempting to describe data with a single number rarely ends well. For thousands of years we had no choice. Thanks to computers, however, calculating a better description is just as easy as calculating the mean and median. We introduced the idea of probability density in the last post: where the density is higher, the data is more tightly clustered.

Let’s take a look at the duration of eruptions in the faithful data. The visualization here is done with the ggplot2 package to mimic the output of the code provided.

plot(density(faithful$eruptions))
abline(v=mean(faithful$eruptions),col='red')
abline(v=median(faithful$eruptions),col='blue')

[Figure: density of eruption durations with the mean and median marked]

Would you say that the mean or the median is a better description of this data? We could argue about which optimization criterion matters and go in circles. The faithful data is clustered around two values (in technical terms it is bimodal because there are two peaks), and the mean and median are both terrible descriptions of it!

Reporting the density or histogram of some data is very simple, contains a great deal of information, can be supplemented by whatever summary statistics we like, and has the added benefit of being more visually engaging than numbers. It is a window into the data both for the public and researchers.

In any event, the takeaway from this post should be that the mean and median are very different things with consequently different interpretations. The mean misses the numbers above it by the same total amount as it misses the numbers below it, while the median sits in the middle of the data with half the values on either side. Neither is a complete description of the data.

In the next post we will take a look at the notion of efficiency and why it matters.

Central Tendency – The Mode and Density

A very common goal in descriptive statistics is to determine the value that is typical or average. By far the three most common measures of central tendency are the mode, the median, and the mean. We call them measures of central tendency because data tends to form clumps and we would like to measure the center of that clump, where most of the data is.

In this series of posts we will use R as a teaching tool to explore central tendency and related concepts.

Because the mode is very simple this post will use R to introduce it along with the histograms and probability density.

The mode is, in common use, not actually a measure of centrality: it looks for the most common value in categorical data. In practice you can usually determine the mode by looking at a table without the need for any math. In fact, R doesn’t even have a function for calculating it. (There is a mode() function, but it reports the storage mode of an object, not the statistical mode.)

Using the table() function it is fairly easy to identify the mode of a vector, and factors can quickly be converted to vectors with the as.vector() function. Here we look at the nationalities of racers from the boston data.

table(as.vector(boston$Country))

ESP GBR JPN RSA SUI USA 
  1   1   3   1   2   2 

Visualization is another reasonable alternative, although base R doesn’t make it especially easy. Fortunately the ggplot2 package has this functionality built into its geom_bar() layer.

p <- ggplot(boston,aes(x=Country))
p + geom_bar()

Clearly Japanese racers were most common at the marathon.

Finally in the rare case where you need to extract the mode from the data directly a custom function is your best bet. This truly wonderful function was made public by Ken Williams on stackoverflow a few years ago:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Mode(boston$Country)
[1] JPN
Levels: ESP GBR JPN RSA SUI USA

Notice that Williams’ function is called ‘Mode’ rather than ‘mode’ so that it doesn’t clash with R’s existing mode() function.

Understanding why the mode is considered a measure of central tendency requires us to delve a little deeper into statistics. It is an oft-repeated truism that the mode can only describe categorical data, because other measures are better for ordinal and interval data and because the mode simply doesn’t exist for interval data. In effect, though, it can be used for any kind of data. Thanks to the rise of personal computers and software like R it’s very easy to show concepts that are otherwise quite abstract. Let’s look at a histogram of some data. What is the modal miles per gallon of the cars in the mtcars data?

hist(mtcars$mpg)

[Figure: histogram of mtcars$mpg]

Looking at the histogram we might be tempted to say that the mode is “between 15 and 20 mpg”, but that would be a mistake. The shape changes as we vary the number of “bins” the histogram has, and a histogram suffers from somewhat arbitrary limitations: R likes to make bins whose widths are round numbers and which start at round numbers. Nonetheless, histograms give us a good starting place.
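You can see the instability for yourself by asking hist() for different numbers of bins; R treats breaks as a suggestion, so the exact bins may vary.

hist(mtcars$mpg, breaks = 5)    # coarser bins
hist(mtcars$mpg, breaks = 15)   # finer bins; the tallest bar moves around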

What would be better than a histogram is something we can mathematically optimize and which gives us a single value. The probability density does this very nicely. A few decades ago it was impractical to calculate a density for anything but a pre-defined function; computers allow a process called kernel density estimation to determine it for arbitrary data. Let’s take a look at how R does it.

plot(density(mtcars$mpg))

[Figure: kernel density estimate of mtcars$mpg]

Rather than many discrete bins R is showing us a continuous curve. The density is a measure of how densely packed the data points are around each value. The exact value of the density is difficult to interpret and generally unimportant.

What we see here is that the density is highest at about 18 mpg.

Extracting the precise point of highest density is actually pretty simple. The density() function produces a bunch of information.

mpg <- density(mtcars$mpg)
str(mpg)
List of 7
 $ x        : num [1:512] 2.97 3.05 3.12 3.2 3.27 ...
 $ y        : num [1:512] 0.000114 0.000125 0.000137 0.00015 0.000164 ...
 $ bw       : num 2.48
 $ n        : int 32
 $ call     : language density.default(x = mtcars$mpg)
 $ data.name: chr "mtcars$mpg"
 $ has.na   : logi FALSE
 - attr(*, "class")= chr "density"

The x and y coordinates are what R uses to plot the image. We can use our ever-useful friend, the subsetting notation, to find the value of x at which y reaches its greatest value. We’re looking for the point of highest density… the mode.

with(mpg,x[y==max(y)])
17.90862

Are there any cars that get a mileage of 17.90862 in the data?

No, there are not. The density is making a prediction about what we would see if we tested many millions of cars.

However we probably should not trust that value!

While probability densities are extremely useful, the exact numbers are not reliable, since there are a variety of kernels that can be used to estimate the density. The Gaussian kernel (which we’ve just seen) and the Epanechnikov kernel are the most popular. They produce slightly different shapes with slightly different modes.

mpgep <- density(mtcars$mpg,kernel="epanechnikov")
with(mpgep,x[y==max(y)])
18.13383

[Figure: Epanechnikov kernel density estimate of mtcars$mpg]

The shape is visibly different, but the similarity should be clear, and we still estimate the mode as being very close to 18 mpg.

Use the ? operator to call up the help page for density() and try out some of the less popular kernels. With plot() you can check out the shape of those alternative estimates.
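For example, the triangular kernel is one of the alternatives listed on the help page:

?density                                            # lists the available kernels
plot(density(mtcars$mpg, kernel = "triangular"))
with(density(mtcars$mpg, kernel = "triangular"), x[y == max(y)])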

We will come back to probability density in a later post and discuss them in more detail. For now the density is important because we will use it to compare the other measures of central tendency.

Even if the point of highest density were entirely stable there wouldn’t be much call for using it, because it lacks useful mathematical properties. True, it is strictly more “typical” than anything else, but very little can be done with that. The mean and median, on top of being trivial to calculate, are very functional. When we look at the mean and median we’ll see what those properties are.