Central Tendency – The Mode and Density

A very common goal in descriptive statistics is to determine the value that is typical or average. By far the three most common measures of central tendency are the mode, the median, and the mean. We call them measures of central tendency because data tends to form clumps and we would like to measure the center of that clump, where most of the data is.

In this series of posts we will use R as a teaching tool to explore central tendency and related concepts.

Because the mode is very simple, this post will use R to introduce it along with histograms and probability density.

The mode is, in common use, not actually a measure of centrality. It is simply the most common value that occurs in categorical data. In practice you can usually determine the mode by looking at a table without the need for any math. In fact R doesn’t even have a function for calculating the mode. (There is a mode() function but it does something else: it reports the storage mode of an object.)
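A quick check shows what the built-in does instead. This is only a minimal sketch with made-up vectors; the point is that mode() describes how R stores an object rather than which value occurs most often.

```r
# A made-up vector in which 2 is clearly the most frequent value
x <- c(1, 2, 2, 3)

# mode() describes the storage mode of the object, not its most common value
storage <- mode(x)
print(storage)                        # "numeric"

# The same goes for character data
print(mode(c("JPN", "USA", "JPN")))   # "character"
```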

Using the table() function it is fairly easy to identify the mode of a vector. Factors can quickly and easily be read as vectors with the as.vector() function. Here we look at the nationalities of racers from the boston data.

table(as.vector(boston$Country))

ESP GBR JPN RSA SUI USA
  1   1   3   1   2   2
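If you would rather have R pick the winner out of the table, the counts can be compared with their maximum. The sketch below uses a stand-in factor, here called country, rebuilt from the counts printed above, because the boston data itself is not reproduced in this post. The name is an assumption for illustration only.

```r
# Hypothetical stand-in for boston$Country, rebuilt from the printed counts
country <- factor(rep(c("ESP", "GBR", "JPN", "RSA", "SUI", "USA"),
                      times = c(1, 1, 3, 1, 2, 2)))

counts <- table(country)

# Keep every category whose count equals the largest count (ties included)
top <- names(counts)[counts == max(counts)]
print(top)   # "JPN"
```

Comparing against max() rather than taking the first maximum means that tied categories would all be reported.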

Visualizations are another reasonable alternative. Base R can draw a bar chart with barplot(), but only once the counts have been tabulated. The ggplot2 package has the counting built into the geom_bar() layer of its graphics.

library(ggplot2)
p <- ggplot(boston, aes(x = Country))
p + geom_bar()

Clearly Japanese racers were most common at the marathon.
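For comparison, a base R version needs the counts tabulated before plotting. This is a minimal sketch that again relies on a hypothetical stand-in factor built from the counts in the table shown earlier, not on the real boston data.

```r
# Hypothetical stand-in for boston$Country, rebuilt from the counts printed earlier
country <- factor(rep(c("ESP", "GBR", "JPN", "RSA", "SUI", "USA"),
                      times = c(1, 1, 3, 1, 2, 2)))

# barplot() draws one bar per entry of the table and returns the bar midpoints
bar_mids <- barplot(table(country), xlab = "Country", ylab = "Racers")
```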

Finally, in the rare case where you need to extract the mode from the data directly, a custom function is your best bet. This truly wonderful function was made public by Ken Williams on Stack Overflow a few years ago:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Mode(boston$Country)
[1] JPN
Levels: ESP GBR JPN RSA SUI USA

Notice that Williams’ function is called ‘Mode’, not ‘mode’. R is case sensitive, so the new function does not mask the built-in mode().
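One thing to keep in mind is that which.max() returns only the first maximum, so when two values are tied the function above reports just one of them. A possible variant, given the hypothetical name Modes here, returns every tied value. Treat it as a sketch and not as part of Williams’ answer.

```r
# Variant of Mode() that returns all values sharing the highest count
Modes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))   # how often each unique value occurs
  ux[counts == max(counts)]          # keep every value that reaches the maximum
}

print(Modes(c("a", "b", "b", "c", "c")))   # "b" "c"
print(Modes(c(3, 1, 3, 2)))                # 3
```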

Understanding why the mode is considered a measure of central tendency requires us to delve deeper into the field of statistics. It is an oft-repeated truism that the mode can only be used to describe categorical data, because other measures are better for ordinal and interval data and because the mode simply doesn’t exist for interval data. In effect, though, it can be used for any kind of data.

Thanks to the rise of personal computers and software like R it’s very easy to show concepts that are otherwise quite abstract. Let’s look at a histogram of some data. What is the modal miles per gallon of cars in the mtcars data?

hist(mtcars$mpg)

[Figure: histogram of mtcars$mpg]

Looking at the histogram we might be tempted to say that the mode is “between 15 and 20 mpg” but that would be a mistake. We can vary the number of “bins” that the histogram has and, moreover, a histogram suffers from somewhat arbitrary limitations. R likes to make bins with widths that are round numbers and which start at round numbers. Nonetheless histograms give us a good starting place.
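To see how much the answer depends on the bins, we can ask hist() for the counts without drawing anything and look up the fullest bin. The helper name modal_bin and the two sets of break points below are choices made for this sketch, not anything built into R.

```r
# Midpoint(s) of the fullest bin for a given set of break points
modal_bin <- function(x, breaks) {
  h <- hist(x, breaks = breaks, plot = FALSE)
  h$mids[h$counts == max(h$counts)]
}

# Bins of width 5 starting at a round number
round_bins   <- modal_bin(mtcars$mpg, seq(10, 35, by = 5))

# Same width, but shifted to start at 7.5
shifted_bins <- modal_bin(mtcars$mpg, seq(7.5, 37.5, by = 5))

print(round_bins)     # 17.5, the 15 to 20 bin
print(shifted_bins)   # 20, the 17.5 to 22.5 bin
```

Nothing about the data changed between the two calls, yet the “modal” bin moved, which is why a bin is a poor stand-in for the mode.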

What would be better than a histogram is something we can mathematically optimize and which would give us a single value. The probability density does this very nicely. A few decades ago it was impractical to calculate a density for anything but a pre-defined function. Computers allow the use of a process called kernel density estimation to estimate it for any arbitrary data. Let’s take a look at how R does it.

plot(density(mtcars$mpg))

[Figure: kernel density of mtcars$mpg]

Rather than many discrete bins R is showing us a continuous curve. The density is a measure of how densely packed the data points are around each value. The exact value of the density is difficult to interpret and generally unimportant.

What we see here is that the density is highest at about 18 mpg.
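For the curious, the height of the curve at a single point can be reproduced by hand. The sketch below assumes the default settings of density(), that is, a Gaussian kernel with the bandwidth chosen by bw.nrd0(). The helper kde_at is a name made up for this example. The two numbers it prints should agree closely but not exactly, since density() works on a grid.

```r
x  <- mtcars$mpg
bw <- bw.nrd0(x)   # the default bandwidth rule used by density()

# Gaussian kernel density at one point: the average of normal curves
# centred on each observation, each with standard deviation bw
kde_at <- function(x0) mean(dnorm(x0, mean = x, sd = bw))

by_hand <- kde_at(18)

# density() evaluates the estimate on a grid; interpolate to read it at 18
d      <- density(x)
from_r <- approx(d$x, d$y, xout = 18)$y

print(c(by_hand = by_hand, from_r = from_r))

# The curve is scaled so that the total area under it is roughly 1
area <- sum(d$y) * diff(d$x[1:2])
print(area)
```

The area check is the reason individual density values are hard to read on their own: only areas under the curve correspond to proportions of the data.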

Extracting the precise point of highest density is actually pretty simple. The density() function produces a bunch of information.

mpg <- density(mtcars$mpg)
str(mpg)
List of 7
 $ x        : num [1:512] 2.97 3.05 3.12 3.2 3.27 ...
 $ y        : num [1:512] 0.000114 0.000125 0.000137 0.00015 0.000164 ...
 $ bw       : num 2.48
 $ n        : int 32
 $ call     : language density.default(x = mtcars$mpg)
 $ data.name: chr "mtcars$mpg"
 $ has.na   : logi FALSE
 - attr(*, "class")= chr "density"

The x and y coordinates are what R uses to plot the image. We can use our ever useful friend, the subsetting notation, to find the value of x at which y is equal to its greatest value. We’re looking for the point of highest density… the mode.

with(mpg,x[y==max(y)])
[1] 17.90862
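An equivalent way to write this uses which.max(), which returns the position of the first maximum and so always yields a single value. The sketch also prints the spacing of the grid, because the answer can only be as fine as the 512 x values that density() evaluates by default. The exact digits may differ slightly between R versions.

```r
mpg <- density(mtcars$mpg)

# Position of the highest point on the curve, then the x value at that position
peak <- which.max(mpg$y)
density_mode <- mpg$x[peak]
print(density_mode)

# Spacing between neighbouring grid points limits the precision of the estimate
grid_step <- diff(mpg$x[1:2])
print(grid_step)
```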

Are there any cars that get a mileage of 17.90862 in the data?

No, there are not. The density is making a prediction about what we would see if we tested many millions of cars.

However we probably should not trust that value!

While probability densities are extremely useful the exact numbers are not reliable, since there are a variety of ways to estimate them, known as kernels. The Gaussian (which we’ve just seen) and Epanechnikov kernels are the most popular. They produce slightly different shapes with slightly different modes.

mpgep <- density(mtcars$mpg,kernel="epanechnikov")
with(mpgep,x[y==max(y)])
[1] 18.13383

[Figure: Epanechnikov kernel density of mtcars$mpg]

The curve is obviously different in shape, although the similarity should be clear, and we still estimate the mode as being very close to 18 mpg.

Use the ? function to call up the help page for density and try out some of the less popular kernels for calculating the density. With plot() you can check out the shape of those alternative methods.
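As a starting point for that exercise, the loop below runs through the kernel names listed on the density() help page and records where each curve peaks. It is a rough sketch that leaves the bandwidth and every other setting at its default.

```r
# Kernel names accepted by density(), as listed on its help page
kernels <- c("gaussian", "epanechnikov", "rectangular", "triangular",
             "biweight", "cosine", "optcosine")

# For each kernel, find the x value where the estimated density is highest
kernel_modes <- sapply(kernels, function(k) {
  d <- density(mtcars$mpg, kernel = k)
  d$x[which.max(d$y)]
})

print(round(kernel_modes, 2))

# Overlay two of the curves to compare their shapes
plot(density(mtcars$mpg), main = "Gaussian vs rectangular kernel")
lines(density(mtcars$mpg, kernel = "rectangular"), lty = 2)
```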

We will come back to probability densities in a later post and discuss them in more detail. For now the density is important because we will use it to compare the other measures of central tendency.

Even if the point of highest density were entirely stable there wouldn’t be much call for using it because it lacks any useful mathematical properties. True, it is strictly more “typical” than anything else but very little can be done with that. The mean and median, on top of being trivial to calculate, have very useful properties. When we look at the mean and median we’ll see what those are.