Central Tendency – Other Measures

Although there is a long list of other measures of central tendency, most of them are fairly specialized. The trimmed mean and Winsorized mean are meant to be better descriptors of the data than the mean while being more efficient than the median.

The geometric and harmonic means are specialized to describe certain kinds of data.

The barycenter, centroid, geometric median, and medoid are used to describe the center point of data that has several dimensions.

The rivers data set gives the lengths, in miles, of 141 major rivers in North America.

str(rivers)
plot(density(rivers))

Yep, the density is unimodal.

mean(rivers)
591.1844

median(rivers)
425

with(density(rivers),x[y==max(y)]) # Code we used to find the mode.
334.6666

That isn’t a good sign! All three of our measures of central tendency are giving us wildly different values.

Trimmed means and Winsorized means are simple concepts. The idea is to limit the influence of the most extreme values in the data. With a trimmed mean we remove the largest and smallest values entirely, while with a Winsorized mean we replace those values with whatever our two cutoff points were.

The interquartile mean is a popular form of trimmed mean: we remove the top 25% of the data and the bottom 25% of the data. The same two cutoffs can be used for a Winsorized mean. With the summary() function, finding the quartiles is easy.

summary(rivers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  135.0   310.0   425.0   591.2   680.0  3710.0

rivTrim <- rivers[rivers > 310 & rivers < 680]

Try using the sort() function on the rivers data and the rivTrim data so you can see the data that has been trimmed from the edges. For example, the 3710 mile Mississippi River is no longer there. You can see the same thing by checking the density.
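As a rough sketch of that comparison, the values dropped by the filter can also be listed directly. The name removed is introduced here only for illustration.

```r
# Rebuild the trimmed data exactly as above.
rivTrim <- rivers[rivers > 310 & rivers < 680]

# Everything the filter dropped ('removed' is a name made up for this example).
removed <- sort(rivers[!(rivers > 310 & rivers < 680)])

head(removed)   # the shortest rivers that were cut
tail(removed)   # the longest rivers that were cut
range(rivTrim)  # what is left sits strictly between the two quartiles
```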

Making the data for the Winsorized mean is also easy. We take the 1st quartile and the 3rd quartile (which we just used) and use them to stand in for the values we trimmed off.

rivWins <- c(rivTrim, rep(310,35), rep(680,35))

Now let’s take a look at those two measures.

mean(rivTrim)
448.6087
mean(rivWins)
471.9712

plot(density(rivers))
abline(v = c(448, 472), col = c('green','purple'))

They agree pretty well with each other, and both are reasonably close to the median as well.
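The hard-coded cutoffs and counts above work for this one data set. A more general sketch, assuming the quartiles are still the cutoffs we want, might look like the following. The winsorize() helper is a name made up for this example and is not a base R function, while the trim argument of mean() is built in. These versions treat values that sit exactly on a quartile differently from the strict filter above, so the results will not necessarily match the numbers we just computed.

```r
# Built-in trimmed mean: drops the lowest 25% and highest 25% of observations.
trimmedMean <- mean(rivers, trim = 0.25)

# Hypothetical helper: clamp everything outside the chosen quantiles
# to the quantile values themselves, keeping the sample size unchanged.
winsorize <- function(x, probs = c(0.25, 0.75)) {
  cutoffs <- quantile(x, probs = probs, names = FALSE)
  pmin(pmax(x, cutoffs[1]), cutoffs[2])
}

winsorizedMean <- mean(winsorize(rivers))

trimmedMean
winsorizedMean
```

Changing trim or probs gives milder or harsher versions of the same idea, for example trim = 0.1 to drop only the outer 10% on each side.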

While the measures of central tendency for one dimensional data have a few well agreed upon standards, this is less true of multidimensional data. The centroid and barycenter are the only two well agreed upon measurements, because they are clear analogues of the mean. The centroid is easy to find: it is the point where the means of the individual variables intersect.

plot(faithful)
abline(v = mean(faithful$eruptions), h = mean(faithful$waiting))
points(x = mean(faithful$eruptions), y = mean(faithful$waiting), cex = 1.5, pch = 19, col = 'red')


The barycenter is the weighted average of the data and corresponds to the center of mass of a physical system.
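As a minimal sketch of both ideas, the centroid is just the vector of column means, and a barycenter can be computed with weighted.mean() once each observation has a weight. The faithful data has no natural weights, so the weights below are invented purely for illustration, and barycenter() is a name made up for this example.

```r
# Centroid: the mean of each variable.
centroid <- colMeans(faithful)

# Hypothetical helper: weighted mean of each variable, using one weight per row.
barycenter <- function(data, w) {
  sapply(data, weighted.mean, w = w)
}

# Made-up weights (later observations count more), only to show the mechanics.
w <- seq_len(nrow(faithful))

centroid
barycenter(faithful, w)
```

With equal weights the barycenter and the centroid are the same point.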

The medoid and geometric median, which are both analogous to the median in two or more dimensions, are unfortunately subject to many competing definitions, none of which is quite as trivial to calculate as the centroid. Just as the median minimizes the sum of absolute errors in one dimension, these measures minimize the total distance to the data points. That requires deciding how you wish to measure distance: typically either the Euclidean distance (a straight line) or the Manhattan distance (the sum of the absolute differences along each axis). The medoid is defined as the point from the data that accomplishes this best, while the geometric median is simply whatever point does it best, whether or not it appears in the data.
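Here is one possible sketch under the Euclidean definition, using the faithful data. It finds the medoid from the pairwise distance matrix and approximates the geometric median numerically with optim(). This is only one of the competing approaches, the helper name sumDist is made up for the example, and no rescaling is done, so the waiting variable (with its much larger numbers) dominates the distances.

```r
X <- as.matrix(faithful)

# Medoid: the observed point with the smallest total distance to all others.
D <- as.matrix(dist(X, method = "euclidean"))
medoid <- X[which.min(rowSums(D)), ]

# Total Euclidean distance from an arbitrary point p to every observation.
sumDist <- function(p) {
  sum(sqrt(rowSums(sweep(X, 2, p)^2)))
}

# Geometric median: minimize that total over all points, starting at the centroid.
geoMedian <- optim(colMeans(X), sumDist)$par

medoid
geoMedian
```

Passing method = "manhattan" to dist(), and swapping the distance inside sumDist to match, would give the Manhattan versions instead.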

The geometric mean and harmonic mean are rather specialized concepts that aren’t used heavily as descriptive statistics. Their main uses are for efficiently solving certain kinds of problems rather than taking the time to transform data. In a later post we will look at how to use them as well as the concept that connects them to the arithmetic mean, the generalized mean.
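Base R has no dedicated functions for either one, so as a quick preview here is a minimal sketch of the usual definitions. The names geoMean and harmMean are made up for this example, and both assume strictly positive data.

```r
# Geometric mean: average on the log scale, then transform back.
geoMean <- function(x) exp(mean(log(x)))

# Harmonic mean: reciprocal of the average reciprocal.
harmMean <- function(x) 1 / mean(1 / x)

x <- c(1, 2, 4)
mean(x)      # arithmetic mean
geoMean(x)   # geometric mean
harmMean(x)  # harmonic mean
```

For positive data the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean.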

Next week we’ll be covering graphics! How to output nice images! How to use the ggplot2 package!
