Summary – Descriptive Statistics

Posted on June 9, 2014 by alex1618

There are many more descriptive statistics than we’ve discussed so far with the measures of central tendency and the measures of spread but the most important and most common techniques have been covered. In the last post we acquired some data from the internet that I know very little about. Let’s use what we’ve learned in order to get an understanding of it.

The best place to start is with the str() and summary() functions. We used str() last time to get an idea of what data we have. It describes 1034 baseball players by name, team, position, height, weight, and age.

summary(tab)
     name                team                 position       height    
 Length:1034        NYM    : 38   Relief_Pitcher  :315   Min.   :67.0  
 Class :character   ATL    : 37   Starting_Pitcher:221   1st Qu.:72.0  
 Mode  :character   DET    : 37   Outfielder      :194   Median :74.0  
                    OAK    : 37   Catcher         : 76   Mean   :73.7  
                    BOS    : 36   Second_Baseman  : 58   3rd Qu.:75.0  
                    CHC    : 36   First_Baseman   : 55   Max.   :83.0  
                    (Other):813   (Other)         :115                 
     weight           age       
 Min.   :150.0   Min.   :20.90  
 1st Qu.:187.0   1st Qu.:25.44  
 Median :200.0   Median :27.93  
 Mean   :201.7   Mean   :28.74  
 3rd Qu.:215.0   3rd Qu.:31.23  
 Max.   :290.0   Max.   :48.52  
 NA's   :1

The most important things we see here are that:
a) The teams are roughly equally represented
b) The positions are not equally represented
c) Height, weight, and age have their mean and median close together, which is usually a good sign.
d) There is one missing value for weight!

That last one is especially important since R will return NA, rather than a number, if it encounters an NA while doing math. We can tell R to remove NAs if we want.

sd(tab$weight)
NA
sd(tab$weight, na.rm=TRUE)
20.99149

Since there is only a single missing value we should probably just drop that line rather than include na.rm=TRUE over and over again. The is.na() function will find where our missing data is. We want the subset of tab where it is FALSE that data is missing.

tab <- tab[!is.na(tab$weight), ]
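
The effect of a missing value is easy to reproduce on a toy example. The sketch below uses a made-up data frame called toy (a hypothetical stand-in, not the baseball data) to show what sd() hands back when a value is missing and how negating is.na() drops the incomplete row.

```r
# Hypothetical three-row data frame standing in for tab
toy <- data.frame(name = c("A", "B", "C"),
                  weight = c(180, NA, 210))

sd(toy$weight)                 # NA, because one weight is missing
sd(toy$weight, na.rm = TRUE)   # computed from the two observed weights

# Keep only the rows where weight is present
toy_clean <- toy[!is.na(toy$weight), ]
nrow(toy_clean)
```

Dropping the row once means every later call works without na.rm=TRUE, at the cost of losing that player's other measurements too.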

The summary() function already showed us the mean, median, and quartiles of the players’ weights. Let’s look at the MAD and the SD.

sd(tab$weight, na.rm=TRUE)
 20.99149

mad(tab$weight, na.rm=TRUE)
 22.239

How would we, in colloquial terms, describe the weights of the players? Probably something like “According to this sample baseball players tend to weigh around 200 pounds, give or take twenty pounds”. We lose a lot of precision here but it’s much easier to understand.

If our goal is to report the data, however, we need to do the best job we can. A traditional option would be along the lines of “This sample of baseball players (N=1033, one missing value excluded) had a mean weight of 201.7 pounds (SD = 21)”. A better option, if we have the space for it, is a visualization. Since weight is continuous a density plot is appropriate.

How about if we divide things up by position? A boxplot or violin plot is a good way to look at the information that way. A boxplot gives us the most information: the median, the IQR, and any outliers.

There’s just one problem with using a boxplot (or even a violin plot) here. The positions don’t all have the same number of players.

with(tab,tapply(position,position,length))
          Catcher Designated_Hitter     First_Baseman        Outfielder 
               76                18                55               194 
   Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher 
              315                58                52               221 
    Third_Baseman 
               45
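
As a side note, base R's table() produces the same kind of count with less typing. This is only a sketch on a made-up vector; pos is a hypothetical stand-in for tab$position.

```r
# Hypothetical positions standing in for tab$position
pos <- c("Catcher", "Shortstop", "Catcher", "Outfielder", "Catcher")

counts <- table(pos)
counts

# Largest groups first
sort(counts, decreasing = TRUE)
```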

There were only 18 Designated Hitters and there were 315 Relief Pitchers. Most plots will disguise that fact. A dotplot won’t, although it lacks much of the information that other methods offer.

Ah, now we see something truly strange about this data. There are striations. Look at how a few values have far more data points than the ones around them. If you look even more closely it’s visible that these striations happen on multiples of five.

What’s going on? That’s not random! Multiples of five are not important to nature. This is obviously an effect created by human beings. Yet we do have data points in between.

If you track down where this data came from you’ll find that it’s compiled from more than one source. Evidently some of these sources recorded weights only as multiples of five and others as any unit value.
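
The striations can also be checked numerically rather than by eye. The sketch below uses a made-up vector w (hypothetical values, not the real weights) and the modulo operator to measure what share of the values are exact multiples of five.

```r
# Made-up weights: some sources round to the nearest five, others do not
w <- c(180, 185, 190, 200, 200, 205, 210, 187, 193, 201, 215, 220)

# Share of values that are exact multiples of five
share5 <- mean(w %% 5 == 0)
share5

# Count of values by remainder; unrounded integers would spread roughly evenly
table(w %% 5)
```

If whole-number weights were recorded without rounding we would expect roughly one value in five to land on a multiple of five, so a share far above that points to rounding in at least one source.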

Next time we’ll take a look at some of the other data we have in the dataframe and examine the data across multiple variables.

Here we see the code to make the images in this post with Hadley Wickham’s ggplot2 package.

library(ggplot2)
p <- ggplot(tab)
p + geom_density(aes(x=weight), size=1) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8))

p + geom_boxplot(aes(x=position,y=weight)) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8),
		axis.text.x = element_text(angle=45,vjust=.6))

p + geom_dotplot(aes(x='identity',y=weight),binaxis="y",stackdir="center",binwidth=1) +
	theme(panel.grid.major = element_line(size = 1),legend.position="none") +
	scale_y_continuous(minor_breaks=NULL)

Measures of Variability – Highest Density Interval

Posted on June 5, 2014 by alex1618

All these previous measures of spread have one thing in common: They can easily be calculated by hand. Sure, it might be a tedious process for a large amount of data but that’s what graduate students are for. For essentially all of the history of mathematics the sole requirement of descriptive statistics was that a person be able to do the work with a pencil and paper.

This is no longer true, however. Computers can solve problems in seconds that would otherwise take months. We saw this before with probability density.

Just a reminder of our data from before.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

In fact using the density directly is what provides us with our final measure of spread, the highest density interval (or highest density region). The package hdrcde by Rob Hyndman provides us with this functionality.

install.packages("hdrcde")
library(hdrcde)

Let’s just try with the default settings.

hdr(petal)

$hdr
        [,1]     [,2]
99% 3.160326 5.253373
95% 3.377735 5.143393
50% 4.059244 4.700000

$mode
[1] 4.406846

$falpha
       1%        5%       50% 
0.1222823 0.1965873 0.6195948

That’s a bunch of information. The hdrs are the edges of several different highest density intervals; the 99% interval goes from 3.16 to 5.25, for example. You should recognize the concept of the mode for this sort of data: it’s the point where the density is highest. The information recorded as falpha is the density at the edges of each interval.

There are a few ways we can plot this information. This command automatically creates a plot using the hdrcde package. The hdr.den() function outputs both an image and that information we saw before.

hdr.den(petal)

You might notice something about that image: the density isn’t quite the same shape as the one we’ve seen before! It is a bit smoother. This occurs because Hyndman’s method of calculation is slightly different from the one R normally uses. The results are similar, as a comparison will show, but Hyndman’s method tends to produce a smoother estimate than the density() function does.

What the heck is a highest density region anyway?

Good question. The highest density region, as Hyndman uses the term and as I prefer to use it, is “the interval that contains a certain percentage of the data and for which all points in the region have higher density than those outside the region”. So for the 50% HDR half of the density is within the interval. Unlike the interquartile range, though, this doesn’t necessarily mean that half of the data points are inside it.
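
To make the definition concrete, here is a rough sketch that finds a 50% region directly from the output of density(). It is not Hyndman's algorithm and uses R's default bandwidth, so the numbers will not match hdr() exactly. It also assumes the density has a single peak, so that the region is one interval. The names step, ord, mass, keep, and hdr50 are made up for this illustration.

```r
petal <- iris$Petal.Length[iris$Species == "versicolor"]
d <- density(petal, n = 2048)

# Width of one grid cell, so that density * step approximates probability mass
step <- diff(d$x)[1]

# Sort grid points from highest to lowest density and accumulate their mass
ord <- order(d$y, decreasing = TRUE)
mass <- cumsum(d$y[ord]) * step

# Keep the highest-density grid points until they hold 50% of the mass
keep <- ord[seq_len(which(mass >= 0.5)[1])]
hdr50 <- range(d$x[keep])
hdr50
```

Every grid point that was kept has a higher density than every point that was left out, which is exactly the property the definition asks for.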

How do we decide which HDR to use?

Also a good question. In inferential statistics, particularly Bayesian inference, the preferred interval is 95% because it sets a high standard without being useless. For a description, however, this isn’t what we want. The two realistic options are the 50% and 63.2%, which are analogous to the IQR and the standard deviation.

Let’s check them out.

hdr.den(petal, prob=c(50,63.2))

Let’s check out how these compare to our previous measures. It’s going to take a bit of work to make everything come out nice.

# Find the bandwidth that Hyndman uses
hdrbw(petal, HDRlevel=63.2)
0.2554508

# Calculate the density using the bandwidth
d <- density(petal, bw=.255)

# Get the HDR information
h <- hdr(petal, prob = c(50,63.2))
h$hdr
         [,1]     [,2]
63.2% 4.00000 4.752259
50%   4.07637 4.700000

# Make two dataframes with the names and limits
spread1 <- data.frame(vals = c(4.076,4.7,4,4.6), group = c("50","50","IQR","IQR"))
spread2 <- data.frame(vals = c(4,4.752,3.79,4.73), group = c("63.2","63.2","SD","SD"))

# Make a blank ggplot object to work with
p <- ggplot()

# Add layers to the first plot
p + geom_line(aes(x=d$x, y=d$y)) +
	geom_vline(data=spread1, 
		aes(xintercept=vals, 
		color = group),
		show_guide = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8)) +
	labs(x="Petal Length", y="Density", title="IQR and 50% HDR")

# Open a new window to plot in
dev.new()

# Add layers to the second plot
p + geom_line(aes(x=d$x, y=d$y)) +
	geom_vline(data=spread2, 
		aes(xintercept=vals, 
		color = group),
		show_guide = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8)) +
	labs(x="Petal Length", y="Density", title="SD and 63.2% HDR")

Measures of Variability – Standard Deviation

Posted on June 2, 2014 by alex1618

Just a reminder of our data from before.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

The variance is one of the most important descriptors of data in statistics. In fact it has the special title of second central moment. It is, in a sense, analogous to the mean, which is the first raw moment.

The variance is an oddball in the realm of descriptive statistics because, while it is a measure of dispersion, it is not a true descriptive statistic! Here’s the equation that calculates the variance and the function that does it. With the var() function R will determine the corrected sample variance (dividing by n-1 rather than n), which is very similar to the exact variance for a large sample and more reasonable for small samples.

mean((petal - mean(petal))^2)
0.2164

var(petal)
0.2208163

Notice that exponent sign in there. Because of that even though the length of the petals is measured in centimeters the variance is in square centimeters. Clearly that is not a description of the data. It is meaningless to say that the length of the petals varies by 0.2164 square centimeters! True, higher variance means that the data is more spread out and lower variance that it’s more compact but that’s about it.
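
The gap between the two numbers above comes entirely from the divisor. A quick sketch, using the same versicolor petal lengths, shows that the corrected sample variance is the uncorrected one multiplied by n/(n-1). The names pop_var and samp_var are made up for this illustration.

```r
petal <- iris$Petal.Length[iris$Species == "versicolor"]
n <- length(petal)

pop_var <- mean((petal - mean(petal))^2)  # divides by n
samp_var <- var(petal)                    # divides by n - 1

# The two differ only by the factor n / (n - 1)
all.equal(samp_var, pop_var * n / (n - 1))
```

With n = 50 that factor is 50/49, which is why the two values are so close.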

So why does the variance exist at all?

Like many things the variance has filtered down from the realms of higher mathematics where its properties are regularly exploited. The history of linear regression, for instance, owes a great deal to the analytical properties of the variance. When evaluating a linear model the variance and related measures provide a method to reduce the entire process to a pair of simple equations.

Naturally it is somewhat less useful as a simple description.

Notice also that, like the mean to which it is related, the variance tends to give a lot of weight to outliers.

The standard deviation (also known as sigma or σ) is, by a wide margin, the most popular way of describing how spread out data is. It owes this to its pedigree as a descendant of the variance and to a variety of equations in which it is needed. In fact it is so popular that we usually say that the variance is the square of the SD. However it is more sensible to think of it the other way around, the SD is the square root of the variance. By taking the square root of the variance we solve the issue of having the description in units that make no sense. Square centimeters become centimeters. Square years become years.

If asked, R will calculate the corrected sample standard deviation. Notice again that with a sample size of 50 the values are extremely close.

sqrt(0.2164)
0.4651881

sd(petal)
0.469911

How popular is the standard deviation, you ask? It’s so popular that the MAD, which we talked about in the last post, is usually scaled so that it approximates the SD in an “ideal” scenario. Specifically if the data follows what is known as a normal distribution then the adjusted MAD and the SD will, for a large enough sample, be nearly identical. We’ll talk about the normal distribution in a later post.

It’s not immediately clear why anyone would bother to change the MAD like that when the direct interpretation seems so clear. The reason is twofold. Firstly it means that MAD and SD are now comparable measures: if they are enormously different it will be because of a trait of the data, not because of how they are calculated. Secondly there are certain equations which need the SD in order to work and, by scaling the MAD to approximate the SD, it can be used in those equations (with some degree of care).
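
A short sketch can show where the scaling comes from. The default constant in mad() is about 1.4826, which is 1/qnorm(0.75). The simulated vector z below is made-up normal data used only to illustrate the agreement between the scaled MAD and the SD; the seed is arbitrary.

```r
# The default scaling constant used by mad()
1 / qnorm(0.75)          # about 1.4826

petal <- iris$Petal.Length[iris$Species == "versicolor"]
strict_mad <- mad(petal, constant = 1)
scaled_mad <- mad(petal)
scaled_mad / strict_mad  # the scaling constant again

# On simulated normal data the scaled MAD and the SD land close together
set.seed(1)
z <- rnorm(10000, mean = 0, sd = 2)
c(sd = sd(z), mad = mad(z))
```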

Finally let’s plot all of our measures on the density.

The plusminus() custom function we made last time is extremely convenient to use with ggplot, which will read the dataframe quite naturally. We base the position of the MAD on the median and the position of the SD on the mean.

spread <- plusminus(c(4.3, 4.35, 4.26), c(.30, .52, 0.47), c("IQR", "MAD", "SD"))

p <- ggplot(versidf, aes(x=petal))
p + 	geom_density(size=1) +
	geom_vline(data=spread,
		aes(xintercept=vals,
		linetype= group,
		color = group),
		show_guide = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = .8),
		panel.grid.minor = element_line(size = .8))

Let’s break down that code really quickly. First we put the versidf dataframe into a ggplot object with petal lengths on the x-axis aesthetic. Next we start adding layers. First we have geom_density() to show us how the data is distributed; the size is increased to 1 for ease of reading. Second we have geom_vline() to create vertical lines, with the data set to read from the spread dataframe we made. The aesthetics of the lines are that they are positioned at each value in “vals” and their color and type are defined by the “group” variable. Then we tell ggplot to show the legend for geom_vline() and make the lines size 1 so they’re easier to read. The scale_color_manual() layer references the colorblind palette developed by Masataka Okabe and Kei Ito and adapted for R by Winston Chang. Lastly in the theme() layer we make the grid lines slightly larger.

Measures of Variability – Absolute Deviation

Posted on May 29, 2014 by alex1618

The range, as we discussed last time, isn’t really used as a measure of variability these days and the IQR isn’t terribly informative.

Since I’m a big fan of checking the density of data as a descriptor let’s try that before we continue. Let’s grab those objects we made last time.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

Okay, let’s try it out with both ggplot2 and with the base R functions. Remember that you have to load ggplot2 with the library() function each time you start R!

Just for fun let’s compare the typical Gaussian density with the much more exotic rectangular density. It would be comforting to see that densities are essentially in agreement even when we change something as fundamental as the kernel. After all we know that no description of the data will be perfect so knowing that the description isn’t going to randomly be misleading is nice.

plot(density(petal),col="#E69F00")
lines(density(petal, kernel = "rectangular"), col="#000000")

p <- ggplot(versidf, aes(x = petal))
p + 	geom_density(kernel = 'rectangular', size=1, color="#000000") +
	geom_density(size=1, color="#E69F00") +
	theme(panel.grid.major = element_line(size = 1),
		panel.grid.minor = element_line(size = .8))

Yep those are both in agreement about where the mass of the data is. There are a lot of other densities we could use like the Epanechnikov, triangular, biweight, cosine, and optcosine. They all have particular strengths and weaknesses but they do broadly agree in most cases.

Let’s go back to just using the Gaussian density for now since it’s much easier to read. We can put the measures we developed onto the plot in order to evaluate them, with base R and with ggplot2. Here’s the IQR shown against the density.

plot(density(petal))
abline(v=c(4,4.6), col = '#33CC33')

p <- ggplot(versidf, aes(x = petal))
p + geom_density(size = 1) +
	geom_vline(xintercept = c(4,4.6), size = 1, color = '#33CC33') +
	theme(panel.grid.major = element_line(size = 1),
		panel.grid.minor = element_line(size = .8))

So clearly the IQR is capturing a lot of the density (including the point of highest density) and it does represent a piece of information that is very easy to understand. Unfortunately it isn’t making use of much of the information in the data. Because of this the IQR is not very efficient: it doesn’t generalize very well without a very large sample.

The most popular measures of spread are ones that make use of all of the data points. Of these the most basic is the median absolute deviation. We saw the related concept of the mean absolute deviation in an earlier post when talking about the measures of central tendency. The MAD is the median distance from each data point to the median of the data. Like the median and the IQR the MAD is considered robust, not easily distorted by outliers. It is more efficient than the IQR.

It is somewhat rare to report the strict MAD, though, because the exact number is not of much interest. As a result it is usually scaled to mimic a more popular measure of variability called the standard deviation. With R we can get the strict MAD if we want but it gives us the traditional one by default.

mad(petal)
0.51891

mad(petal, constant = 1)
.35

median(petal)
4.35

For convenience later we will now make a function that gives us x+y and x-y as a conveniently structured dataframe. These sorts of convenience functions are extremely useful if you expect to do something many times.

plusminus <- function(p, q, groups = seq(1,max(length(p),length(q))) ) {
	a <- p-q
	b <- p+q
	c <- c(a,b)
	d <- data.frame(vals = c, group = groups)
	d[order(d$group),]
	}

The code here is very simple; you just have to keep in mind how R treats vectors. There are three arguments: p and q are numeric objects (either scalars or vectors); groups is a vector of the same size as the larger of p and q. In the function itself we subtract q from p, add q to p, and then make all of those results into a single vector. Notice that the vector “c” will contain all of the minus values followed by all of the plus values. Finally we put the values into a dataframe along with groups. For the sake of neatness we then order the dataframe so that all of the groups are together.

plusminus(4.35,c(0.52,0.35))
  vals group
1 3.83     1
3 4.87     1
2 4.00     2
4 4.70     2

You can see immediately that, in this case, the strict MAD (group 2) is about the same as the IQR while the traditional MAD is obviously quite different. Let’s compare the traditional MAD with the IQR in an image. We saw the cbPalette vector of color codes a while back, I’ve deleted the first one so that the lines stand out a little better here.

spread <- plusminus(c(4.3,4.35), c(.3,.52), c('IQR','MAD'))
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

p <- ggplot(versidf, aes(x=petal))
p + 	geom_density(size=1) +
	geom_vline(data=spread,
		aes(xintercept=vals,
		color = group),
		show_guide = TRUE, size = 1) +
	scale_color_manual(values=cbPalette) +
	theme(panel.grid.major = element_line(size = 1),
		panel.grid.minor = element_line(size = .8))

Measures of Variability – The Ranges

Posted on May 24, 2014 by alex1618

The measures of central tendency tell us how far from zero the data tends to be. The measures of variability tell us how far from the center the data tends to be. Obviously we know that all of the data isn’t exactly at the mean or median. Just for the sake of illustration let’s look at some sample data from the iris dataframe. This will also give us a chance to look at more ggplot2 code.

For clarity let’s start by defining a few objects we’ll be using later.

versi <- iris[iris$Species=='versicolor',]
petal <- versi$Petal.Length
y <- 1:50
versidf <- data.frame(y, petal)

versi will hold all the information about the versicolor irises, petal then extracts just the information about the length of the petals. y is just a sequence from 1 to 50 to help us keep track of individual observations. Finally we make versidf to hold just the petal lengths and the assigned numbers.

p <- ggplot(versidf, aes(y = y, x = petal))
p + geom_point(size = 3) + 
	theme(panel.grid.major = element_line(size = 1)) +
	theme(panel.grid.minor = element_line(size = 1)) +
	ylab("") +
	scale_y_continuous(labels=NULL) +
	theme(axis.ticks.y = element_blank()) +
	geom_vline(xintercept = 4.26, size = 1.5) +
	scale_x_continuous(limits=c(3,5.5))

Most of the ggplot code there is just getting the plot to look exactly the way I wanted. The familiar theme elements make the grid lines a nicer size. Next we have several lines getting rid of everything that would normally be on the y-axis since there’s no information on the y-axis; we’re just separating out the various data points so we can see them. Finally we use geom_vline() to make a vertical line at 4.26 (the mean of the petal lengths).

Clearly the data isn’t all in the same place.

When we discussed the mean we touched on the idea of the mean absolute error, which measures how far from the mean the data is, on average. This could be used as a measure of variability but isn’t common. We’ll come back to a cousin of it later in our discussion of variability.

The simplest measure of variation is the range, the distance between the lowest value and the highest value. With the range() function we can retrieve the min() and max() values of the data. The diff() function finds the difference between them.

petal <- versi$Petal.Length

range(petal)
3.0 5.1

diff(range(petal))
2.1

This isn’t a very good description of variability. While it gives a vague notion of where the data is (everything is within the range), it doesn’t tell us where it clusters within the range. One mutant flower with a petal far outside the current limits would totally alter the range despite not reflecting where the majority of the data is.

The main reason that people look at the range is because in certain kinds of data missing information will be coded as extreme values like 9999 in order to warn people that something has to be dealt with. This is done in order to avoid a program silently removing values that are coded as missing.

A more effective way to measure variability is with the interquartile range (often IQR). The 1st and 3rd quartiles cut off the bottom 25% and top 25% of the data (the median is the 2nd quartile) so the middle half of the data is between them. It is somewhat better than the range but not enormously so.

The easiest way to get the IQR is with the summary() and IQR() functions. With summary() you can see the 1st and 3rd quartile (along with the min, max, median, and mean).

summary(petal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00    4.00    4.35    4.26    4.60    5.10
 
IQR(petal)
0.6
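
The quartiles can also be pulled out directly with quantile(), which makes the connection to IQR() explicit. A minimal sketch using the same petal lengths (q is just a name chosen for this example):

```r
petal <- iris$Petal.Length[iris$Species == "versicolor"]

# 1st and 3rd quartiles
q <- quantile(petal, probs = c(0.25, 0.75))
q

# Their difference is the same number IQR(petal) reports
unname(diff(q))
```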

The IQR has the advantage of cutting off extreme values so a single mutant flower isn’t going to throw it off. In fact the IQR is so popular that it has been enshrined in the boxplot.

Let’s look at two ways to visualize the IQR. First we just highlight the data.

vedf <- with(versidf, versidf[petal >= 4 & petal <= 4.6,])

p <- ggplot(versidf, aes(y = y, x = petal))
p + geom_point(size = 3) + geom_vline(xintercept = 4.26, size = 1.5) +
	theme(panel.grid.major = element_line(size = 1)) +
	theme(panel.grid.minor = element_line(size = 1)) +
	ylab("") +
	scale_y_continuous(labels=NULL) +
	theme(axis.ticks.y = element_blank()) +
	geom_point(data = vedf, aes(y = y, x= petal),color='red',size=3)


As a boxplot.

p <- ggplot(versidf, aes(y = petal, x = ''))
p + geom_boxplot() +
	scale_x_discrete(breaks=NULL)

The boxplot() function in base R and the geom_boxplot() layer in ggplot2 will automatically create a Tukey Boxplot. The edges of the box are at the 1st and 3rd quartiles. The middle line is the median. The whiskers go to the most extreme data points that are no more than 1.5 times the IQR from the 1st and 3rd quartiles. Outliers are shown as dots.
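
The rules behind the Tukey boxplot can be worked out by hand. The sketch below computes the fences, whiskers, and outliers for the petal lengths using quantile(); the variable names are made up for this example, and base R's boxplot() uses a slightly different quartile rule (hinges), so results can differ a little on other data.

```r
petal <- iris$Petal.Length[iris$Species == "versicolor"]

q <- quantile(petal, c(0.25, 0.75))

# Fences sit 1.5 IQRs beyond the quartiles
fences <- c(q[1] - 1.5 * IQR(petal), q[2] + 1.5 * IQR(petal))

# Whiskers stop at the most extreme observations inside the fences
inside <- petal >= fences[1] & petal <= fences[2]
whiskers <- range(petal[inside])

# Anything beyond the fences is drawn as a dot
outliers <- petal[!inside]

whiskers
outliers
```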

Boxplots have both strengths and weaknesses. Compared to violin plots or densities they show us less information about the data, however they are less subjective and showing the IQR can be very useful. Identifying outliers via Tukey’s method is also a nice touch.

Central Tendency – Other Measures

Posted on May 16, 2014 by alex1618

Although there is a laundry list of measures of central tendency, most are fairly specialized. The trimmed mean and Winsorized mean are meant to be better descriptors of the data than the mean and more efficient than the median.

The geometric and harmonic means are specialized to describe certain kinds of data.

The barycenter, centroid, geometric median, and medoid are used for describing the center point for data that has several dimensions to it.

The rivers data shows the lengths, in miles, of 141 major rivers in North America.

str(rivers)
plot(density(rivers))

Yep, the density is unimodal.

mean(rivers)
591.1844

median(rivers)
425

with(density(rivers),x[y==max(y)]) # Code we used to find the mode.
334.6666

That isn’t a good sign! All three of our measures of central tendency are giving us wildly different values.

The trimmed means and Winsorized means are simple concepts: the idea is to remove the influence of the most extreme values in the data. With a trimmed mean we remove the largest values and the smallest values entirely, while with a Winsorized mean we replace those values with whatever our two cutoff points were.

The interquartile mean is a popular form of trimmed or Winsorized mean. We remove the top 25% of the data and the bottom 25% of the data. With the summary() function finding the quartiles is easy.

summary(rivers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  135.0   310.0   425.0   591.2   680.0  3710.0

rivTrim <- rivers[rivers > 310 & rivers < 680]

Try using the sort() function on the rivers data and the rivTrim data so you can see the data that has been trimmed from the edges. For example: the 3710 mile Mississippi river is no longer there. You can see the same thing by checking the density.

Making the data for the Winsorized mean is also easy. We take the 1st quartile and the 3rd quartile (which we just used) and use them to fill in for the values we trimmed.

rivWins <- c(rivTrim, rep(310,35), rep(680,35))

Now let’s take a look at those two measures.

mean(rivTrim)
448.6087
mean(rivWins)
471.9712

plot(density(rivers))
abline(v = c(448, 472), col = c('green','purple'))

They agree pretty well and they happen to agree with the median as well.
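
Base R can do most of this work for us. As a sketch, the trim argument of mean() drops a fraction of the data from each end, and pmin() with pmax() clamps values to the quartiles instead of dropping them. The name rivClamp is made up here, and because values tied with the cutoffs are handled a little differently than in the manual version, the results may not match the numbers above exactly.

```r
# Trimmed mean: drop 25% of the observations from each end before averaging
mean(rivers, trim = 0.25)

# Winsorized mean: clamp values to the quartiles instead of dropping them
q <- unname(quantile(rivers, c(0.25, 0.75)))
rivClamp <- pmin(pmax(rivers, q[1]), q[2])
mean(rivClamp)
```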

While the measures of central tendency for one dimensional data have a few well agreed upon standards this is less true of multidimensional data. The centroid and barycenter are the only two well agreed upon measurements, because they are clear analogues of the mean. The centroid is easy to find: it is the point where the means of the variables intersect.

plot(faithful)
abline(v = mean(faithful$eruptions), h = mean(faithful$waiting))
points(x = mean(faithful$eruptions), y = mean(faithful$waiting), cex=1.5, pch=19, col='red')

The barycenter is the weighted average of the data and corresponds to the center of mass of a physical system.

The medoid and geometric median, which are both analogous to the median in two dimensions, are unfortunately subject to many competing definitions, none of which are quite as trivial to calculate as the centroid. Minimizing the absolute errors requires deciding on how you wish to measure distance; typically either the Euclidean distance (a straight line) or the Manhattan distance (the sum of the absolute differences along each axis) is desired. The medoid is defined as the point from the data that accomplishes this best while the geometric median is simply whatever point does it best.
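
One of the simpler definitions of the medoid is easy to sketch with dist(): it is the observation whose total distance to all the others is smallest. The code below tries both distance measures on the faithful data. It is only one of the competing definitions, and since the variables are left unscaled the waiting times (which have the larger numbers) dominate the distances.

```r
# Pairwise distances between all observations under two metrics
d_euc <- as.matrix(dist(faithful, method = "euclidean"))
d_man <- as.matrix(dist(faithful, method = "manhattan"))

# The medoid is the observation with the smallest total distance to the others
medoid_euc <- faithful[which.min(rowSums(d_euc)), ]
medoid_man <- faithful[which.min(rowSums(d_man)), ]

medoid_euc
medoid_man
```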

The geometric mean and harmonic mean are rather specialized concepts that aren’t used heavily as descriptive statistics. Their main uses are for efficiently solving certain kinds of problems rather than taking the time to transform data. In a later post we will look at how to use them as well as the concept that connects them to the arithmetic mean, the generalized mean.

Next week we’ll be covering graphics! How to output nice images! How to use the ggplot2 package!

Measures of Central Tendency – Efficient Estimator

Posted on May 14, 2014 by alex1618

Before we move on to some of the less common measures of central tendency let’s compare the efficiency of the mean and median. We would like for our description of the data to reflect not just the data but also the real world. This is important because there are few things we can truly take a census of, in practice we must rely upon samples.

This seems, in the abstract, like a difficult thing to determine but techniques for it do exist.

A wonderfully simple R program by Ajay Shah lets us study a simple example.

http://www.mayin.org/ajayshah/KB/R/html/p1.html

Let’s go through a slightly modified version of the code line by line, all of it will be collected at the end of the post for you to copy down. The ggplot2 graphics package will be used to make a nice plot of the information.

This custom function does a specialized task very quickly. We make a sequence that goes from 0 to 1000, select thirty random numbers from it, then calculate the mean and median of that sample. It is important to know that the mean and median of that sequence are exactly 500. Since we’re taking a sample of thirty we’re probably not going to get a mean or median that is exactly correct.

one.simulation <- function(N=30) {
	s <- seq(0,1000,1)
	x <- sample(s,N)
  	return(c(mean(x), median(x)))
}

The replicate() function runs our function one hundred thousand times with different random samples. It creates a matrix with 2 rows and 100000 columns: the first row holds the means and the second row holds the medians.

results <- replicate(100000, one.simulation(30))

Now we use the density() function to find the distribution of the means and medians. The density should be very familiar by now. Hopefully the means and medians will tend to be close to 500.

k1 <- density(results[1,])
k2 <- density(results[2,])

Ajay Shah then provides some code to plot the densities with base R. Here is the whole program with the actual functional part taking up a mere eight lines.

one.simulation <- function(N=30) {
	s <- seq(0,1000,1)
	x <- sample(s,N)
  	return(c(mean(x), median(x)))
}

results <- replicate(100000, one.simulation(30))

k1 <- density(results[1,])
k2 <- density(results[2,])

xrange <- range(k1$x, k2$x)
plot(k1$x, k1$y, xlim=xrange, type="l", xlab="Estimated value", ylab="")
grid()
lines(k2$x, k2$y, col="red")
abline(v=500)
legend(x="topleft", bty="n",
       lty=c(1,1),
       col=c("black", "red"),
       legend=c("Mean", "Median"))

With the help of ggplot2 the visualization looks like this.

The mean is much more concentrated than the median, and that is what efficiency is. The mean of a sample is a better estimate of the mean of the population it comes from than the median of a sample is for the population median. This is another reason that the mean is sometimes preferred over the median despite criticisms.
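
The difference in concentration can be put into numbers as well. This sketch reruns the simulation with fewer replications (to keep it quick) and an arbitrary seed, then compares the standard deviation of the simulated means with that of the simulated medians. The name spread is made up for this example.

```r
one.simulation <- function(N = 30) {
	x <- sample(seq(0, 1000, 1), N)
	c(mean(x), median(x))
}

set.seed(42)                                     # arbitrary seed, for reproducibility
results <- replicate(10000, one.simulation(30))  # fewer runs than above

dim(results)   # 2 rows (mean, median) by 10000 columns

# Standard deviation of each estimator across the simulated samples
spread <- apply(results, 1, sd)
names(spread) <- c("mean", "median")
spread
```

The estimator with the smaller standard deviation is the more efficient one for this kind of data.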

Central Tendency – The Median and Mean

Posted on May 12, 2014 by alex1618

When people say “average” they are generally referring to what mathematicians call the arithmetic mean. They might also be talking about the mode, the median, the trimmed mean, the winsorized mean, the centroid or medoid, or (in rare cases where they are relevant) the geometric mean or harmonic mean.

In these posts we will use R as a teaching tool so that we can work with the various kinds of average hands on. We’ve already taken a look at the mode, which is the most common number in the data. Today we will take a look at the mean and median which are, by far, the most popular measures of central tendency.

What do the mean and median do? Oddly few people bother to address this when discussing the measures of central tendency. It’s important. It defines exactly what they really are.

The purpose of the arithmetic mean is to “balance” the data.

To find the mean we add up all of the values of the data and divide by the number of data points. It is the familiar average you probably learned in grade school.

Finding this point means determining where the errors add up to zero. The total error on either side of the mean is the same.

a <- c(13,1,8,4,10,0,5,3,10)
mean(a)
6

a-6
7 -5  2 -2  4 -6 -1 -3  4

Those numbers do indeed sum to zero but how should we interpret that fact?

Let’s say that we’re working in construction and the project will last one year. Our weekly progress ought to be determined based on the mean amount of work done each week. If the mean falls below 1.9% a week we are not on track to finish. Some weeks might be slow and some weeks might be fast but the mean (and no other number) is what tells us if we’re keeping on schedule.

The mean is also popular because it minimizes the root mean squared error. The RMSE is important for a variety of reasons in more advanced statistics but isn’t particularly relevant to the description of data.
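If you would like to see that claim rather than take it on faith, here is a rough sketch using base R’s optimize() to search for the value with the smallest RMSE. The helper rmse and the object best are names I made up for the example.

```r
a <- c(13, 1, 8, 4, 10, 0, 5, 3, 10)

# Root mean squared error of the data around a candidate value m.
rmse <- function(m) sqrt(mean((a - m)^2))

# Search between the smallest and largest values in the data.
best <- optimize(rmse, interval = range(a))

best$minimum  # the value that minimizes the RMSE, up to numerical tolerance
mean(a)       # compare with the arithmetic mean
```

The search is numerical, so expect the two numbers to agree only to a few decimal places.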

The purpose of the median is to find the number for which half of the data is greater and half of the data is smaller. It is the “middle” number.

a <- c(13,1,8,4,10,0,5,3,10)
sort(a)
0  1  3  4  5  8 10 10 13  #We can see 5 right there in the center once we sort the data.

median(a)
5

Somewhat coincidentally, but perhaps more importantly, the median minimizes the mean absolute error, a relative of the RMSE that actually does matter when describing data.

The absolute error is the distance from a value to a given datapoint. The mean absolute error, then, is the mean of all the absolute errors. The number that minimizes the MAE is the number that is, on average, closest to all of the numbers in the data! That’s why the median is so effective at showing the typical value of the data.

Let’s quickly work through this with our simple data.

a <- c(13,1,8,4,10,0,5,3,10)
median(a)
5

abs(a-5)
8 4 3 1 5 5 0 2 5

The median is eight less than thirteen, four more than one, and so on. Those are the absolute errors. Consequently the MAE is as follows.

mean(abs(a-5))
3.666667

It is possible to visualize what the data looks like from the point of view of the MAE very nicely in order to see that the median minimizes it.

The MAE clearly shrinks as we approach the median and rises as we move away from it.
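For readers who want to draw a picture like this themselves, here is a minimal base R sketch. It computes the MAE for a grid of candidate values and marks the median. The grid spacing of 0.5 and the names candidates and mae are my own choices.

```r
a <- c(13, 1, 8, 4, 10, 0, 5, 3, 10)

# Candidate "typical values" spanning the range of the data.
candidates <- seq(0, 13, by = 0.5)

# Mean absolute error of the data around each candidate.
mae <- sapply(candidates, function(m) mean(abs(a - m)))

# The candidate with the smallest MAE.
candidates[which.min(mae)]

plot(candidates, mae, type = "l", xlab = "Candidate value", ylab = "MAE")
abline(v = median(a), col = "blue")  # the median sits at the bottom of the curve
```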

You will often see the assertion that the median is a better description of the typical value in a dataset than the mean is. While this is often true, and it certainly tracks better with how we might want to describe typicality, it is far from a panacea. Indeed believing that the median is simply better than the mean is a dangerous trap because attempting to describe data in terms of a single number rarely ends well. For thousands of years we had no choice. Thanks to computers, however, calculating a better description is just as easy as calculating the mean and median. We introduced the idea of probability density in the last post. Where the density is higher the data is more tightly clustered.

Let’s take a look at the duration of eruptions in the faithful data. The visualization here is done with the ggplot2 package to mimic the output of the code provided.

plot(density(faithful$eruptions))
abline(v=mean(faithful$eruptions),col='red')
abline(v=median(faithful$eruptions),col='blue')

Would you say that the mean or the median is a better description of this data? Obviously we could talk about optimization as much as we want and go in circles. The faithful data is clustered around two values (in technical terms we call it bimodal because there are two peaks). The mean and median are both terrible descriptions of this data!
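The two peaks can also be pulled out of the density numerically. This is a rough sketch that simply looks for points where the curve stops rising and starts falling. It is not a formal test of bimodality, and the names d and peaks are mine.

```r
d <- density(faithful$eruptions)

# A local maximum is where the slope changes from positive to negative.
peaks <- which(diff(sign(diff(d$y))) == -2) + 1

d$x[peaks]  # eruption durations where the density peaks

# For comparison, the two single-number summaries.
mean(faithful$eruptions)
median(faithful$eruptions)
```

Compare where the peaks sit with where the mean and median sit to see how poorly a single number summarizes this data.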

Reporting the density or histogram of some data is very simple, contains a great deal of information, can be supplemented by whatever summary statistics we like, and has the added benefit of being more visually engaging than numbers. It is a window into the data both for the public and researchers.

In any event the takeaway from this post should be that the mean and median are very different things with consequently different interpretations. The mean misses the numbers above it by the same degree as the numbers below it while the median is the mathematical center of the data. Neither is a complete description of the data.

In the next post we will take a look at the notion of efficiency and why it matters.

Central Tendency – The Mode and Density

Posted on May 9, 2014 by alex1618

A very common goal in descriptive statistics is to determine the value that is typical or average. By far the three most common measures of central tendency are the mode, the median, and the mean. We call them measures of central tendency because data tends to form clumps and we would like to measure the center of that clump, where most of the data is.

In this series of posts we will use R as a teaching tool to explore central tendency and related concepts.

Because the mode is very simple this post will use R to introduce it along with the histograms and probability density.

The mode is, in common use, not actually a measure of centrality. It looks for the most common value that occurs in categorical data. In practice you can usually determine the mode by looking at a table without the need for any math. In fact R doesn’t even have a function for calculating the mode. (There is a mode() function but it does something else)

Using the table() function it is fairly easy to identify the mode of a vector. Factors can quickly and easily be read as vectors with the as.vector() function. Here we look at the nationalities of racers from the boston data.

table(as.vector(boston$Country))

ESP GBR JPN RSA SUI USA 
  1   1   3   1   2   2 

Visualizations are another reasonable alternative, although base R doesn’t make it easy to do. Fortunately the ggplot2 package has this functionality built in to the geom_bar() layer of its graphics.

library(ggplot2)
p <- ggplot(boston,aes(x=Country))
p + geom_bar()

Clearly Japanese racers were most common at the marathon.

Finally in the rare case where you need to extract the mode from the data directly a custom function is your best bet. This truly wonderful function was made public by Ken Williams on stackoverflow a few years ago:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Mode(boston$Country)
[1] JPN
Levels: ESP GBR JPN RSA SUI USA

Notice that Williams’ function is called ‘Mode’ not ‘mode’ so that R will accept it.

Understanding why the mode is considered a measure of central tendency requires us to delve deeper into the field of statistics. While it is an oft repeated truism that the mode is only for categorical data it can, in effect, be used for any kind of data.

The usual argument is that other measures are better for ordinal and interval data and that the mode simply doesn’t exist for interval data. Thanks to the rise of personal computers and software like R it’s very easy to show concepts that are otherwise quite abstract. Let’s look at a histogram of some data. What is the modal miles per gallon of cars in the mtcars data?

hist(mtcars$mpg)

Looking at the histogram we might be tempted to say that the mode is “between 15 and 20 mpg” but that would be a mistake. We can vary the number of “bins” that the histogram has and, moreover, a histogram suffers from somewhat arbitrary limitations. R likes to make bins with widths that are round numbers and which start at round numbers. Nonetheless histograms give us a good starting place.
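To see how much the answer depends on the bins, try something like the sketch below. It asks hist() for the bin counts without drawing anything, once with R’s default breaks and once with narrower bins. The width of 2.5 mpg is an arbitrary choice of mine, as are the names h1 and h2.

```r
# Default bins chosen by R.
h1 <- hist(mtcars$mpg, plot = FALSE)

# Narrower bins, 2.5 mpg wide, chosen arbitrarily for comparison.
h2 <- hist(mtcars$mpg, breaks = seq(10, 35, by = 2.5), plot = FALSE)

# The midpoint of the fullest bin under each choice.
h1$mids[which.max(h1$counts)]
h2$mids[which.max(h2$counts)]
```

If the fullest bin moves when the breaks change, that is the arbitrariness we are trying to get away from.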

What would be better than a histogram is something we can mathematically optimize and which would give us a single value. The probability density does this very nicely. A few decades ago it was impractical to calculate a density for anything but a pre-defined function. Computers allow the use of a process called kernel density estimation to determine it for any arbitrary data. Let’s take a look at how R does it.

plot(density(mtcars$mpg))

Rather than many discrete bins R is showing us a continuous curve. The density is a measure of how densely packed the data points are around each value. The exact value of the density is difficult to interpret and generally unimportant.

What we see here is that the density is highest at about 18 mpg.

Extracting the precise point of highest density is actually pretty simple. The density() function produces a bunch of information.

mpg <- density(mtcars$mpg)
str(mpg)
List of 7
 $ x        : num [1:512] 2.97 3.05 3.12 3.2 3.27 ...
 $ y        : num [1:512] 0.000114 0.000125 0.000137 0.00015 0.000164 ...
 $ bw       : num 2.48
 $ n        : int 32
 $ call     : language density.default(x = mtcars$mpg)
 $ data.name: chr "mtcars$mpg"
 $ has.na   : logi FALSE
 - attr(*, "class")= chr "density"

The x and y coordinates are what R uses to plot the image. We can use our ever useful friend, the subsetting notation, to find the value of x where y is equal to the greatest value of y. We’re looking for the point of highest density… the mode.

with(mpg,x[y==max(y)])
17.90862

Are there any cars that get a mileage of 17.90862 in the data?

No, there are not. The density is making a prediction about what we would see if we tested many millions of cars.

However we probably should not trust that value!

While probability densities are extremely useful the exact numbers are not reliable since there are a variety of kernels that can be used to estimate them. The gaussian (which we’ve just seen) and Epanechnikov kernels are the most popular. They produce slightly different shapes with slightly different modes.

mpgep <- density(mtcars$mpg,kernel="epanechnikov")
with(mpgep,x[y==max(y)])
18.13383

The shape is obviously different, although the similarity should be clear and we still estimate the mode as being very close to 18 mpg.

Use the ? function to call up the help menu for density and try out some of the less popular kernels for calculating the density. With plot() you can check out the shape of those alternative methods.
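If you would rather compare them all at once, here is one possible sketch that loops over the kernel names listed in the help page for density() and records the point of highest density for each. The names kernels and modes are mine.

```r
# Kernel names accepted by density(), as listed in ?density.
kernels <- c("gaussian", "epanechnikov", "rectangular",
             "triangular", "biweight", "cosine", "optcosine")

# For each kernel, find the x value where the estimated density is highest.
modes <- sapply(kernels, function(k) {
  d <- density(mtcars$mpg, kernel = k)
  d$x[which.max(d$y)]
})

modes
```

Here which.max() returns the first position of the largest y value, which matters for a kernel like the rectangular one where the top of the curve can be flat.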

We will come back to probability density in a later post and discuss them in more detail. For now the density is important because we will use it to compare the other measures of central tendency.

Even if the point of highest density were entirely stable there wouldn’t be much call for using it because it lacks any useful mathematical properties. True, it is strictly more “typical” than anything else but very little can be done with that. The mean and median, on top of being trivial to calculate, are very functional. When we look at the mean and median we’ll see what those are.

Structure of Data

Posted on May 7, 2014 by alex1618

Before we can really get a handle on concepts in data science we need to understand the structure of data, both in R and in general. There are a lot of great datasets in the history of statistics but for understanding the basics of data nothing beats information from a race, so for this post we will look at the boston data we made in our introduction to dataframes.

Here are the contents of the dataframe.

boston
   Place    Sex   Time Country
1      1   Male  80.60     RSA
2      2   Male  81.23     JPN
3      3   Male  81.23     JPN
4      4   Male  84.65     SUI
5      5   Male  84.70     ESP
6      1 Female  95.10     USA
7      2 Female  97.40     JPN
8      3 Female  98.55     USA
9      4 Female  99.65     SUI
10     5 Female 101.70     GBR

And here is the structure of the dataframe.

str(boston)
'data.frame':   10 obs. of  4 variables:
 $ Place  : int  1 2 3 4 5 1 2 3 4 5
 $ Sex    : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 1
 $ Time   : num  80.6 81.2 81.2 84.7 84.7 ...
 $ Country: Factor w/ 6 levels "ESP","GBR","JPN",..: 4 3 3 5 1 6 3 6 5 2

As you can see there are several different kinds of data represented here. R identifies them as an integer, two factors, and a numeric although these aren’t the common names.

Let’s begin by looking at the factors.

In general information that R stores as a factor is Categorical data. Both Country and Sex are categorical data. Being from Spain (ESP) is different from being from Switzerland (SUI) but that’s all we can say about the difference. Notice how R describes the factors.

Country: Factor w/ 6 levels "ESP","GBR","JPN",..: 4 3 3 5 1 6 3 6 5 2

It is a factor with 6 levels, the first few of which are listed followed by a list of numbers. In order to save space with huge datasets R stores factors as a list of names (in alphabetical order) along with corresponding numbers.
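You can look at both halves of that storage directly. The sketch below rebuilds just the Country column as a stand-alone factor (the name country is mine) and then asks for the level names and the underlying numbers.

```r
# The Country column from the boston data, rebuilt as its own factor.
country <- factor(c("RSA", "JPN", "JPN", "SUI", "ESP",
                    "USA", "JPN", "USA", "SUI", "GBR"))

levels(country)      # the names, in alphabetical order
as.integer(country)  # the number R stores for each racer: 4 3 3 5 1 6 3 6 5 2
```

The numbers are positions in the list of levels, so 4 means RSA, the fourth name alphabetically.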

Although they are both factors there is a difference between Sex and Country. Sex is also a Dichotomous variable: it has exactly two possible values (since the Boston Athletic Association records sex only as female or male). Certain kinds of analysis can only be done with dichotomous variables and many kinds are impossible.

Next let’s take a look at the integer data. The Place variable is an example of Ordinal data: it has an order to it but no more information. The woman who came in first was ahead of the woman who came in second but their rank doesn’t tell us anything more than that. Ordinal data is also discrete. There is no such thing as 1.33th place in a race, although there might be a tie, in which case several people get 1st.

Place  : int  1 2 3 4 5 1 2 3 4 5

What does it mean that R stores this kind of data as an integer? In computing there is an important difference between integers and what are known as floating-point numbers. Here the relevant difference is that integers take less space to store and are only whole numbers.
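A quick way to see the difference for yourself is sketched below. The vectors place and time are small stand-ins I typed in from the boston data, and the L suffix is how R is told that a number should be stored as an integer.

```r
place <- c(1L, 2L, 3L, 4L, 5L)                # the L suffix asks for integers
time  <- c(80.60, 81.23, 81.23, 84.65, 84.70) # ordinary numerics

typeof(place)  # "integer"
typeof(time)   # "double", R's name for floating-point storage

# Compare the memory used by a thousand of each kind of number.
object.size(integer(1000))
object.size(numeric(1000))
```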

Finally there is the Time variable which R stores as a numeric. It is Interval data. This means that it has magnitude as well as rank. The woman in third took 3.45 minutes longer to finish than the woman who came in first, so we know both that she was slower and how much slower she was. Take a closer look and notice that the values are not all whole numbers.

Time   : num  80.6 81.2 81.2 84.7 84.7 ...

The Time variable is also an example of Ratio data which means we can say things like “the fastest woman was 1.07 times as fast as the slowest woman”. This is possible because the zero point corresponds to an actual zero, a person with a time of 0.0 would finish the race instantly.

Isn’t that true of all interval data, though? No! Most temperature scales do not have a meaningful zero point. When it is 40° Celsius out it isn’t twice as hot as when it is 20° out because 0° doesn’t mean there is no heat at all, just that water will freeze. In fact, since absolute zero is -273.15°, for it to be twice as hot as 20° it would have to be over 300° out!

In fact 40° is merely 1.07 times as hot as 20°. Human beings happen to be incredibly sensitive to changes in temperature.
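The arithmetic behind those two claims is easy to check by shifting Celsius onto the Kelvin scale, where zero really does mean no heat. The names celsius and kelvin are just for this sketch.

```r
celsius <- c(20, 40)
kelvin  <- celsius + 273.15  # absolute zero is -273.15 degrees Celsius

# How many times as hot is 40 degrees as 20 degrees?
kelvin[2] / kelvin[1]        # roughly 1.07

# The Celsius temperature that is truly twice as hot as 20 degrees.
2 * kelvin[1] - 273.15       # a little over 313
```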

Let’s take a look at the types of data in another dataframe. mtcars comes built in to R.

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

What kinds of data do we see here?

If you use the str() function to look at the structure of the dataframe you’ll see that all of the variables are stored as numerics! Nonetheless it is evident that they are different kinds of data. A car must have an even number of cylinders so the cyl variable is really ordinal: it has discrete values. Meanwhile mpg, wt, and qsec are examples of ratio data. Interestingly the am variable is categorical, since it says whether the transmission is automatic or manual, even though it is stored as a numeric.
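If you want R to treat am as the categorical variable it really is, one option is to convert it to a factor. This is only a sketch and the name transmission is mine. The help page for mtcars gives the coding as 0 for automatic and 1 for manual.

```r
# Recode the 0/1 numeric as a labelled factor (coding taken from ?mtcars).
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

str(transmission)    # now a factor with two levels
table(transmission)  # how many cars of each kind
```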

Remember that data is best understood based on its properties rather than by looking at the choices made by the person who recorded it. They may have had all sorts of reasons to do what they did.

There is some advice often given about these various levels of measurement which says that the mode ought to be used for categorical data, the median for ordinal data, and the mean for interval or ratio data. Unfortunately these suggestions are unhelpful at best and probably misleading. We’ll go over why as we take a look at the measures of central tendency.

Intuitor

Getting a grip on statistics, science, and philosophy.
