Summary – Descriptive Statistics Part 3

For the final part of our summary of descriptive statistics, let’s make a function that ties together everything we’ve learned so far. The great thing about functions in R as teaching tools is that they don’t introduce any new concepts. If you want a refresher on functions, check out the intro post.

We’ve seen that the mean and median can both be important, so we’ll report those. The SD and the MAD are both important measures of variability to be aware of. The minimum and maximum sometimes hold important information, besides showing the edges of the data. We also want to know how many data points we have. Finally, we’ve seen that graphics are important, so our function should produce those as well.

This new function, which we’ll call info, takes the dataframe x plus two optional arguments: na.rm to decide if we want to get rid of missing data (on by default) and draw to decide if the function will produce graphics (off by default).

info <- function(x, na.rm=TRUE, draw=FALSE) {

Since the function needs a dataframe in order to work properly, we check the class of the data and stop with an error message if it’s wrong.

if(!is.data.frame(x)) {
	stop("Input must be a dataframe.")
}
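As a side note on this check, class() can return more than one string, which makes a comparison inside if() fragile. A minimal sketch of the difference, using made-up objects (df and mat are illustrative names, not part of the baseball data):

```r
# Illustrative objects; not part of the data used in this post.
df  <- data.frame(a = 1:3, b = c("x", "y", "z"))
mat <- matrix(1:6, nrow = 2)

class(df)            # a single string for a plain dataframe
length(class(mat))   # since R 4.0.0 a matrix reports two classes ("matrix" and "array")

# is.data.frame() always returns a single TRUE or FALSE,
# so it is safe to use as the condition of if().
is.data.frame(df)
is.data.frame(mat)
```

Either test rejects a matrix; the point is only that a single TRUE/FALSE answer is what if() expects.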

The descriptive statistics we have don’t work at all on categorical data, so this next block of code strips out everything except numerics. It also counts how many variables are removed and says so. This is more than just polite: it informs the user about a significant change we’ve made to the dataframe.

lenx1 <- ncol(x)
x <- x[sapply(x, is.numeric)]
lenx2 <- ncol(x)
if(lenx1 != lenx2) {
	print(paste("Warning", lenx1-lenx2, "non-numeric variables removed"))
}
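To see what the filtering step does on its own, here is a small sketch with a made-up dataframe (toy and its column names are hypothetical, chosen only for illustration):

```r
# Toy dataframe with made-up values; the column names are hypothetical.
toy <- data.frame(
	name   = c("A", "B", "C"),
	height = c(70, 72, 75),
	weight = c(180, 200, 210)
)

# sapply() asks is.numeric() about each column and
# returns a named logical vector, one entry per column.
keep <- sapply(toy, is.numeric)
keep

# Indexing with that logical vector keeps only the TRUE columns.
toy_num <- toy[keep]
names(toy_num)

# The difference in column counts is the number of variables removed.
ncol(toy) - ncol(toy_num)
```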

Enough housekeeping, let’s make the function do some real work. Here we start by defining a helper function, len, which counts the number of non-missing values in a variable. Then we use apply() to compute each statistic column by column. Fortunately for us, apply() returns each result labeled with the name of its column, which will save time later.

len <- function(x) {length(na.omit(x))}
l <- apply(x,2,len)
m <- apply(x,2,mean,na.rm=na.rm)
e <- apply(x,2,median,na.rm=na.rm)
s <- apply(x,2,sd,na.rm=na.rm)
d <- apply(x,2,mad,na.rm=na.rm)
a <- apply(x,2,min,na.rm=na.rm)
z <- apply(x,2,max,na.rm=na.rm)
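A quick way to check how these calls behave with missing data is to run them on a tiny example. The sketch below uses a made-up dataframe (toy is a hypothetical name) with one NA in it:

```r
# Made-up values with one missing height.
toy <- data.frame(
	height = c(70, 72, NA, 75),
	weight = c(180, 200, 210, 190)
)

len <- function(x) {length(na.omit(x))}

apply(toy, 2, len)                   # non-missing values per column, labeled by column name
apply(toy, 2, mean, na.rm = TRUE)    # the NA is dropped before averaging
apply(toy, 2, mean, na.rm = FALSE)   # a single NA makes the whole column mean NA
```

Note that len always ignores NAs, whatever value na.rm has, so n reports how many values are actually present.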

Now we collect all of that information, put it into a dataframe, and use format to trim things down to a number of digits that will be easy to read. The print() function forces the output to occur before the function is finished.

sumstats <- format(data.frame(
	n = l, 
	mean = m, 
	med = e, 
	sd = s, 
	mad = d, 
	min = a, 
	max = z), 
		digits = 2)
print(sumstats)
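One thing worth knowing about format() is that it turns numbers into text, so the formatted table is meant for reading rather than for further calculation. A small sketch with made-up numbers (stats and shown are hypothetical names):

```r
# Made-up numbers for illustration only.
stats <- data.frame(
	mean = c(1.23456, 123.456),
	sd   = c(0.98765, 45.678),
	row.names = c("a", "b")
)

shown <- format(stats, digits = 2)

sapply(stats, is.numeric)     # the original columns are numbers
sapply(shown, is.character)   # the formatted columns are text, meant for display
```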

Making the plots is actually pretty easy. Because our input is now purely numeric, calling plot() on the dataframe produces a scatterplot matrix automatically. All we have to do is give it a good title.

The par() function, which may be new to you, sets graphical parameters. In this case ask=TRUE tells R to wait for us to confirm (with a click or a key press) before drawing each new plot, so the plots don’t all flash by at once.

Using a for() loop produces a density plot of each variable and even gives it the appropriate name. The value of i increases by one with each iteration of the loop.

if (draw) {
	plot(x, main = "Scatterplot Matrix")

	par(ask=TRUE)
	
	for(i in seq_len(ncol(x))) {
		plot(density(x[[i]], na.rm=na.rm), 
			main = names(x)[i])
	}
}
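Settings made with par() stay in effect on the graphics device after the function returns. If that is not wanted, one common pattern is to save the old settings and restore them on exit. The sketch below shows the idea with a hypothetical helper, draw_densities, which is not part of info():

```r
# Hypothetical helper, shown only to illustrate restoring par() settings.
draw_densities <- function(x) {
	old <- par(ask = TRUE)   # par() returns the previous settings when it changes them
	on.exit(par(old))        # put them back when the function exits, even after an error

	for (i in seq_len(ncol(x))) {
		plot(density(x[[i]], na.rm = TRUE), main = names(x)[i])
	}
	invisible(NULL)
}
```

Called on a numeric dataframe, this draws one density plot per column and leaves the ask setting as it found it.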

Don’t forget the final bracket to close the function!

The whole thing looks like this:

info <- function(x, na.rm=TRUE, draw=FALSE) {

	if(!is.data.frame(x)) {
		stop("Input must be a dataframe.")
	}

	lenx1 <- ncol(x)
	x <- x[sapply(x,is.numeric)]
	lenx2 <- ncol(x)
	if(lenx1 != lenx2) {
		print(paste("Warning", lenx1-lenx2, "non-numeric variables removed"))
	}

	len <- function(x) {length(na.omit(x))}
	l <- apply(x,2,len)
	m <- apply(x,2,mean,na.rm=na.rm)
	e <- apply(x,2,median,na.rm=na.rm)
	s <- apply(x,2,sd,na.rm=na.rm)
	d <- apply(x,2,mad,na.rm=na.rm)
	a <- apply(x,2,min,na.rm=na.rm)
	z <- apply(x,2,max,na.rm=na.rm)

	sumstats <- format(data.frame(
			n = l, 
			mean = m, 
			med = e, 
			sd = s, 
			mad = d, 
			min = a, 
			max = z), 
				digits = 2)
	print(sumstats)

	if (draw) {
		plot(x, main = "Scatterplot Matrix")

		par(ask=TRUE)
	
		for(i in seq_len(ncol(x))) {
			plot(density(x[[i]], na.rm=na.rm), 
				main = names(x)[i])
		}
	}
}

Let’s try it out with the data we’ve been using that describes baseball players.

info(tab)
[1] "Warning 3 non-numeric variables removed"
          n mean med   sd  mad min max
height 1034   74  74  2.3  3.0  67  83
weight 1033  202 200 21.0 22.2 150 290
age    1034   29  28  4.3  4.1  21  49

A bunch of good stuff is visible here. It tells us that three variables aren’t included in the output. It also shows us that one data point is missing from weight. The output is printed as a dataframe that we can easily read. Note that, as written, info() only prints the table; to save it for later use, the function would also need to return sumstats as its final expression.
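Saving the table means the function has to hand it back as its return value. A minimal sketch of that pattern, using a hypothetical helper col_summary and made-up data (neither is part of info() above):

```r
# Made-up data; col_summary is a hypothetical helper, not part of info().
toy <- data.frame(
	height = c(70, 72, NA, 75),
	weight = c(180, 200, 210, 190)
)

col_summary <- function(x) {
	out <- data.frame(
		n    = sapply(x, function(v) sum(!is.na(v))),
		mean = sapply(x, mean, na.rm = TRUE)
	)
	print(format(out, digits = 2))   # show the rounded table right away
	invisible(out)                   # return the unrounded numbers without printing them again
}

saved <- col_summary(toy)   # prints the table and stores it
saved["weight", "mean"]     # the stored values are still numeric
```

Returning the unformatted dataframe keeps the numbers usable for further calculation, while the printed copy stays easy to read.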

Do the graphics work?

info(tab,draw=TRUE)
[1] "Warning 3 non-numeric variables removed"
          n mean med   sd  mad min max
height 1034   74  74  2.3  3.0  67  83
weight 1033  202 200 21.0 22.2 150 290
age    1034   29  28  4.3  4.1  21  49
Waiting to confirm page change...
Waiting to confirm page change...
Waiting to confirm page change...