Lies, Damned Lies, and the Truth – Descriptive Statistics

Descriptive statistics is the field that seeks to characterize information. It does not generally seek to make predictions. The fundamental goal is to turn data, of which we now have mountains, into something actually useful. In this series of posts we will start with the measures of central tendency (the various averages), then move on to the higher “moments” and their analogues.

Why should we study descriptive statistics at all, though? Surely it would be better to let the data speak for itself rather than use a bunch of complicated formulas that will just confuse and mislead people? Maybe there’s no point in trying to teach the typical person this kind of thing and we should leave it up to trusted experts.

We need descriptive statistics because nothing speaks for itself (and even if it did there is no real degree of understanding that a human being can gain from looking at thousands of lines of data). Ultimately everything is a matter of interpretation. People who say they are not providing an interpretation are either lying or deluded. Worse, failing to accept that knowledge comes from interpretation of information blinds people to the actual content of their beliefs. Do you know what the median is and why we use it? Exactly what does it tell us?
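To make the median question concrete, here is a small R example with invented numbers showing why the median is so often preferred: one extreme value drags the mean far from the bulk of the data but leaves the median untouched.

```r
# Seven invented incomes (in thousands); the last is an extreme outlier.
incomes <- c(20, 25, 30, 35, 40, 45, 1000)

mean(incomes)    # about 170.7 -- pulled far above the typical value
median(incomes)  # 35 -- the middle of the sorted data, unmoved by the outlier
```

Six of the seven values sit between 20 and 45, yet the mean reports roughly 170.7; the median, 35, is a far better summary of a "typical" value here.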

Since descriptive statistics are the only hope we have of understanding the world around us, it is critical that everyone have at least a basic understanding of the field.

In that spirit this series of posts will focus on explaining not just the names of the various statistics and when it’s best to use them but also what they mean and where they come from. We will look at traditional statistical methods, robust statistics, and a few methods that are available thanks to the rise of personal computers.

Throughout this series I will also be providing sample R code so that people who want to follow along after reading the [i]Absolute Introduction to R[/i] can do so.

I’d also like to take this chance to introduce a dataset that provides a rich environment in which to play: the Magic data. Most of this data was acquired with the Gatherer Extractor program, arranged into an R-compatible form in Excel 2010, and then finalized with some work in R.

The structure of the Magic Data dataframe, as reported by str():

'data.frame':   6337 obs. of  10 variables:
 $ ID    : int  175097 175042 176444 175038 179432 174941 174848 175147 175009 176452 ...
 $ Rating: num  0.82 0.938 1.107 1.206 1.352 ...
 $ Rare  : Ord.factor w/ 4 levels "C"<"U"<"R"<"M": 1 1 1 1 1 1 1 1 1 1 ...
 $ CMC   : int  4 5 3 3 4 3 5 3 6 6 ...
 $ Set   : Ord.factor w/ 31 levels "MRD"<"DST"<"FDN"<..: 17 17 17 17 17 17 17 17 17 17 ...
 $ Block : Ord.factor w/ 10 levels "MRDBL"<"CHKBL"<..: 6 6 6 6 6 6 6 6 6 6 ...
 $ Price : num  0.16 0.16 0.16 0.16 0.15 0.16 0.16 0.15 0.16 0.15 ...
 $ Year  : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ Color : Factor w/ 32 levels "","B","BG","BR",..: 7 9 2 9 9 6 2 9 9 2 ...
 $ Type  : Factor w/ 16 levels "Ar","ArCr","ArLa",..: 4 6 16 8 8 6 4 2 2 4 ...
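For anyone puzzled by entries like `Ord.factor w/ 4 levels "C"<"U"<"R"<"M"`, here is a minimal sketch of how such a column is built; the six values are invented for illustration, but the levels match the Rare column above.

```r
# Rarity is an ordered factor: Common < Uncommon < Rare < Mythic.
rare <- factor(c("C", "C", "U", "U", "R", "M"),
               levels = c("C", "U", "R", "M"),
               ordered = TRUE)

rare < "R"    # ordered factors support comparisons: TRUE for the C and U entries
table(rare)   # counts per level, reported in level order
```

An unordered factor would refuse the `<` comparison; declaring the ordering is what lets R treat rarity as a ranked scale rather than a bag of labels.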

This data provides continuous and discrete numerical data, categorical data, a chance to practice with regular expressions, ordered vs unordered factors, and a wide variety of chances to explore visualizations.
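As a taste of the regular-expression practice mentioned above: the Color codes combine single letters (for example "BG" for a black-green card), so a pattern match can pick out every color identity that includes a given color. The short vector here is invented for illustration.

```r
# A handful of color codes like those in the Color factor above.
colors <- c("", "B", "BG", "BR", "G", "W")

# grepl() returns TRUE wherever the pattern matches -- here, any code containing "B".
grepl("B", colors)
# FALSE  TRUE  TRUE  TRUE FALSE FALSE
```

The same one-liner scales to the full dataframe, where it can build a logical index for subsetting all the black cards.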

Since the Magic dataset is larger and less familiar than others, I’ll reintroduce it later with a more detailed explanation of the variables. Nonetheless, here are Dropbox links to the .Rdata and .csv versions.

https://www.dropbox.com/s/r8tr01sp72meuf0/MagicData.Rdata
https://www.dropbox.com/s/rh6uvlumymnz9x3/MagicData.csv
