Before we can really get a handle on concepts in data science we need to understand the structure of data, both in R and in general. There are a lot of great datasets in the history of statistics, but for understanding the basics of data nothing beats information from a race. For this post we will look at the boston data we made in our introduction to dataframes.
Here are the contents of the dataframe.
boston
   Place    Sex   Time Country
1      1   Male  80.60     RSA
2      2   Male  81.23     JPN
3      3   Male  81.23     JPN
4      4   Male  84.65     SUI
5      5   Male  84.70     ESP
6      1 Female  95.10     USA
7      2 Female  97.40     JPN
8      3 Female  98.55     USA
9      4 Female  99.65     SUI
10     5 Female 101.70     GBR
And here is the structure of the dataframe.
str(boston)
'data.frame': 10 obs. of 4 variables:
 $ Place  : int 1 2 3 4 5 1 2 3 4 5
 $ Sex    : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 1
 $ Time   : num 80.6 81.2 81.2 84.7 84.7 ...
 $ Country: Factor w/ 6 levels "ESP","GBR","JPN",..: 4 3 3 5 1 6 3 6 5 2
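If you don’t have boston on hand from the earlier post, here is a minimal sketch that rebuilds it from the values printed above. The construction is an assumption on my part (the original may have been built differently). The explicit factor() calls matter because, since R 4.0, data.frame() no longer turns strings into factors by default.

```r
# Hypothetical reconstruction of boston from the printed values.
boston <- data.frame(
  Place   = rep(1:5, times = 2),                      # integer ranks within each sex
  Sex     = factor(rep(c("Male", "Female"), each = 5)),
  Time    = c(80.60, 81.23, 81.23, 84.65, 84.70,      # finishing times in minutes
              95.10, 97.40, 98.55, 99.65, 101.70),
  Country = factor(c("RSA", "JPN", "JPN", "SUI", "ESP",
                     "USA", "JPN", "USA", "SUI", "GBR"))
)

str(boston)  # should report one int, one num, and two Factor columns
```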
As you can see, there are several different kinds of data represented here. R identifies them as an integer, two factors, and a numeric, although these are R’s storage types rather than the names statisticians commonly use.
Let’s begin by looking at the factors.
In general, information that R stores as a factor is Categorical data. Both Country and Sex are categorical data. Being from Spain (ESP) is different from being from Switzerland (SUI), but that’s all we can say about the difference. Notice how R describes the factors.
Country: Factor w/ 6 levels "ESP","GBR","JPN",..: 4 3 3 5 1 6 3 6 5 2
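It is a factor with 6 levels, the first few of which are listed, followed by a series of numbers. R stores a factor as a set of level names (in alphabetical order by default) along with one integer code per observation that points to one of those names. The first code here is 4, which means the fourth level, RSA. Storing each name only once saves space with huge datasets.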
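To see the two pieces of a factor separately, here is a small sketch. The country vector is retyped from the table above so the snippet runs on its own.

```r
# Country codes retyped from the boston data so this runs standalone.
country <- factor(c("RSA", "JPN", "JPN", "SUI", "ESP",
                    "USA", "JPN", "USA", "SUI", "GBR"))

levels(country)      # the level names: "ESP" "GBR" "JPN" "RSA" "SUI" "USA"
as.integer(country)  # the stored codes: 4 3 3 5 1 6 3 6 5 2

# Looking each code up in the level names recovers the original labels.
levels(country)[as.integer(country)]
```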
Although they are both factors, there is a difference between Sex and Country. Sex is also a Dichotomous variable: it has exactly two possible values (since the Boston Athletic Association records sex only as female or male). Certain kinds of analysis can only be done with dichotomous variables, and many other kinds cannot be done with them at all.
Next let’s take a look at the integer data. The Place variable is an example of Ordinal data: it has an order to it but no more information. The woman who came in first was ahead of the woman who came in second, but their ranks don’t tell us anything more than that. Ordinal data is also discrete. There is no such thing as 1.33th place in a race, although there might be a tie, in which case several people get 1st.
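R stores Place as a plain integer, but it also has an ordered factor type that behaves the way ordinal data should. The sketch below is an alternative representation for illustration, not how boston was built: comparisons work, while arithmetic on the ranks is refused.

```r
# Illustrative only: the same five ranks stored as an ordered factor.
place <- factor(c(1, 2, 3, 4, 5), ordered = TRUE)

# Order comparisons are meaningful for ordinal data.
first_beats_third <- place[1] < place[3]

# Differences are not: R warns and returns NA for arithmetic on ordered factors.
rank_gap <- suppressWarnings(place[3] - place[1])

first_beats_third
rank_gap
```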
Place : int 1 2 3 4 5 1 2 3 4 5
What does it mean that R stores this kind of data as an integer? In computing there is an important difference between integers and what are known as floating-point numbers. Here the relevant difference is that integers take less space to store and can only be whole numbers.
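A quick way to see the difference for yourself is sketched below. The variable names are made up for the example. On the usual builds of R an integer takes 4 bytes and a double takes 8, so the second vector should come out roughly twice the size of the first.

```r
# The L suffix asks R for integers; bare numbers are floating-point doubles.
place_int <- c(1L, 2L, 3L, 4L, 5L)
place_dbl <- c(1, 2, 3, 4, 5)

typeof(place_int)  # "integer"
typeof(place_dbl)  # "double"

# Compare storage for a million values of each type.
int_bytes <- as.numeric(object.size(integer(1e6)))
dbl_bytes <- as.numeric(object.size(numeric(1e6)))
c(integer = int_bytes, double = dbl_bytes)
```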
Finally there is the Time variable, which R stores as a numeric. It is Interval data. This means that it has magnitude as well as rank. The woman in third took 3.45 minutes longer to finish than the woman who came in first, so we know both that the winner was faster and how much faster she was. Take a closer look and notice that the values are not all whole numbers.
Time : num 80.6 81.2 81.2 84.7 84.7 ...
The Time variable is also an example of Ratio data, which means we can say things like “the fastest woman was 1.07 times as fast as the slowest woman”. This is possible because the zero point corresponds to an actual zero: a person with a time of 0.0 would finish the race instantly.
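Both of those statements about the women’s times can be checked with a couple of lines. The times are retyped from the table so the snippet runs on its own.

```r
# Women's finishing times in minutes, in order of place.
women <- c(95.10, 97.40, 98.55, 99.65, 101.70)

# Interval: differences are meaningful (third place minus first place).
gap <- women[3] - women[1]
gap    # about 3.45 minutes

# Ratio: quotients are meaningful too, because zero really means zero.
ratio <- max(women) / min(women)
ratio  # about 1.07
```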
Isn’t that true of all interval data, though? No! Most temperature scales do not have a meaningful zero point. Take Celsius: when it is 40° out it isn’t twice as hot as when it is 20° out, because 0° doesn’t mean there is no heat at all, just that water will freeze. In fact, since absolute zero is -273.15°, for it to be twice as hot as 20° it would have to be over 300° out!
Measured from absolute zero, 40° is merely 1.07 times as hot as 20°. Human beings happen to be incredibly sensitive to changes in temperature.
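The arithmetic behind those figures is just a shift to the Kelvin scale, which does start at absolute zero. A minimal sketch, assuming the temperatures are in Celsius (celsius_to_kelvin is a helper written for this example):

```r
# Hypothetical helper: shift Celsius so that zero means absolute zero.
celsius_to_kelvin <- function(celsius) celsius + 273.15

# How many times as hot is 40 degrees as 20 degrees?
heat_ratio <- celsius_to_kelvin(40) / celsius_to_kelvin(20)
heat_ratio    # about 1.07

# What Celsius temperature is twice as hot as 20 degrees?
twice_as_hot <- 2 * celsius_to_kelvin(20) - 273.15
twice_as_hot  # about 313 degrees
```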
Let’s take a look at the types of data in another dataframe. mtcars comes built into R.
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
What kinds of data do we see here?
If you use the str() function to look at the structure of the dataframe you’ll see that all of the variables are stored as numerics! Nonetheless it is evident that they are different kinds of data. A car has a whole number of cylinders (in this dataset only 4, 6, or 8), so the cyl variable is really ordinal: it has discrete values. Meanwhile mpg, wt, and qsec are examples of ratio data. Interestingly the am variable is categorical, even though it is stored as a numeric. It says whether the transmission is automatic (0) or manual (1).
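You can confirm all of this from the console. The sketch below works on a copy so the built-in mtcars is left alone. The names mt and transmission are made up for the example, and the 0/1 labels follow the mtcars help page.

```r
# Every column of mtcars is stored as a numeric.
all_numeric <- all(sapply(mtcars, is.numeric))
all_numeric

# cyl only ever takes three discrete values.
cyl_values <- sort(unique(mtcars$cyl))
cyl_values  # 4 6 8

# Recode am as the categorical variable it really is, on a copy.
mt <- mtcars
mt$transmission <- factor(mt$am, levels = c(0, 1),
                          labels = c("automatic", "manual"))
table(mt$transmission)
```

Turning am into a factor doesn’t change the information in it, but it stops R from happily taking a mean of zeros and ones when you didn’t intend to.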
Remember that data is best understood based on its properties rather than by looking at the choices made by the person who recorded it. They may have had all sorts of reasons to do what they did.
There is some advice often given about these various levels of measurement which says that the mode ought to be used for categorical data, the median for ordinal data, and the mean for interval or ratio data. Unfortunately these suggestions are unhelpful at best and probably misleading. We’ll go over why as we take a look at the measures of central tendency.