Structure of Data

Before we can really get a handle on concepts in data science, we need to understand the structure of data, both in R and in general. There are a lot of great datasets in the history of statistics, but for understanding the basics of data nothing beats information from a race. For this post we will look at the boston data we made in our introduction to dataframes.

Here are the contents of the dataframe.

boston
   Place    Sex   Time Country
1      1   Male  80.60     RSA
2      2   Male  81.23     JPN
3      3   Male  81.23     JPN
4      4   Male  84.65     SUI
5      5   Male  84.70     ESP
6      1 Female  95.10     USA
7      2 Female  97.40     JPN
8      3 Female  98.55     USA
9      4 Female  99.65     SUI
10     5 Female 101.70     GBR
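
If you don’t have the dataframe from the earlier post at hand, something like the following sketch should rebuild it. The values are copied from the printout above. The construction itself, including the use of stringsAsFactors = TRUE, is an assumption and not necessarily the code from the original post.

```r
# Rebuild the boston dataframe from the values printed above.
# stringsAsFactors = TRUE is needed in R 4.0 and later, where text
# columns are no longer converted to factors by default.
boston <- data.frame(
  Place   = rep(1:5, times = 2),                 # 1:5 gives integers
  Sex     = rep(c("Male", "Female"), each = 5),
  Time    = c(80.60, 81.23, 81.23, 84.65, 84.70,
              95.10, 97.40, 98.55, 99.65, 101.70),
  Country = c("RSA", "JPN", "JPN", "SUI", "ESP",
              "USA", "JPN", "USA", "SUI", "GBR"),
  stringsAsFactors = TRUE
)

print(boston)
```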

And here is the structure of the dataframe.

str(boston)
'data.frame':   10 obs. of  4 variables:
 $ Place  : int  1 2 3 4 5 1 2 3 4 5
 $ Sex    : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 1
 $ Time   : num  80.6 81.2 81.2 84.7 84.7 ...
 $ Country: Factor w/ 6 levels "ESP","GBR","JPN",..: 4 3 3 5 1 6 3 6 5 2

As you can see, there are several different kinds of data represented here. R identifies them as an integer, two factors, and a numeric, although these aren’t the names statisticians commonly use.

Let’s begin by looking at the factors.

In general, information that R stores as a factor is Categorical data. Both Country and Sex are categorical data. Being from Spain (ESP) is different from being from Switzerland (SUI), but that’s all we can say about the difference. Notice how R describes the factors.

Country: Factor w/ 6 levels "ESP","GBR","JPN",..: 4 3 3 5 1 6 3 6 5 2

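It is a factor with 6 levels, the first few of which are listed, followed by a list of numbers. In order to save space with huge datasets, R stores a factor as a list of level names (in alphabetical order by default) along with one integer per observation that says which level applies. The leading 4, for example, points at the fourth level, RSA.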
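
To see the two pieces separately, here is a small sketch that builds the Country column as a standalone factor and pulls out its levels and its integer codes. The variable names country and decoded are just for illustration.

```r
# Country column from the race data, built as a standalone factor
country <- factor(c("RSA", "JPN", "JPN", "SUI", "ESP",
                    "USA", "JPN", "USA", "SUI", "GBR"))

levels(country)       # the distinct names, sorted alphabetically
as.integer(country)   # one integer code per runner

# Looking each code up in the levels recovers the original names
decoded <- levels(country)[as.integer(country)]
print(decoded)
```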

Although they are both factors, there is a difference between Sex and Country. Sex is also a Dichotomous variable: it has exactly two possible values (since the Boston Athletic Association records sex only as female or male). Certain kinds of analysis can only be done with dichotomous variables, and many other kinds are impossible with them.
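
One quick way to check whether a factor is dichotomous is to count its levels. The sketch below does that for a standalone copy of the Sex column. The names sex, is_dichotomous, and is_male are illustrative, and the 0/1 recoding is only one common convention.

```r
# Sex column from the race data, built as a standalone factor
sex <- factor(rep(c("Male", "Female"), each = 5))

nlevels(sex)                          # number of distinct categories
is_dichotomous <- nlevels(sex) == 2
print(is_dichotomous)

# A dichotomous variable can be recoded as a 0/1 indicator
is_male <- as.integer(sex == "Male")
print(is_male)
```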

 

Next let’s take a look at the integer data. The Place variable is an example of Ordinal data: it has an order to it but no more information. The woman who came in first was ahead of the woman who came in second, but their ranks don’t tell us anything more than that. Ordinal data is also discrete. There is no such thing as 1.33rd place in a race, although there might be a tie, in which case several people get 1st.

Place  : int  1 2 3 4 5 1 2 3 4 5

What does it mean that R stores this kind of data as an integer? In computing there is an important difference between integers and what are known as floating-point numbers. Here the relevant difference is that integers take less space to store and are only whole numbers.
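
A short sketch of that difference, using only base R. The variable names are illustrative, and the size comparison assumes R’s usual storage of 4 bytes per integer and 8 bytes per double.

```r
place <- rep(1:5, times = 2)   # 1:5 produces integers
typeof(place)                  # "integer"
typeof(80.6)                   # "double", R's floating-point type

# A plain number such as 5 is a double unless written with an L suffix
is.integer(5)                  # FALSE
is.integer(5L)                 # TRUE

# Compare the memory used by 1000 integers and 1000 doubles
int_size <- as.numeric(object.size(integer(1000)))
dbl_size <- as.numeric(object.size(numeric(1000)))
print(int_size < dbl_size)
```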

 

Finally there is the Time variable, which R stores as a numeric. It is Interval data. This means that it has magnitude as well as rank. The woman in third took 3.45 minutes longer to finish than the woman who came in first, so we know both that the winner was faster and how much faster she was. Take a closer look and notice that the values are not all whole numbers.

Time   : num  80.6 81.2 81.2 84.7 84.7 ...

The Time variable is also an example of Ratio data, which means we can say things like “the fastest woman was 1.07 times as fast as the slowest woman”. This is possible because the zero point corresponds to an actual zero: a person with a time of 0.0 would finish the race instantly.
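
Both figures can be checked with a little arithmetic. This sketch copies the women’s times from the dataframe into a standalone vector, and the names women, gap, and ratio are illustrative.

```r
# Finishing times in minutes for the five women, in place order
women <- c(95.10, 97.40, 98.55, 99.65, 101.70)

# Interval property: differences are meaningful
gap <- women[3] - women[1]
round(gap, 2)      # 3.45 minutes between third and first

# Ratio property: because zero is a true zero, ratios are meaningful too
ratio <- max(women) / min(women)
round(ratio, 2)    # 1.07
```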

Isn’t that true of all interval data, though? No! Most temperature scales do not have a meaningful zero point. When it is 40°C out it isn’t twice as hot as when it is 20°C out, because 0°C doesn’t mean there is no heat at all, just that water will freeze. In fact, since absolute zero is -273.15°C, for it to be twice as hot as 20°C it would have to be over 300°C out!

Measured from absolute zero, 40°C is merely 1.07 times as hot as 20°C. Human beings happen to be incredibly sensitive to changes in temperature.
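
The temperature figures can be worked out the same way, assuming the temperatures are in degrees Celsius. The helper to_kelvin and the other names here are made up for this sketch.

```r
abs_zero <- -273.15               # absolute zero in degrees Celsius

# Convert Celsius to kelvin, a scale that starts at absolute zero
to_kelvin <- function(celsius) celsius - abs_zero

# Naive ratio on the Celsius scale
40 / 20                           # 2, but not physically meaningful

# Ratio measured from absolute zero
hot_ratio <- to_kelvin(40) / to_kelvin(20)
round(hot_ratio, 2)               # 1.07

# Celsius temperature that is twice as hot as 20 degrees
twice_as_hot <- 2 * to_kelvin(20) + abs_zero
round(twice_as_hot, 2)            # 313.15
```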

 

Let’s take a look at the types of data in another dataframe. mtcars comes built in to R.

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

What kinds of data do we see here?

If you use the str() function to look at the structure of the dataframe, you’ll see that all of the variables are stored as numerics! Nonetheless it is evident that they are different kinds of data. Cylinders come only in whole numbers (4, 6, or 8 in this dataset), so the cyl variable is discrete and is really ordinal. Meanwhile mpg, wt, and qsec are examples of ratio data. Interestingly, the am variable is categorical even though it is stored as a numeric: it says whether the transmission is automatic (0) or manual (1).
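
If you would rather have R’s storage match the kind of data, you can convert the columns yourself. The following is one possible sketch. The copy name mt and the labels "automatic" and "manual" are choices made for this example, and treating cyl as an ordered factor is only one reasonable option.

```r
# Every column of mtcars is stored as a numeric
sapply(mtcars, class)

# Work on a copy so the built-in dataset is left alone
mt <- mtcars

# am is categorical: 0 means automatic, 1 means manual
mt$am <- factor(mt$am, levels = c(0, 1),
                labels = c("automatic", "manual"))

# cyl takes only a few discrete values, so store it as an ordered factor
mt$cyl <- factor(mt$cyl, ordered = TRUE)

str(mt[, c("mpg", "cyl", "am")])
```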

 

Remember that data is best understood based on its properties rather than by looking at the choices made by the person who recorded it. They may have had all sorts of reasons to do what they did.

 

There is some advice often given about these various levels of measurement which says that the mode ought to be used for categorical data, the median for ordinal data, and the mean for interval or ratio data. Unfortunately these suggestions are unhelpful at best and probably misleading. We’ll go over why as we take a look at the measures of central tendency.
