Normalization

Here’s a simple but common problem: How do you compare two groups that have different scales?

Imagine we are giving assessment tests to several school classes, but all of the tests have been designed differently. Test A ranges from 0 to 100, in increments of a tenth of a point. Test B only offers scores between 0 and 12. Finally, Test C is scored between 200 and 800 for arcane reasons known only to its designers. Here are the data we collected; the vectors can be copied directly into R.

a <- c(69.6, 74.8, 62.7, 54.2, 66.8,
	48.7, 48.7, 67.8, 69.1, 85.1,
	51.2, 54.6, 53.9, 56.4, 51.5,
	73.5, 53.4, 76.1, 60.5, 62.7,
	54.9, 66.8, 54.3, 49.2, 57.3,
	61.3, 61.8, 70.4, 51.1, 53.4)
b <- c(3, 5, 5, 6, 7,
	4, 6, 8, 5, 4,
	5, 5, 6, 6, 5,
	4, 5, 7, 6, 6,
	2, 5, 5, 0, 6,
	5, 4, 5, 6, 4)
c <- c(560, 529, 593, 607, 627,
	597, 544, 531, 584, 510,
	420, 543, 577, 623, 531,
	598, 508, 555, 642, 490,
	596, 626, 597, 514, 602,
	569, 635, 551, 518, 617)

group <- c(rep('a',30),rep('b',30),rep('c',30))

It isn't easy to make comparisons between the three classes because the tests have such radically different scales. A look at the descriptive statistics shows how unhelpful the raw data are for purposes of comparison.

mean(a); mean(b); mean(c)
[1] 60.72
[1] 5
[1] 566.46

sd(a); sd(b); sd(c)
[1] 9.46
[1] 1.53
[1] 51.03

Personally, I think it's much more fun to plot the distributions.

dat <- data.frame(group = group, score = c(a,b,c))

library(ggplot2)

p <- ggplot(dat,aes(x=score,color=group))
p + geom_density(size=2) + 
	xlab('score')

[Density plot of the raw scores for the three groups]

Yeah, not helpful.

One method of normalization is simply to report each score as a fraction of its possible range; this is called non-dimensional (or min-max) normalization, since it makes no assumptions about how the data are distributed. To do this we subtract the minimum possible score and divide by the range, that is, the maximum possible score minus the minimum.

aa <- (a-0)/(100-0)
bb <- (b-0)/(12-0)
cc <- (c-200)/(800-200)

dat <- data.frame(group = group, score = c(aa,bb,cc))

p <- ggplot(dat,aes(x=score,color=group))
p + geom_density(size=2) + 
	xlab('normalized score') + 
	scale_x_continuous(limits=c(0,1))

Now that they are all on a common 0-to-1 scale we can compare them visually.

[Density plot of the min-max normalized scores]

Notice that the shape of the distributions has not changed (you can see this most clearly with group a); only the scale has. Now we can make some actually meaningful comparisons.

mean(aa); mean(bb); mean(cc)
[1] 0.607
[1] 0.416
[1] 0.611
sd(aa); sd(bb); sd(cc)
[1] 0.094
[1] 0.127
[1] 0.085
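This rescaling is easy to wrap in a small helper of our own (min_max is just a name I've picked for this sketch, not a function from any package) that takes the minimum and maximum possible scores as arguments:

```r
# Rescale x from the known possible range [lo, hi] to the interval [0, 1].
min_max <- function(x, lo, hi) {
	(x - lo) / (hi - lo)
}

# A few scores from test A, which ranges from 0 to 100.
min_max(c(69.6, 48.7, 85.1), 0, 100)
# [1] 0.696 0.487 0.851
```

Writing the function in terms of the *possible* range, rather than the observed min(x) and max(x), keeps scores comparable across classes that happened not to hit the extremes.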

The most popular form of dimensional normalization is the z-score, or standard score, which gives the data certain characteristics of a standard normal distribution. Creating z-scores means transforming the data so that the mean equals 0 and the standard deviation equals 1. This is actually pretty easy to do: for each data point we subtract the mean and divide by the standard deviation.

# We make a function of our own to do it quickly.
normalize <- function(x) {
	(x-mean(x))/sd(x)
}
aa <- normalize(a)
bb <- normalize(b)
cc <- normalize(c)
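A quick sanity check with a made-up vector confirms the transformation does what we claimed: the result has mean 0 and standard deviation 1 (the mean is zero only up to floating-point error):

```r
normalize <- function(x) {
	(x-mean(x))/sd(x)
}

z <- normalize(c(3, 5, 5, 6, 7))
mean(z)  # effectively 0 (floating-point noise)
sd(z)    # 1
```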

This isn't as immediately intuitive as calculating a percentage, but it does have some advantages. One immediate advantage is that the sign is meaningful: a z-score less than 0 means the score is worse than the group average, and a z-score greater than 0 means it is better than average.
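For example, using the group means and standard deviations reported above: a student who scored 560 on test C is slightly below that class's average, while a student who scored 5 on test B is exactly average, even though 560 looks far more impressive at first glance.

```r
# Test C: mean 566.46, sd 51.03 (from the summary statistics above)
(560 - 566.46) / 51.03
# [1] -0.1265922

# Test B: mean 5, sd 1.53
(5 - 5) / 1.53
# [1] 0
```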

dat <- data.frame(group = group, score = c(aa,bb,cc))

p <- ggplot(dat,aes(x=score,color=group))
p + geom_density(size=2) + 
	xlab('normalized score')

[Density plot of the z-scores for the three groups]

Z-scores are used to compute p-values as part of the Z-test. We will use the Z-test as a lead-in to the t-test.
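As a preview, here is a sketch of the key step: converting a z-score into a two-sided p-value with pnorm(), the cumulative distribution function of the standard normal.

```r
z <- 1.96  # the classic critical value for a 5% significance level

# Two-sided p-value: probability of seeing a value at least this
# extreme in either tail of the standard normal distribution.
2 * pnorm(-abs(z))
# [1] 0.04999579
```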

Normalization is also the term used for rescaling a distribution so that it integrates (or sums) to 1, thus becoming a probability distribution. In fact, this process is so important that several probability distributions are named after the function that normalizes them, like the binomial distribution (the binomial theorem guarantees the probabilities sum to 1), the gamma distribution (the gamma function), and the beta distribution (the beta function).
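To illustrate with the most familiar case (a sketch using the standard normal rather than the distributions just named): the unnormalized bell curve exp(-x^2/2) integrates to sqrt(2*pi), and dividing by that normalizing constant yields a density that integrates to 1.

```r
f <- function(x) exp(-x^2 / 2)        # unnormalized bell curve
integrate(f, -Inf, Inf)$value         # sqrt(2 * pi), about 2.5066

g <- function(x) f(x) / sqrt(2 * pi)  # the standard normal density
integrate(g, -Inf, Inf)$value         # 1
```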
