Binomial Distribution – Equation

Last time we looked at the binomial distribution it was purely to understand the behavior of a random variable. This time we’re going to look at the equation that defines it.

Here it is:

\displaystyle \binom{n}{k}p^k(1-p)^{n-k}

All of this notation should by now be familiar. The first part is a binomial coefficient and the rest is basic arithmetic. The variable n is the sample size and the variable p is the probability of success. Meanwhile k is the number of successes, which can be any whole number from 0 to n. We evaluate the equation at each value of k in order to create the distribution.
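As a quick sanity check, evaluating the equation at every k from 0 to n should account for every possible outcome, so the probabilities have to sum to one. A minimal sketch with R’s built-in dbinom(), which implements this equation, using n = 5 and p = 0.3:

## The probabilities across every possible number of successes
## should sum to exactly 1.
sum(dbinom(0:5, 5, 0.3))
 1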

So why is this the equation?

I am not in the business of writing proofs, many good ones exist, but rather of producing intuition. The binomial coefficient is there because we are choosing from a set: it counts the number of different orderings that give the same number of successes, which converts the rest of the equation, the probability of one particular sequence, into a meaningful total probability.

Let us flip one coin three times. The probability that the coin will come up heads all three times is p^3 since it has probability p each time. Alternatively the term could be expanded to p*p*p. When we plug this into the equation:

\displaystyle \binom{3}{3}p^3(1-p)^{3-3}

Which can then be solved to become:

1\times p^3(1-p)^0
. . . which is . . .
1\times p^3\times1
. . . which is . . .
p^3

So it gives us the right answer.

How about the probability of getting two heads? The probability of getting heads twice is p^2 and the probability of getting one tails is (1-p)^1 so we multiply them together to find the probability. The result can be expanded to p*p*(1-p). This is not the total probability of getting two heads and a tails, however; it is the probability of one particular sequence of results.

What is the difference?

There are three equally likely orderings that produce this result (HHT, HTH, and THH) so the probability (p^2)*(1-p) occurs three times! The equation is:

\displaystyle \binom{3}{2}p^2(1-p)^{3-2}

Which can then be solved to become:

\displaystyle 3\times p^2(1-p)^{1}

Exactly what we predicted.
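We can check this against R’s dbinom(), which implements the binomial equation directly. A minimal sketch with a fair coin, p = 0.5:

## Probability of exactly 2 heads in 3 flips of a fair coin,
## first by our expression and then by dbinom().
3 * 0.5^2 * (1 - 0.5)
 0.375
dbinom(2, 3, 0.5)
 0.375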

The Hypergeometric Distribution

This is a distribution you can impress your friends with because it has such an absolutely awesome sounding name and a comparatively simple form.

While the binomial distribution tells us about independent events the hypergeometric distribution tells us about a particular kind of dependent event, sampling without replacement. If you reach into a bag and take out a marble then reach in and take out another marble you are sampling without replacement. Like the binomial and Bernoulli distributions the hypergeometric describes events in terms of success and failure.

Here is our scenario in mathematical terms:

Randomly select n items from a set of N items. In this set there are K possible successes and N-K possible failures (which simply means everything is either a success or a failure). The probability of picking k successes from the set of K will follow the hypergeometric distribution, which has this form:

\displaystyle \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}

Why? Let’s take a look.

Here is a random set from which we will pick items. As per ancient statistical tradition consider it an urn containing white and black balls.

WBBBWBWBBWBBW

We will pick four balls from the urn and want to know how likely it is that exactly three of them will be black.

Here are our variables.

N = 13 # There are 13 items
K = 8  # 8 of them are black
n = 4  # We are selecting four items
k = 3  # Looking for 3 black

First we will describe how many ways there are to choose 3 black balls from a set of 8, because there are only 8 black balls in the set. Choose? That sounds like a binomial coefficient so let’s write KCk. Now how many ways are there to draw 1 white ball from a set of 5 white balls? That’s another binomial coefficient! We write it as (N-K)C(n-k) because N-K is the number of non-black balls and n-k is the number of times we can pick a non-black ball.

Now we can simply multiply these together to get the number of correct combinations! There are 280 possible ways.

That’s not a probability, though, so we need to normalize it. How to do that? Well, how many ways are there to pick a combination from the set? That must be NCn since it is a set of N and we are choosing n items in total. That comes to 715.
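Both counts are easy to verify with choose():

## Ways to pick 3 of the 8 black balls and 1 of the 5 white balls.
choose(8, 3) * choose(5, 1)
 280
## Ways to pick any 4 balls from the full set of 13.
choose(13, 4)
 715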

So there are 280 correct results out of 715 total options. That’s just a fraction!

280/715
 0.3916084

There is roughly a 39% chance of us picking the particular combination of black and white balls that we want. And since we know that 280 is really KCk times (N-K)C(n-k) and that 715 is really NCn we can see where the equation comes from.

We can also summon up the hypergeometric distribution in R with the dhyper() family of functions.

dhyper(3,8,5,4)
 0.3916084

With so many variables there are tons of hypergeometric distributions but here’s a simple one that shows the probability for between zero and four black balls chosen from the urn.

Successes <- 0:4
plot(dhyper(Successes,8,5,4)~Successes,
	type='h', lwd=3,
	ylim=c(0,.5), ylab='Probability',
	main='A Hypergeometric Distribution')

[Figure: A Hypergeometric Distribution]

Obviously the hypergeometric distribution is good for more than placing bets on Jacob Bernoulli’s less interesting party games or we wouldn’t be talking about it. The hypergeometric can be used to count cards. Perhaps most famously, however, it was used by Sir Ronald Fisher to devise a statistical test. He was given the task of testing the claim that fellow biologist Muriel Bristol-Roach could tell whether her tea was poured into the cup before the milk or after. The idea of the test was that the distribution told him how many correct identifications could plausibly be produced by pure guessing in the experiment he designed.

Doctor Bristol-Roach was presented with eight cups of tea to taste, some poured milk first and some not. The lady then tasted them one by one, reporting which type she believed each cup was. History records that the well-bred biologist was more than validated by the experiment.
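In the usual telling of the experiment there were four cups of each type and the lady knew it; that detail is my assumption here, since the design isn’t spelled out above. Under that setup the chance of identifying all four milk-first cups correctly by pure guessing comes straight from dhyper():

## Probability of correctly naming all 4 milk-first cups when
## choosing 4 from 8 (4 milk-first, 4 tea-first) at random.
dhyper(4, 4, 4, 4)
 0.01428571

At about 1.4%, a perfect score by guesswork alone would have been very unlikely.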

Binomial Coefficient

The binomial coefficient is ubiquitous in probability theory, especially the study of discrete probabilities. It looks like this:

\displaystyle \binom{n}{k}

Because its notation is unfamiliar it can make an otherwise comprehensible equation appear to be nonsense. Traditionally the binomial coefficient represents how many different ways you can pick k objects from a set of n objects. Consequently it is read as “n choose k”. This is why R uses the choose() function to denote it.

Let’s look at some examples.

# How many ways to choose 5 from a set of 10?
choose(10,5)
 252
# How many ways to choose 3 from a set of 7?
choose(7,3)
 35

Outside of probability and statistics the binomial coefficient is sometimes written as nCk and is read the same way.

For pairs it is easy to represent every possible combination.

Let’s look at 5C2. The easiest way to determine this by hand is to write out a triangle that contains every possible pairing. The first row is all of the pairings that contain a 1; there are four of these. Next come the pairings that contain a 2; there are also four of these but we already wrote the pairing 1-2 in the first row. Then come the pairings that contain a 3, and then the ones that contain a 4. By the time we get to “all the pairings that contain a 5” there are none left; each of those pairs has already been written.

12, 13, 14, 15
23, 24, 25
34, 35
45

choose(5,2)
 10
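R can also list the combinations themselves with combn(), which returns one combination per column:

## Every way to choose 2 items from a set of 5.
combn(5, 2)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    1    1    1    2    2    2    3    3     4
[2,]    2    3    4    5    3    4    5    4    5     5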

How about more generally? Making a construct like this for something other than a pair isn’t nearly as easy, and it takes much more time.

The equation you will most often find defining the binomial coefficient is n!/(k!(n-k)!), where the “!” symbol indicates the factorial operation rather than excitement. A factorial is a number multiplied by all of the smaller natural numbers. For example:

5*4*3*2*1
 120
factorial(5)
 120

Why does n!/(k!(n-k)!) produce the correct answer?

This is a question that has always bothered me because people tend not to explain it, so I’d like to present a simple proof of the equation that defines the binomial coefficient. As you might have guessed I like to think in statistical terms so this proof will be based on probability, namely: what is the probability of picking any given combination? First notice that all combinations of the same size are equally likely, so if we can find the probability of a given combination we know the probability of all of them.

Let’s say we want a combination of three letters from a set of nine, our target is to get A, B, and C.

## The first nine letters of the alphabet.
LETTERS[1:9]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I"

When we pick the first element there are 3 possible successes out of 9 (A or B or C, since order doesn’t matter). If we do pick correctly there are now 2 possible successes out of 8 (since one of the correct letters is gone). Finally, if we pick correctly the second time there is 1 possible success out of 7.

Thus the probability of us drawing ABC, in some order, must be:

3/9 * 2/8 * 1/7
 0.0119047619 

That’s not very easy to read but fortunately the reciprocal will be. Since all combinations are equally likely, the probability of any one combination is one divided by the total number of them, so the reciprocal will tell us exactly the number of combinations.

1 / (3/9 * 2/8 * 1/7)
 84

Just to make sure I haven’t led you down the garden path let’s check the respectable statistical software.

choose(9,3)
 84

Awesome!

The final step is to consider what the equation 1 / (3/9 * 2/8 * 1/7) represents. In essence we’re flipping over each of those fractions. This means that we can extract the number of possible combinations in one step by just doing that.

9/3 * 8/2 * 7
 84

Now we will return to the formula n!/(k!(n-k)!) and see how that excitable equation results in the one we found. If we expand the factorials for 9 choose 3 we get:

\displaystyle \frac{9\times8\times7\times6\times5\times4\times3\times2\times1}{(3\times2\times1)(6\times5\times4\times3\times2\times1)}

You may recall that in equations like this common terms “cancel out” and here you can distinctly see the ones that will. All that’s left is:

\displaystyle \frac{9\times8\times7}{3\times2\times1}
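A final numerical check, sketched with R’s factorial(), confirms the formula agrees with everything above:

## 9!/(3! * 6!) should match choose(9,3) and our fraction.
factorial(9) / (factorial(3) * factorial(6))
 84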

Quod Erat Demonstrandum

The Volley Problem

Here’s a question for you. A line of musketeers, ten abreast, is firing on the enemy. Each man has a 10% chance of hitting his target at this range. What is the probability that at least one of them will hit in the first volley? This is a traditional sort of problem in logic and probability. To some people it seems as though there ought to be a 100% chance, since that’s ten men at 10% each, but some further thought reveals that since there is some chance of each of them missing there must be some chance of all of them missing.

We calculate the result as 100% minus the probability of every musketeer missing, which is the chance of a single miss (90%) raised to the power of the number of musketeers (10).

1-(.9^10)
 0.6513216

So about a 65% chance of at least one of them hitting.

That’s not a super informative result since it doesn’t tell us anything like what the most likely number of hits is. Consider this, however: shooting at a target is a Bernoulli trial, just like the flipping of a coin. It has exactly two outcomes, hit or miss, and a particular probability. Thus the successes of the musketeers ought to follow a binomial distribution (both logically and because I have a theme going). Indeed they do; the distribution of their volleys should look like this:

[Figure: Binomial(10, 0.1) distribution of hits per volley]

We see immediately that the most likely outcome is that 1 of them will hit and that it is almost inconceivable that more than 4 of them will hit. Musketeers were not a terribly efficient military machine, although they were quite deadly en masse.

Say you are the commander of this line and you’re not happy with the atrocious accuracy of these soldiers. You want a 90% chance of getting at least one hit. How much better do the musketeers have to be?

Thanks to computers the easiest, and most general, method is to just test many possible values.

## Many candidate values of q, the probability of a miss
q <- seq(.9,.5,-.01)
## The corresponding values of p.t, the total probability
## of a hit for the musketeers as a group.
p.t <- 1-(q^10)

## We set a equal to the smallest value of p.t that is equal
## to or greater than 0.9 and then check what is on that
## line.
dat <- data.frame(q,p.t)
a <- min(dat$p.t[p.t >= .9])
dat[dat$p.t==a,]
      q         p.t
12 0.79 0.9053172

Recalling that for Bernoulli trials p = 1 – q we conclude that the probability of hitting on each shot is 0.21, or a 21% chance. The soldiers need to double their accuracy. You will also notice that if you make a binomial distribution with the new value of p the most likely outcome is now that two of the soldiers will hit.
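That last claim is quick to sketch with dbinom():

## The number of hits with the highest probability when p = 0.21.
## which.max() returns the index, so subtract 1 to get the count.
which.max(dbinom(0:10, 10, 0.21)) - 1
 2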

For safety’s sake let’s work out the analytical solution to the problem in order to check our answer.

.9 = 1-(x^10)
.9+(x^10) = 1
x^10 = .1
x = .1^.1
x = .794
1-.794
.206

Yep, that’s basically the same answer.

What if we tried to solve our problem the other way around? Historically musket lines were enormous and successful in part because it was easier to hand a musket to ten people than to train one person to be a really good shot with one. How many musketeers need to be on the line to have a 90% chance that at least one of them will hit?

We could, of course, use the same method as we did before and simply recalculate for larger and larger numbers of musketeers. If we did we would end up with a cumulative geometric distribution, a relative of the binomial.

[Figure: How Many Musketeers Do We Need? (cumulative geometric distribution)]

Like the binomial the geometric distribution is discrete, it has values only at whole numbers.

If there is one musketeer the probability of hitting is 0.100, if there are two the probability is 0.190 (not quite double), if there are three the probability is 0.271, and so on. Eventually, with 22 musketeers, the probability rises to fully 0.902 and we’re done.
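The same count falls out of solving 1 - 0.9^n >= 0.9 for n directly; a minimal sketch:

## Smallest whole number of musketeers with at least a 90% chance.
ceiling(log(0.1) / log(0.9))
 22
## 21 falls just short of the target while 22 clears it.
1 - 0.9^c(21, 22)
 0.8905810 0.9015229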

The plain geometric distribution itself is the probability that it will take exactly N trials to get the first success. Here is the flipping of a fair coin:

[Figure: Geometric distribution for a fair coin]

The probability of getting heads on the very first flip is 50%, the probability that the first heads arrives on the second flip (tails then heads) is 25%, and so on. Summing these values is why the cumulative distribution shows us the probability of having gotten a heads by flip N.
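R’s dgeom() encodes the same idea, with one wrinkle worth flagging: it counts the failures before the first success rather than the total number of flips, so flip N corresponds to N - 1 tails:

## Probability the first heads arrives on flips 1 through 4
## (that is, after 0 through 3 tails) with a fair coin.
dgeom(0:3, 0.5)
 0.5000 0.2500 0.1250 0.0625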

Here is the code for our graphics in this post.

The binomial distribution. This is very simple but I include it because the xaxp argument is so rarely used. R creates tick marks automatically but not the ones I wanted in this case. The argument tells R the position of the first tick mark, the position of the last one, and the number of intervals between them.

Hits <- 0:10
Probability <- dbinom(0:10,10,.1)
volley <- data.frame(Hits,Probability)

plot(volley,type='h',lwd=5,xaxp=c(0,10,10))

The cumulative geometric distribution.

## When the escape sequence \n is used in a string R will read it as
## creating a new line.
## pgeom() counts failures before the first success, so pgeom(n-1,.1)
## is the probability that n musketeers score at least one hit.
plot(1:25, pgeom(0:24,.1), type='h', lwd=3,
	main='How Many Musketeers \n Do We Need?',
	xlab='Musketeers', ylab='Probability',
	ylim=c(0,1))
## Create a horizontal dashed line at the 90% target.
abline(h=.9, lty=2)
## Draw a red line that covers column 22, the first to reach it.
segments(22, 0, 22, 0.9015229, col='red', lwd=3.5)

The Binomial Distribution

The binomial distribution is not quite like the normal distribution or the uniform distribution. While those are continuous the binomial is discrete. The binomial distribution has no value at all at 2.5 or 3.7614 or any decimal, only at whole numbers.

In particular the binomial distribution describes the probable outcomes of an event that can have either a success or a failure (getting heads on a coin, rolling a particular number on a die), also known as a Bernoulli trial. We care about the binomial because it is the basis of the simplest test of statistical significance and makes a good introduction to the concept. The defining parameters are the number of trials and the probability of success (or heads, or whatever).

Here is what a Binomial(4,0.5) distribution looks like. It represents flipping a fair coin four times.

trials <- 4
prob <- .5
comb <- 0:trials
bin <- dbinom(comb,trials,prob)

dat <- data.frame(comb,bin)

[Figure: Binomial(4, 0.5) distribution]

What you are looking at here is a probability mass plot. Unlike the densities we’ve looked at before, the height of each bar is the probability of getting that value.

Understanding where this comes from is very easy:

There is one way to get zero heads (TTTT).
There are four ways to get one heads. (HTTT, THTT, TTHT, TTTH)
There are six ways to get two heads. (HHTT, TTHH, HTHT, THTH, HTTH, THHT)
There are four ways to get three heads. (THHH, HTHH, HHTH, HHHT)
There is one way to get four heads (HHHH).

Of course there is a somewhat complicated equation that defines the binomial distribution but for our purposes it is less important than the intuition of what it is.

Now let’s expand it to a larger number of flips, as if we were performing an actual experiment. Here is a Binomial(30,0.5) distribution.

[Figure: Binomial(30, 0.5) distribution]

Notice how it now appears similar in shape to a normal distribution! This is more than just a visual similarity; a binomial distribution with an extremely large number of trials is one of the many things that approximates a normal distribution.
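A rough sketch of how close the match already is at thirty trials: a normal distribution with the binomial’s mean (np) and standard deviation (the square root of np(1-p)) gives nearly the same value at the peak.

## Binomial(30, 0.5) at its center versus a normal density with
## mean 30*0.5 = 15 and sd sqrt(30*0.5*0.5) = sqrt(7.5).
round(dbinom(15, 30, 0.5), 4)
 0.1445
round(dnorm(15, mean = 15, sd = sqrt(7.5)), 4)
 0.1457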

Here is the graphics code rolled up into a function that gives us nice axes and a descriptive title. We can even choose to make it show the cumulative distribution or the inverse cumulative function.

binom.chart <- function(n,p,type="mass") {
	require(ggplot2)
	trials <- n
	prob <- p
	comb <- 0:n
	if(type=="mass") {
		bin <- dbinom(comb,n,p)
	}
	if(type=="cumulative") {
		bin <- pbinom(comb,n,p)
	}
	if(type=="inverse") {
		bin <- 1-pbinom(comb,n,p)
	}
	dat <- data.frame(comb,bin)

	p <- ggplot(dat, aes(x=comb,y=bin)) +
		geom_point(size=4) + 
		geom_linerange(aes(ymax=bin,ymin=0),size=1) +
		labs(x="Number of Successes",
			y="Probability",
			title=paste0("Binomial(",trials,",",prob,")"))
	print(p)
	out <- data.frame(heads = comb, prob = bin)
	out$prob <- as.numeric(format(out$prob,digits=3,scientific=F))
	invisible(out)
}

A discussion of the equation that defines the binomial distribution can be found in a later post.

The Normal Distribution

There’s not much to say about the uniform distribution, it’s very simple after all, but the normal distribution has been studied extensively. Why? Because quite a lot of things are approximately normally distributed. Heights and weights are popular examples. Relatively few physical things are strictly normally distributed, though, since there are practical limitations on them (i.e. no one weighs -10 pounds or 2,000 pounds).

In general approximately normal distributions are common because they tend to arise when there are many random influences on a value.

One useful piece of information is the so-called empirical rule. For any normal distribution (no matter the parameters) 68% of the probability is within one standard deviation of the mean, 95% is within two standard deviations, and 99.7% is within three standard deviations. This is a quick and well known reminder that values drawn from normal distributions rarely stray far from the mean.
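The rule is easy to confirm with pnorm(), the cumulative distribution function of the normal:

## Probability within 1, 2, and 3 standard deviations of the mean.
diff(pnorm(c(-1, 1)))
 0.6826895
diff(pnorm(c(-2, 2)))
 0.9544997
diff(pnorm(c(-3, 3)))
 0.9973002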

Let’s simulate some normally distributed data and see if we can get any really extreme values. Using range() will return the minimum and maximum values.

range(rnorm(1000000,0,1))
-4.596342  4.855854

With one million random numbers the most extreme value I got was not quite five standard deviations from the mean. Keep in mind that any number is possible.
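That extreme is about what the tail probabilities predict. A back-of-the-envelope check with pnorm(), using the 4.6 standard deviation cutoff from my run above:

## Expected number of draws beyond 4.6 standard deviations
## (in either direction) among one million samples.
round(2 * pnorm(-4.6) * 1e6)
 4

So a handful of values out past 4.5 standard deviations is exactly what we should expect, while anything beyond 5 would be a mild surprise.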

The thing that makes the normal distribution so important is that it is connected to the central limit theorem, which we’ll take a look at in a little while. The CLT is a mathematical proof that, under certain conditions, many things will follow an approximately normal distribution. This is very fortunate for us! If the world did not tend to be approximately normal we’d be in serious trouble when trying to estimate anything because samples could easily have characteristics wildly divergent from the population they come from.

Do keep in mind that this behavior is approximate. Many people argue that because it is nearly impossible to exactly meet the requirements of the CLT, most things in life are not quite normally distributed. A lot of statistics was founded in studies of gambling, where everything is controlled to within tight tolerances and assumptions like those of the CLT can be precisely met.

Nonetheless we’ll see in a few posts that while it is important to be aware of the difference between normal and approximately normal we can still get a lot of mileage out of the normal distribution thanks to the CLT.

Next time we’re going to take a look at the binomial distribution and hypothesis testing which will lead into how the normal distribution gets used.