Binomial Distribution – Equation

Last time we looked at the binomial distribution it was purely to understand the behavior of a random variable. This time we’re going to look at the equation that defines it.

Here it is:

\displaystyle \binom{n}{k}p^k(1-p)^{n-k}

All of this notation should by now be familiar. The first part is a binomial coefficient and the rest is basic math. The variable n is the number of trials and p is the probability of success on each trial. Meanwhile k is the number of successes, which can be any whole number from 0 to n. We evaluate the probability at each value of k in order to create the distribution.

So why is this the equation?

I am not in the business of writing proofs (many good ones exist) but rather of producing intuition. The binomial coefficient is there because we are choosing from a set: it counts the number of distinct orderings in which the successes can occur, converting the probability of one particular sequence into a meaningful total probability.

Let us flip one coin three times. The probability that the coin will come up heads all three times is p^3 since it has probability p each time. Alternatively the exponential term can be expanded to p*p*p. When we plug this into the equation:

\displaystyle \binom{3}{3}p^3(1-p)^{3-3}

Which can then be solved to become:

1\times p^3(1-p)^0
. . . which is . . .
1\times p^3\times1
. . . which is . . .
p^3

So it gives us the right answer.

How about the probability of getting two heads? The probability of getting heads twice is p^2 and the probability of getting one tails is (1-p)^1, so we multiply them together to find the probability. The result can be expanded to p*p*(1-p). This is not the total probability of getting two heads and a tails, however; it is the probability of one particular sequence of them, such as heads-heads-tails.

What is the difference?

There are three equally likely orderings that give this result (HHT, HTH, and THH) so we have a probability of (p^2)*(1-p) three times! The equation is:

\displaystyle \binom{3}{2}p^2(1-p)^{3-2}

Which can then be solved to become:

\displaystyle 3\times p^2(1-p)^{1}

Exactly what we predicted.
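
As a quick sanity check we can compare the hand calculation to R's dbinom(), here assuming a fair coin with p = 0.5:

## Three flips, two heads, by hand and by dbinom().
p <- 0.5
choose(3,2)*p^2*(1-p)
 0.375
dbinom(2,3,p)
 0.375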

The Hypergeometric Distribution

This is a distribution you can impress your friends with because it has such an absolutely awesome-sounding name and a comparatively simple form.

While the binomial distribution tells us about independent events the hypergeometric distribution tells us about a particular kind of dependent event, sampling without replacement. If you reach into a bag and take out a marble then reach in and take out another marble you are sampling without replacement. Like the binomial and Bernoulli distributions the hypergeometric describes events in terms of success and failure.

Here is our scenario in mathematical terms:

Randomly select n items from a set of N items. In this set there are K possible successes and N-K possible failures (which simply means everything is either a success or a failure). The probability of picking k successes from the set of K will follow the hypergeometric distribution. Which has this form:

\displaystyle \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}

Why? Let’s take a look.

Here is a random set from which we will pick. As per ancient statistical tradition, consider it an urn containing white and black balls.

WBBBWBWBBWBBW

We will pick four balls from the urn and want to know how likely it is that exactly three of them will be black.

Here are our variables.

N = 13 # There are 13 items
K = 8  # 8 of them are black
n = 4  # We are selecting four items
k = 3  # Looking for 3 black

First we will describe how many ways there are to choose 3 black balls from a set of 8, because there are only 8 black balls in the set. Choose? That sounds like a binomial coefficient, so let's write KCk. Now how many ways are there to draw 1 white ball from a set of 5 white balls? That's another binomial coefficient! We write it as N-KCn-k because N-K is the number of non-black balls and n-k is the number of times we pick a non-black ball.

Now we can simply multiply these together to get the number of successful combinations! There are 280 possible ways.

That’s not a probability, though, so we need to normalize it. How to do that? Well, how many ways are there to pick a combination from the set? That must be NCn since it is a set of N and we are choosing n items in total. That comes to 715.
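
Both counts are easy to confirm with choose():

choose(8,3)*choose(5,1)  ## ways to pick 3 black and 1 white
 280
choose(13,4)             ## ways to pick any 4 of the 13 balls
 715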

So there are 280 correct results out of 715 total options. That's just a fraction!

280/715
 0.3916084

There is a roughly 39% chance of us picking the particular combination of black and white balls that we want. And since we know that 280 is really KCk times N-KCn-k and that 715 is really NCn we can see where the equation comes from.

We can also summon up the hypergeometric distribution in R with the dhyper() family of functions.

dhyper(3,8,5,4)  ## arguments are k, K, N-K, n in our notation
 0.3916084

With so many variables there are tons of hypergeometric distributions, but here's a simple one that shows the probability of choosing between zero and four black balls from our urn.

Successes <- 0:4
plot(dhyper(Successes,8,5,4)~Successes,
	type='h', lwd=3,
	ylim=c(0,.5), ylab='Probability',
	main='A Hypergeometric Distribution')

[Figure: A Hypergeometric Distribution]

Obviously the hypergeometric distribution is good for more than placing bets on Jacob Bernoulli's less interesting party games or we wouldn't be talking about it. The hypergeometric can be used to count cards. Perhaps most famously, however, it was used by Sir Ronald Fisher to devise a statistical test. He was given the task of testing the claim that fellow biologist Muriel Bristol-Roach could tell whether her tea was poured into the cup before the milk or after. The distribution told him how likely each possible number of correct guesses would be, by pure luck, in the experiment he designed.

Doctor Bristol-Roach was presented with eight cups of tea to taste, some poured milk-first and some not. The lady then tasted them one by one, reporting which type she believed each cup was. History records that the well-bred biologist was more than validated by the experiment.
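
In the classic account of the design (an assumption on my part; the details vary by telling) there were four cups of each kind and the lady knew the split, so the chance of identifying all four milk-first cups by pure luck comes straight from dhyper():

## 4 successes drawn, with 4 successes and 4 failures in the urn and 4 draws.
dhyper(4,4,4,4)
 0.01428571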

Binomial Coefficient

The binomial coefficient appears throughout probability theory, especially the study of discrete probabilities. It looks like this:

\displaystyle \binom{n}{k}

Because its notation is unfamiliar it can make an otherwise comprehensible equation appear to be nonsense. Traditionally the binomial coefficient represents how many different ways you can pick k objects from a set of n objects. Consequently it is read as “n choose k”. This is why R uses the choose() function to denote it.

Let's look at some examples:

# How many ways to choose 5 from a set of 10?
choose(10,5)
 252
# How many ways to choose 3 from a set of 7?
choose(7,3)
 35

Outside of probability and statistics the binomial coefficient is sometimes written as nCk and is read the same way.

For pairs it is easy to represent every possible combination.

Let's look at 5C2. The easiest way to determine this by hand is to write out a triangle that contains every possible pairing. The first row is all of the pairings that contain a 1; there are four of these. Next are the pairings that contain a 2; there are also four of these, but we already wrote the pairing 1-2 in the first row. Then come the pairings that contain a 3, and then the ones that contain a 4. By the time we get to “all the pairings that contain a 5” there are none left; each of those pairs has already been written.

12, 13, 14, 15
23, 24, 25
34, 35
45

choose(5,2)
 10

How about more generally? Making a construct like this for something other than a pair isn't nearly as easy, and it takes more time.

The equation you will most often find that defines the binomial coefficient is n!/(k!(n-k)!), where the “!” symbol indicates the factorial operation rather than excitement. A factorial is a number multiplied by all of the smaller natural numbers. For example:

5*4*3*2*1
 120
factorial(5)
 120

Why does n!/(k!(n-k)!) produce the correct answer?

This is a question that has always bothered me because people tend not to explain it, so I'd like to present a simple proof of the equation that defines the binomial coefficient. As you might have guessed, I like to think in statistical terms, so this proof will be based on probability, namely: what is the probability of picking any given combination? First notice that all combinations of the same size are equally likely, so if we can find the probability of one combination we know the probability of all of them.

Let's say we want a combination of three letters from a set of nine; our target is to get A, B, and C.

## The first nine letters of the alphabet.
LETTERS[1:9]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I"

When we pick the first element there are 3 possible successes out of 9 (A or B or C, since order doesn't matter). If we do pick correctly there are now 2 possible successes out of 8 (since one of the correct letters is gone). Finally, if we pick correctly the second time there is 1 possible success out of 7.

Thus the probability of us drawing ABC, in some order, must be:

3/9 * 2/8 * 1/7
 0.0119047619 

That’s not very easy to read but fortunately the reciprocal will be. In fact it will tell us exactly the number of combinations.

1 / (3/9 * 2/8 * 1/7)
 84

Just to make sure I haven't led you down the garden path let's check the respectable statistical software.

choose(9,3)
 84

Awesome!

The final step is to consider what the expression 1 / (3/9 * 2/8 * 1/7) represents. In essence we're flipping over all of those fractions, which means we can extract the number of possible combinations in one step by just doing that.

9/3 * 8/2 * 7
 84

Now we will return to the formula n!/(k!(n-k)!) and see how that excitable equation results in the equation we found. If we expand the factorials we get:

\displaystyle \frac{9\times8\times7\times6\times5\times4\times3\times2\times1}{(3\times2\times1)(6\times5\times4\times3\times2\times1)}

You may recall that in equations like this common terms “cancel out” and here you can distinctly see the ones that will. All that’s left is:

\displaystyle \frac{9\times8\times7}{3\times2\times1}
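
In R the surviving terms can be computed directly with prod():

prod(9:7)/prod(3:1)  ## (9*8*7)/(3*2*1)
 84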

Quod Erat Demonstrandum

Bayesian Binomial Comparison

A while back we took a look at binomial tests. The two variations we looked at (one traditional and one Bayesian) were both single-sample tests: they compared the behavior of the coin to that of a hypothetical coin with a known bias. In the field of making fair dice and fair coins this is a pretty reasonable test to do.

Often in science, however, we wish to compare two groups of samples. From a traditional perspective we wish to know if we can reasonably conclude they are drawn from differently shaped binomial distributions and from a Bayesian perspective we wish to describe the differences between groups.

There is no popular traditional test for this, so instead we'll just look at how to address the problem in terms of Bayesian estimation. To do so we have to introduce a technique called the Monte Carlo method, named after the famous casino in Monte Carlo. Casinos make their money on statistics: every game gives some edge to the house, and when thousands of games are played that slight edge becomes a guaranteed source of income.

In statistics the Monte Carlo method is used to avoid having to do math! (Indeed much of the history of mathematics consists of brilliant thinkers trying to avoid mathematics.)

The Monte Carlo method works by producing large numbers of random values. For example we can generate a binomial distribution by flipping 6 coins, summing up the heads, and repeating the process ten thousand times.

## Flip 6 fair coins, sum the heads, and repeat ten thousand times.
barplot(table(colSums(replicate(10000,sample(c(0,1),6,replace=TRUE)))))

[Figure: A Monte Carlo binomial distribution]
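
The bar heights should hover near the exact expected counts from dbinom(), varying a little from run to run:

## Expected count of each total across ten thousand repetitions.
round(dbinom(0:6,6,.5)*10000)
 [1]  156  938 2344 3125 2344  938  156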

Now let’s apply this to the question of comparing binomial data.

Recall that when we have binomial data in a Bayesian problem we use the Beta distribution to describe our knowledge about it. Last time we did this by using the dbeta() function to find where the density is. This time we will use the rbeta() function to produce a large number of values from the appropriate distribution.

Here is our information:
Our first coin flipped 100 heads and 50 tails.
Our second coin flipped 70 heads and 30 tails.

## Many random values from the distributions.
## We call each of them theta.
theta1 <- rbeta(10000,101,51)
theta2 <- rbeta(10000,71,31)

## The exact distributions.
comb <- seq(0,1,.01)
dens1 <- dbeta(comb,101,51)
dens2 <- dbeta(comb,71,31)

## Now let's compare the density to the sample.
par(mfrow=c(1,2))
plot(dens1~comb,xlim=c(.4,.8),type='l',lwd=2,main='Sample 1')
lines(density(theta1),col='red',lwd=2)
plot(dens2~comb,xlim=c(.5,.9),type='l',lwd=2,main='Sample 2')
lines(density(theta2),col='red',lwd=2)

[Figure: Sample 1 and Sample 2, exact densities in black with the Monte Carlo estimates in red]

Very similar, just as expected. If that isn’t close enough for your liking just increase the size of the sample.

Now we can make some simple inferences about the data. The quantile() function will produce a 95% interval and the median for each. Watch:

quantile(theta1,c(.025,.5,.975))
 2.5%   50% 97.5% 
0.590 0.665 0.737

quantile(theta2,c(.025,.5,.975))
 2.5%   50% 97.5% 
0.605 0.697 0.780

So 95% of the density of the first sample is between 0.590 and 0.737 (it is centered at 0.665).

I promised comparisons, though, and those turn out to be extremely easy to produce. The distribution of the difference between theta1 and theta2 is . . . the difference between theta1 and theta2, taken sample by sample. This is why we needed to generate ten thousand random numbers rather than just looking at the density at each point.

diff <- theta1-theta2
quantile(diff,c(.025,.5,.975))
  2.5%    50%  97.5% 
-0.145 -0.032  0.089
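
Since we have raw samples, another summary is nearly free: the posterior probability that the second coin has the larger bias is just the proportion of draws in which theta2 beats theta1.

## Proportion of draws where coin 2 looks more biased than coin 1.
mean(theta2 > theta1)
 ## about 0.70, varying slightly from run to run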

So there doesn’t seem to be much of a difference. The median difference is just -0.032, which is just about zero. Bayesian methods also allow us to visualize the distribution of results.

dth1 <- density(theta1)
dth2 <- density(theta2)
ddif <- density(diff)

r <- range(c(theta1,theta2))
y <- max(c(dth1$y,dth2$y))

par(mfrow=c(2,1))
plot(dth1,lwd=3,xlim=r,ylim=c(0,y),main='Thetas',col='blue')
lines(dth2,lwd=3,col='blue')
plot(ddif,col='turquoise',lwd=3,main='Difference')

[Figure: Thetas in blue, Difference in turquoise]

In blue we see the distributions of possible values for each coin's bias. Below, in turquoise, is the distribution of differences.

Weird Facts from MTG JSON: Converted Mana Cost

Welcome back to my irregular series Weird Facts from MTG JSON, using data made available by Robert Shultz on his site.

In Magic: The Gathering every card requires a resource called “mana” in some amount in order to be used. The converted mana cost (abbreviated cmc) can be zero or any natural number. Now we might wonder how those costs are distributed among the more than 13 thousand unique cards in the game.

Naively there are a lot of valid possibilities, but anyone familiar with the game will know that the vast majority of cards cost less than five mana. I expect they would be hard pressed to say with confidence which cost is the most common, though. Let's check it out!

First the data itself:

library(rjson)
source("CardExtract.R")
## Read the data into R.
AllSets <- fromJSON(file="AllSets.json")

## Filter all of it through the extraction function,
## bind it together, drop the joke sets, and fix the row names.
l <- lapply(AllSets,extract.cards)
AllCards <- do.call(rbind.data.frame,l)
AllCards <- AllCards[AllCards$exp.type != "un",]
rownames(AllCards) <- 1:length(AllCards$cmc)

Now we take the three pieces of information we need: the name, the cost, and the layout of each card. We get rid of everything except “ordinary” cards and then squeeze the result down to unique rows so reprints don't affect our results.

name <- AllCards$name
cmc <- AllCards$cmc
lay <- AllCards$layout
dat <- data.frame(name,cmc,lay)
dat <- dat[dat$lay == 'normal' | dat$lay == 'leveler',]
dat <- unique(dat)

Here’s where something very interesting happens so bear with me for a second. Let’s plot the counts for each unique cmc.

plot(table(dat$cmc),
	ylab='count',xlab='cmc',
	main='Costs of Magic Cards')

[Figure: Costs of Magic Cards]

If you read Wednesday’s post where we recreated the London Blitz this should look very familiar. That’s right, the costs of Magic cards follow a Poisson distribution. We can add some points that show the appropriate Poisson distribution.

m <- mean(dat$cmc)
l <- length(dat$cmc)
comb <- 0:16
plot(table(dat$cmc),
	ylab='count',xlab='cmc',
	main='Costs of Magic Cards \n Poisson In Red')
## Scale the Poisson density up to counts so it overlays the data.
points(comb,dpois(comb,m)*l,col='red',pch=16)

[Figure: Costs of Magic Cards, Poisson in red]

It's a ridiculously good fit, almost perfect. What is going on exactly? I mean, I doubt they did this deliberately. When we last saw the Poisson distribution, and in almost every explanation of it you will ever find, the Poisson was the result of pure randomness. Are the design and development teams working blindfolded?


[Image: “Look At Me, I'm The DCI”, art by Mark Rosewater, Head Designer of Magic: The Gathering, provided by mtgimage.com]

Nope. The secret is to think about where exactly the Poisson distribution comes from and forget its association with randomness. It requires only that zero and the natural numbers be the valid possibilities and that a well-defined average exists. Although Magic has never had a card which costs more than 16 mana there's no actual limitation in place; high costs are just progressively less likely.

I expect that what we are seeing here is that the developers have found that keeping the mean cost of cards at around 3.25 in any given set makes for good gameplay in the Limited format. Because they also want to keep their options open in terms of costs (a creature that costs more than 8 mana is a special occasion) the inevitable result is a Poisson distribution.
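
If you'd rather see the fit in numbers than in dots, a rough eyeball check (using the dat we built above, and assuming the costs run 0 through 16) is to set the observed counts beside the counts a Poisson with the same mean predicts:

## Observed counts next to Poisson-predicted counts.
obs <- table(factor(dat$cmc,levels=0:16))
exp <- round(dpois(0:16,mean(dat$cmc))*length(dat$cmc))
rbind(obs,exp)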

Update – extract.cards

The extract.cards() function we made a while back has been updated to include layout information about each card in anticipation of a major update to mtgjson that will bring in tokens from all of the game’s history.

For now let us use the updated function to refine our estimate of how many cards exist.

First let’s summarize the information that is in layout. We will get rid of the joke sets early on since they needlessly confuse many things.

## Read the data into R.
library(rjson)
AllSets <- fromJSON(file="AllSets.json")
source("CardExtract.R")

## Filter all of it through the extraction function, bind it together and fix the row names.
l <- lapply(AllSets,extract.cards)
AllCards <- do.call(rbind.data.frame,l)
AllCards <- AllCards[AllCards$exp.type != "un",]
rownames(AllCards) <- 1:length(AllCards$cmc)

## Layout information
summary(AllCards$layout)
      normal     vanguard        token        split         flip        plane 
       24333          116           13          100           42           74 
     leveler       scheme double-faced   phenomenon 
          30           45           66            8 

Robert currently defines 10 distinct card layouts, each of which represents some trait of the card. Normal cards are laid out like we've seen before and levelers are pretty much the same, just modified to allow a unique ability.

Some of these are not “true” cards. The vanguards, planes, schemes, and phenomena come from rare variants. Tokens are just representations of things that cards make. We will simply exclude these from our count.

The tricky part is the split cards, flip cards, and double-faced cards. Notice that there are even numbers of all of them; that's not coincidental. Each such card has two parts that are recorded independently, so there are really 50 split cards, 21 flip cards, and 33 double-faced cards.

Now we can make a more accurate determination of how many unique cards there are.

s <- length(AllCards$layout[AllCards$layout == 'split'])/2
f <- length(AllCards$layout[AllCards$layout == 'flip'])/2
d <- length(AllCards$layout[AllCards$layout == 'double-faced'])/2

simple <- with(AllCards,AllCards[layout == "normal"| layout == "leveler",])
length(unique(simple$name)) + s + f + d
13983

The Volley Problem

Here's a question for you. A line of musketeers, ten abreast, is firing on the enemy. Each man has a 10% chance of hitting his target at this range. What is the probability that at least one of them will hit in the first volley? This is a traditional sort of problem in logic and probability. To some people it seems as though there ought to be a 100% chance, since that's 10 times 10%, but further thought reveals that since there is some chance of each of them missing there must be some chance of all of them missing.

We calculate the result as 100% minus the probability of all of them missing: each man's miss probability (90%) raised to the power of the number of musketeers (10).

1-(.9^10)
0.651

So about a 65% chance of at least one of them hitting.

That's not a super informative result since it doesn't tell us anything like what the most likely number of hits is. Consider this, however: shooting at a target is a Bernoulli trial, just like the flipping of a coin. It has exactly two outcomes, hit or miss, and a particular probability. Thus the successes of the musketeers ought to follow a binomial distribution (both logically and because I have a theme going). Indeed they do; the distribution of their volleys should look like this:

[Figure: The binomial distribution of hits in a volley]

We see immediately that the most likely outcome is that 1 of them will hit and that it is almost inconceivable that more than 4 of them will hit. Musketeers were not a terribly efficient military machine, although they were quite deadly en masse.
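
You can read the same probabilities straight off of dbinom():

round(dbinom(0:10,10,.1),3)
 [1] 0.349 0.387 0.194 0.057 0.011 0.001 0.000 0.000 0.000 0.000 0.000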

Say you are the commander of this line and you're not happy with the atrocious accuracy of these soldiers. You want a 90% chance of getting at least one hit. How much better do the musketeers have to be?

Thanks to computers the easiest method, and the most general, is to just test many possible values.

## Many candidate values of q, the probability of a miss
q <- seq(.9,.5,-.01)
## The corresponding values of p.t the total probability 
## of a hit for the musketeers as a group.
p.t <- 1-(q^10)

## We set a equal to the smallest value of p.t that is equal
## to or greater than 0.9 and then check out what is on that
## line.
dat <- data.frame(q,p.t)
a <- min(dat$p.t[p.t >= .9])
dat[dat$p.t==a,]
      q         p.t
12 0.79 0.9053172

Recalling that for Bernoulli trials p = 1 – q, we conclude that the probability of hitting on each shot is 0.21, or a 21% chance. The soldiers need to double their accuracy. You will also notice that if you make a binomial distribution with the new value of p the most likely outcome is now that two of the soldiers will hit.
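
A quick check with dbinom() confirms that two hits is now the mode:

round(dbinom(0:4,10,.21),3)
 [1] 0.095 0.252 0.301 0.213 0.099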

For safety's sake let's work out the analytical solution to the problem in order to check our answer.

.9 = 1-(x^10)
.9+(x^10) = 1
x^10 = .1
x = .1^.1
x = .794
1-.794
.206

Yep, that’s basically the same answer.

What if we tried to solve our problem the other way around? Historically musket lines were enormous, and successful in part because it was easier to hand a musket to ten people than to train one person to be a really good shot. How many musketeers need to be on the line to have a 90% chance that at least one of them will hit?

We could, of course, use the same method as we did before and simply recalculate for larger and larger numbers of musketeers. If we did we would end up with a cumulative geometric distribution, a relative of the binomial.

[Figure: How Many Musketeers Do We Need?]

Like the binomial, the geometric distribution is discrete: it has values only at whole numbers.

If there is one musketeer the probability of a hit is 0.100, if there are two the probability is 0.190 (not quite double), if there are three the probability is 0.271, and so on. Eventually, at 22 musketeers, the probability rises to fully 0.902 and we're done.
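
Rather than scanning the plot we can also solve 1 - 0.9^n >= 0.9 for n directly:

## Smallest whole n with 1-.9^n >= .9
ceiling(log(1-.9)/log(.9))
 22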

The plain geometric distribution itself gives the probability that it will take N trials to get the first success. Here it is for the flipping of a fair coin:

[Figure: The geometric distribution for a fair coin]

The probability that the first heads comes on the very first flip is 50%; the probability of a tails on the first flip followed by a heads on the second is 25%, and so on. Summing these up is why the cumulative distribution shows us the probability of having gotten a heads by a given flip.
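
In R those values come from dgeom(), which counts the failures before the first success, while pgeom() gives the cumulative version we plotted above:

dgeom(0:3,.5)  ## first heads on flip 1, 2, 3, 4
 [1] 0.5000 0.2500 0.1250 0.0625
pgeom(0:3,.5)  ## a heads within 1, 2, 3, 4 flips
 [1] 0.5000 0.7500 0.8750 0.9375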

The code for our graphics in this post.

The binomial distribution. This is very simple but I include it because the xaxp argument is so rarely used. R creates tick marks automatically but not the ones I wanted in this case. The argument tells R the position of the first tick mark, the position of the last one, and the number of intervals between them.

Hits <- 0:10
Probability <- dbinom(0:10,10,.1)
volley <- data.frame(Hits,Probability)

plot(volley,type='h',lwd=5,xaxp=c(0,10,10))

The cumulative geometric distribution.

## When the code \n is used in a string R will read it as creating a new line.
## pgeom(n-1,.1) is the chance of at least one hit among n musketeers,
## so we plot pgeom(0:24,.1) against 1 to 25 musketeers.
plot(pgeom(0:24,.1),type='h',lwd=3,
	main='How Many Musketeers \n Do We Need?',xlab='Musketeers',ylab='Probability',
	ylim=c(0,1))
## Create a horizontal dashed line at our 90% target.
abline(h=.9,lty=2)
## Draw a red line that covers column 22.
segments(22,0,22,0.9015229,col='red',lwd=3.5)