The Hypergeometric Distribution

This is a distribution you can impress your friends with because it has such an absolutely awesome sounding name and a comparatively simple form.

While the binomial distribution tells us about independent events, the hypergeometric distribution tells us about a particular kind of dependent event: sampling without replacement. If you reach into a bag and take out a marble, then reach in and take out another marble, you are sampling without replacement. Like the binomial and Bernoulli distributions, the hypergeometric describes events in terms of success and failure.

Here is our scenario in mathematical terms:

Randomly select n items from a set of N items. In this set there are K possible successes and N-K possible failures (which simply means everything is either a success or a failure). The probability of picking k successes from the set of K will follow the hypergeometric distribution, which has this form:

\displaystyle \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}

Why? Let’s take a look.

Here is a random set from which we will pick four items. As per ancient statistical tradition, consider it an urn containing white and black balls.

WBBBWBWBBWBBW

We will pick four balls from the urn and want to know how likely it is that exactly three of them will be black.

Here are our variables.

N = 13 # There are 13 items
K = 8  # 8 of them are black
n = 4  # We are selecting four items
k = 3  # Looking for 3 black

First we will describe how many ways there are to choose 3 black balls from a set of 8, because there are only 8 black balls in the set. Choose? That sounds like a binomial coefficient, so let’s write KCk. Now how many ways are there to draw 1 white ball from a set of 5 white balls? That’s another binomial coefficient! We write it as N-KCn-k because N-K is the number of non-black balls and n-k is the number of non-black balls we draw.

Now we can simply multiply these together to get the number of correct combinations! There are 280 possible ways.

That’s not a probability, though, so we need to normalize it. How to do that? Well, how many ways are there to pick a combination from the set? That must be NCn since it is a set of N and we are choosing n items in total. That comes to 715.
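
We can verify both counts with R’s choose() function (covered in more detail below):

choose(8,3) * choose(5,1)
 280
choose(13,4)
 715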

So there are 280 correct results out of 715 total options. That’s just a fraction!

280/715
 0.3916084

There is a roughly 39% chance of us picking the particular combination of black and white balls that we want. And since we know that 280 is really KCk times N-KCn-k and that 715 is really NCn we can see where the equation comes from.

We can also summon up the hypergeometric distribution in R with dhyper() and its related functions.

dhyper(3,8,5,4)
 0.3916084

With so many variables there are tons of hypergeometric distributions, but here’s a simple one that shows the probability of between zero and four black balls being chosen from our urn.

Successes <- 0:4
plot(dhyper(Successes,8,5,4)~Successes,
	type='h', lwd=3,
	ylim=c(0,.5), ylab='Probability',
	main='A Hypergeometric Distribution')

[Figure: A Hypergeometric Distribution]

Obviously the hypergeometric distribution is good for more than placing bets on Jacob Bernoulli’s less interesting party games or we wouldn’t be talking about it. The hypergeometric can be used to count cards. Perhaps most famously, however, it was used by Sir Ronald Fisher to devise a statistical test. He was given the task of testing the claim that fellow biologist Muriel Bristol-Roach could tell whether her tea was poured into the cup before the milk or after. The idea was that the hypergeometric distribution told him how well a blind guesser could be expected to do on the experiment he designed.

Doctor Bristol-Roach was presented with eight cups of tea to taste, some poured milk-first and some not. The lady then tasted them one by one, reporting which type she believed each cup was. History records that the well-bred biologist was more than validated by the experiment.
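
In Fisher’s design there were four cups of each type and the taster knew it, so a blind guesser is effectively drawing four cups at random and hoping to pick out the four milk-first ones. The chance of a perfect score by luck alone is exactly a hypergeometric probability:

dhyper(4,4,4,4)
 0.01428571

About one chance in seventy.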

Binomial Coefficient

The binomial coefficient appears throughout probability theory, especially in the study of discrete probabilities. It looks like this:

\displaystyle \binom{n}{k}

Because its notation is unfamiliar it can make an otherwise comprehensible equation appear to be nonsense. Traditionally the binomial coefficient represents how many different ways you can pick k objects from a set of n objects. Consequently it is read as “n choose k”. This is why R uses the choose() function to denote it.

Let’s look at some examples:

# How many ways to choose 5 from a set of 10?
choose(10,5)
 252
# How many ways to choose 3 from a set of 7?
choose(7,3)
 35

Outside of probability and statistics the binomial coefficient is sometimes written as nCk and is read the same way.

For pairs it is easy to represent every possible combination.

Let’s look at 5C2. The easiest way to determine this by hand is to write out a triangle that contains every possible pairing. The first row is all of the pairings that contain a 1; there are four of these. Next come the pairings that contain a 2; there are also four of these, but we already wrote the pairing 1-2 in the first row. Then come the pairings that contain a 3, and then the ones that contain a 4. By the time we get to “all the pairings that contain a 5” there are none left; each of those pairs has already been written.

12, 13, 14, 15
23, 24, 25
34, 35
45

choose(5,2)
 10

How about more generally? Making a construct like this for something other than a pair isn’t nearly as easy, and it takes more time.

The equation you will most often find that defines the binomial coefficient is n!/k!(n-k)! where the “!” symbol indicates the factorial operation rather than excitement. A factorial is a number multiplied by all of the smaller natural numbers. For example:

5*4*3*2*1
 120
factorial(5)
 120

Why does n!/k!(n-k)! produce the correct answer?

This is a question that has always bothered me because people tend not to explain it, so I’d like to present a simple proof of the equation that defines the binomial coefficient. As you might have guessed, I like to think in statistical terms, so this proof will be based on probability, namely: What is the probability of picking any given combination? First notice that all combinations of the same size are equally likely, so if we can find the probability of a given combination we know the probability of all of them.

Let’s say we want a combination of three letters from a set of nine, our target is to get A, B, and C.

## The first nine letters of the alphabet.
LETTERS[1:9]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I"

When we pick the first element there are 3 possible successes out of 9 (A or B or C, since order doesn’t matter). If we do pick correctly there are now 2 possible successes out of 8 (since one of the correct letters is gone). Finally, if we pick correctly the second time there is 1 possible success out of 7.

Thus the probability of us drawing ABC, in some order, must be:

3/9 * 2/8 * 1/7
 0.0119047619 

That’s not very easy to read but fortunately the reciprocal will be. In fact it will tell us exactly the number of combinations.

1 / (3/9 * 2/8 * 1/7)
 84

Just to make sure I haven’t led you down the garden path let’s check the respectable statistical software.

choose(9,3)
 84

Awesome!

The final step is to consider what the equation 1 / (3/9 * 2/8 * 1/7) represents. In essence we’re flipping over all of those fractions, which means we can get the number of possible combinations in one step by doing exactly that.

9/3 * 8/2 * 7
 84

Now we will return to the formula n!/k!(n-k)! and see how that excitable equation results in the equation we found. If we expand the factorials we get:

\displaystyle \frac{9\times8\times7\times6\times5\times4\times3\times2\times1}{(3\times2\times1)(6\times5\times4\times3\times2\times1)}

You may recall that in equations like this common terms “cancel out” and here you can distinctly see the ones that will. All that’s left is:

\displaystyle \frac{9\times8\times7}{3\times2\times1}
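
We can confirm the arithmetic in R:

factorial(9) / (factorial(3) * factorial(6))
 84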

Quod Erat Demonstrandum

Bayesian Binomial Comparison

A while back we took a look at binomial tests. The two variations we looked at (one traditional and one Bayesian) were both single sample tests; they compared the behavior of the coin to that of a hypothetical coin with a known bias. In the field of making fair dice and fair coins this is a pretty reasonable test to do.

Often in science, however, we wish to compare two groups of samples. From a traditional perspective we wish to know if we can reasonably conclude they are drawn from differently shaped binomial distributions and from a Bayesian perspective we wish to describe the differences between groups.

There is no popular traditional test for this so instead we’ll just look at how to address the problem in terms of Bayesian estimation. To do so we have to introduce a technique called the Monte Carlo method, named after a large casino in Monte Carlo. Casinos make their money on statistics. Every game gives some edge to the house and when thousands of games are played that slight edge becomes a guaranteed source of income.

In statistics the Monte Carlo method is used to avoid having to do math! (Indeed much of the history of mathematics consists of brilliant thinkers trying to avoid mathematics.)

The Monte Carlo method works by producing large numbers of random values. For example we can generate a binomial distribution by flipping 6 coins ten thousand times and summing up the results.

## Flip 6 coins (0 for tails, 1 for heads) ten thousand times,
## sum each set of flips, and tabulate the totals.
flips <- replicate(10000, sample(c(0,1), 6, replace=TRUE))
barplot(table(colSums(flips)))

[Figure: a Monte Carlo binomial distribution from 10,000 simulated sets of flips]

Now let’s apply this to the question of comparing binomial data.

Recall that when we have binomial data in a Bayesian problem we use the Beta distribution to describe our knowledge about it. Last time we did this by using the dbeta() function to find where the density is. This time we will use the rbeta() function to produce a large number of values from the appropriate distribution.

Here is our information:
Our first coin flipped 100 heads and 50 tails.
Our second coin flipped 70 heads and 30 tails.

## Many random values from the distributions.
## We call each of them theta.
theta1 <- rbeta(10000,101,51)
theta2 <- rbeta(10000,71,31)

## The exact distributions.
comb <- seq(0,1,.01)
dens1 <- dbeta(comb,101,51)
dens2 <- dbeta(comb,71,31)

## Now let's compare the density to the sample.
par(mfrow=c(1,2))
plot(dens1~comb,xlim=c(.4,.8),type='l',lwd=2,main='Sample 1')
lines(density(theta1),col='red',lwd=2)
plot(dens2~comb,xlim=c(.5,.9),type='l',lwd=2,main='Sample 2')
lines(density(theta2),col='red',lwd=2)

[Figure: exact Beta densities (black) with Monte Carlo estimates (red) for each sample]

Very similar, just as expected. If that isn’t close enough for your liking just increase the size of the sample.

Now we can make some simple inferences about the data. The quantile() function will produce a 95% interval and the median for each. Watch:

quantile(theta1,c(.025,.5,.975))
 2.5%   50% 97.5% 
0.590 0.665 0.737

quantile(theta2,c(.025,.5,.975))
 2.5%   50% 97.5% 
0.605 0.697 0.780

So 95% of the density of the first sample is between 0.590 and 0.737 (it is centered at 0.665).

I promised comparisons, though, and those turn out to be extremely easy to produce. The difference between theta1 and theta2 is . . . the difference between theta1 and theta2. This is why we needed to generate ten thousand random numbers rather than just looking at the density at each point.

diff <- theta1-theta2
quantile(diff,c(.025,.5,.975))
  2.5%    50%  97.5% 
-0.145 -0.032  0.089

So there doesn’t seem to be much of a difference. The median difference is -0.032, which is just about zero.
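
The samples make other summaries just as easy. For instance, the proportion of samples in which theta2 exceeds theta1 estimates the probability that the second coin has the higher bias; with samples like these it comes out to roughly 0.7, suggestive but far from conclusive.

mean(theta2 > theta1)

Bayesian methods also allow us to visualize the distribution of results.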

dth1 <- density(theta1)
dth2 <- density(theta2)
ddif <- density(diff)

r <- range(c(theta1,theta2))
y <- max(c(dth1$y,dth2$y))

par(mfrow=c(2,1))
plot(dth1,lwd=3,xlim=r,ylim=c(0,y),main='Thetas',col='blue')
lines(dth2,lwd=3,col='blue')
plot(ddif,col='turquoise',lwd=3,main='Difference')

[Figure: distributions of the thetas (top) and of their difference (bottom)]

In blue we see the distribution of possible values for each coin’s bias. Below in turquoise is the distribution of differences.

Weird Facts from MTG JSON: Converted Mana Cost

Welcome back to my irregular series Weird Facts from MTG JSON, using data made available by Robert Shultz on his site.

In Magic: The Gathering every card requires a resource called “mana” in some amount in order to be used. The converted mana cost (abbreviated cmc) can be zero or any natural number. Now we might wonder how those costs are distributed among the more than 13 thousand unique cards in the game.

Naively there are a lot of valid possibilities, but anyone familiar with the game will know that the vast majority of cards cost less than five mana. I expect they would be hard pressed to say with confidence which cost is the most common, though. Let’s check it out!

First the data itself:

library(rjson)
source("CardExtract.R")
## Read the data into R.
AllSets <- fromJSON(file="AllSets.json")

## Filter all of it through the extraction function,
## bind it together, drop the joke sets, and fix the row names.
l <- lapply(AllSets,extract.cards)
AllCards <- do.call(rbind.data.frame,l)
AllCards <- AllCards[AllCards$exp.type != "un",]
rownames(AllCards) <- 1:length(AllCards$cmc)

Now we take the three pieces of information we need: the name, the cost, and the layout of each card. We get rid of everything except “ordinary” cards and then keep only the unique rows that remain so reprints don’t affect our results.

name <- AllCards$name
cmc <- AllCards$cmc
lay <- AllCards$layout
dat <- data.frame(name,cmc,lay)
dat <- dat[dat$lay == 'normal' | dat$lay == 'leveler',]
dat <- unique(dat)

Here’s where something very interesting happens so bear with me for a second. Let’s plot the counts for each unique cmc.

plot(table(dat$cmc),
	ylab='count',xlab='cmc',
	main='Costs of Magic Cards')

[Figure: Costs of Magic Cards]

If you read Wednesday’s post where we recreated the London Blitz this should look very familiar. That’s right, the costs of Magic cards follow a Poisson distribution. We can add some points that show the appropriate Poisson distribution.

m <- mean(dat$cmc)
l <- length(dat$cmc)
comb <- 0:16
plot(table(dat$cmc),
	ylab='count',xlab='cmc',
	main='Costs of Magic Cards \n Poisson In Red')
## Scale the Poisson densities up to counts and overlay them.
points(comb, dpois(comb,m)*l, col='red', pch=16)

[Figure: Costs of Magic Cards with the Poisson overlay in red]

It’s a ridiculously good fit, almost perfect. What is going on exactly? I mean, I doubt they did this deliberately. When we last saw the Poisson distribution, and in almost every explanation of it you will ever find, the Poisson is the result of pure randomness. Are the design and development teams working blindfolded?


(Image from “Look At Me I’m The DCI” provided by mtgimage.com)
(Art by Mark Rosewater, Head Designer of Magic: The Gathering)

Nope. The secret is to think about where exactly the Poisson distribution comes from and forget its association with randomness. It requires that only zero and the natural numbers be valid possibilities and that a well-defined average exists. Although Magic has never had a card which cost more than 16 mana there’s no actual limitation in place; high costs are just progressively less likely.

I expect that what we are seeing here is that the developers have found that keeping the mean cost of cards in any given set at around 3.25 makes for good gameplay in the Limited format. Because they also want to keep their options open in terms of costs (a creature that costs more than 8 mana is a special occasion), the inevitable result is a Poisson distribution.

The Poisson Distribution

The Poisson distribution is actually based on a relatively simple line of thought. How many times can an event occur in a certain period?

Let’s say that on an average day the help center gets 4 calls. Sometimes more and sometimes less. At any given moment we can either get a call or not get a call. This tells us that the probability distribution which describes how many calls we are likely to get each day is binomial. Furthermore we know that it must be of the form Binomial(N,4/N), where N is the greatest possible number of calls, because this is the only way for the mean to be 4.

The question now becomes: how many calls is it possible to get in a day? Hard to say, but presumably the phone could be ringing off the hook constantly. It’s outrageously unlikely for me to get a hundred calls in a day but there is some probability that everyone in the building could need help on the same day. It is rarely possible to set an upper limit with confidence, but watch what happens as we make the limit higher and higher.

[Figure: Binomial(N,4/N) distributions as N grows]

As we increase the limit the distribution converges on a particular shape. Poisson proved what the shape becomes when N is infinite. He also showed that this new distribution is much easier to calculate than the binomial. Since a very large N and an infinite N are nearly identical, the difference is irrelevant so long as it’s possible for the event to happen a large number of times.
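
Here is a quick sketch of that convergence, overlaying a few finite binomials (N = 8, 16, and 100 are arbitrary choices) on the limiting shape for our help center’s rate of 4:

## Binomial(N,4/N) for increasing N against the Poisson limit.
k <- 0:15
plot(dpois(k,4)~k, type='h', lwd=3,
	ylab='Probability', main='Binomial(N,4/N) Approaching Poisson(4)')
points(k, dbinom(k,8,4/8), pch=1)
points(k, dbinom(k,16,4/16), pch=2)
points(k, dbinom(k,100,4/100), pch=3)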

Since the Poisson distribution is effectively a binomial distribution with N taken to infinity, we can describe the Poisson using only the average rate, traditionally written λ.
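
The limiting form he derived gives the probability of seeing k events when the average rate is λ:

\displaystyle \frac{\lambda^k e^{-\lambda}}{k!}

Here it is plotted for an average rate of 4 events per interval (calls per day, meteor strikes per year, etc).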

[Figure: the Poisson distribution for an average rate of 4]

It is occasionally said that the Poisson distribution is the “law of rare events”, which is a sort of true but mostly confusing title. What matters is that the event could potentially happen much more often than it does.

A switchboard that gets an average of 100 calls an hour is not getting calls “rarely” but it is going to follow a Poisson distribution because a call can happen at any moment during the hour. If the switchboard were very slow and could only handle one call every ten seconds (so never more than 360 calls in an hour) it would be better to use a binomial distribution.
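
Here is a sketch of that comparison using the rates above; the capped switchboard is Binomial(360, 100/360) and it is visibly narrower than the Poisson:

## An average of 100 calls an hour: the Poisson against a
## switchboard that can take at most 360 calls.
k <- 60:140
plot(k, dpois(k,100), type='l', lwd=2,
	xlab='calls', ylab='Probability')
lines(k, dbinom(k,360,100/360), col='red', lwd=2)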

When explaining things it is often better to show than tell. Let’s look at an example of how the Poisson arises.

The Blitz was a massive bombing of Britain during World War II. A major concern of the British government was whether or not the Nazi bombs were precise weapons. If the bombs were guided their policies to protect people in the cities would have to be different. The analysts knew that if the bombing were random the distribution of bombs would have to follow an approximately Poisson distribution.

We can simulate a bombing campaign pretty easily by creating a large number of random coordinates to represent each hit. The city is ten kilometers by ten kilometers.

set.seed(10201)
x <- runif(300,0,10)
y <- runif(300,0,10)
xy <- data.frame(x,y)

And what does this attack look like? We’ll plot each point and overlay a 10×10 grid.

## ggplot2 provides the plotting functions.
library(ggplot2)
p <- ggplot(xy,aes(x=x,y=y))
p + geom_point(size=2.5,color='red') +
	geom_hline(yintercept=0:10) +
	geom_vline(xintercept=0:10) +
	scale_x_continuous(minor_breaks=NULL,breaks=NULL) +
	scale_y_continuous(minor_breaks=NULL,breaks=NULL) +
	coord_cartesian(xlim=c(0,10),ylim=c(0,10))

[Figure: the simulated bombing, 300 hits over a 10×10 grid]

During the Blitz each square had to be counted up by hand. We can do a little better. The cut() function divides each coordinate into ten sections, giving one hundred squares in total, and then we rename the sections into something easier to read.

xx <- cut(x,10)
yy <- cut(y,10)
levels(xx) <- 1:10
levels(yy) <- 1:10

A contingency table now tells us how many bombs landed in each square. As a dataframe it’s easier to work with. Making a plot to see how often each number of hits came up is then very simple.

fr <- data.frame(table(xx,yy))
plot(table(fr$Freq))
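
With 300 bombs spread over 100 squares the average rate is 3 hits per square, so as a quick check we can overlay the counts that a Poisson with that rate would predict:

## Expected number of squares suffering each hit count under Poisson(3).
hits <- 0:max(fr$Freq)
points(hits, dpois(hits,3)*100, col='red', pch=16)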

[Figure: how many squares suffered each number of hits]

It isn’t a perfect fit, of course, but as we discussed when talking about the philosophy of statistics we do not expect perfection. The trend is one that clearly approximates the Poisson distribution.

The Bernoulli Distribution

When we first started looking at statistical inference a few weeks ago the binomial distribution was introduced, which reflects the behavior of a coin that has been flipped many times. The Bernoulli distribution is the simpler case of which the binomial is a generalization; it represents a coin flipped a single time.

Here’s an example for a biased coin:

bern <- data.frame(Value=c(0,1),Probability=c(.3,.7))

plot(bern,type='h',lwd=5,ylim=c(0,1),main="Bernoulli(0.3,0.7)")
points(bern,cex=2,pch=16) # Add large, solid color points

[Figure: the Bernoulli(0.3,0.7) distribution]

Obviously you can’t get a lot of mileage out of this distribution. It is, however, the source of a lot of common terminology in statistics. By tradition Heads = 1 and Tails = 0 (which is part of the reason that in R and many other languages the logical values TRUE and FALSE equal 1 and 0, respectively). The probability of Heads is written p while the probability of Tails is written q. Since a coin only has two sides it is always true that q = 1-p and p = 1-q.
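
R has no dedicated Bernoulli functions because none are needed: a Bernoulli is just a binomial with a single trial, so the binomial family with size = 1 covers it.

## Ten flips of a biased coin with p = 0.7.
rbinom(10, 1, 0.7)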

Anything that has discrete “success” and “failure” modes can be considered a Bernoulli trial. Flipping a coin is a Bernoulli trial. Rolling a die is a Bernoulli trial if you define every face as either success or failure (i.e. “a 6 is a success” or “all odds are successes”). Some things are close enough to a Bernoulli trial that they can generally be thought of in these terms; for example, many textbooks use the sex of a newborn as an example of a Bernoulli trial even though about 1% of people are born with ambiguous or intersex characteristics.

Bernoulli’s name is also given to the concept of a Bernoulli process, which is multiple, independent, identical Bernoulli trials. This means that there is more than one trial, the trials have no effect on each other, and the probability of success is the same each time. This is what creates a binomial distribution (flipping a coin many times). On the other hand, drawing black and white marbles from a box isn’t a Bernoulli process because once you take a black marble out the probability of drawing a black marble decreases; the trials are not identical.

Weird Facts from MTG JSON: First Letters

A short little post today that will hopefully be part of an intermittent series. We’ll be interrogating the JSON file we extracted a while back.

The question we want to answer today is: What is the most common first letter in the names of cards? Naively we might think that they’re all close to equally common and with a bit more thought we’d realize that certain letters like Q, X, Y, and Z are liable to be much less common than others.

First let’s extract the name of every card that isn’t from a joke set (called the Un-sets).

a <- AllCards[AllCards$exp.type != "un",]$name

Some cards get reprinted multiple times and, while this probably wouldn’t throw off our results too dramatically, we should get rid of those repeated names in the interest of fairness. The unique() function does that for us.

names <- unique(a)

How do we get the first letter from each name, though? For that we turn to R’s string-processing capabilities. There is a large family of functions for working with strings of text. The substring() function asks us to provide it a vector (we have one of those!) then tell it where to start reading (the first letter) and where to stop reading (also the first letter).

letters <- substring(names,1,1)

Just like that we’ve got a vector of first letters for every unique name. To count up how many of each we just use the familiar table() function.

table(letters)
letters
   Æ    A    B    C    D    E    F    G    H    I 
  28  748  799 1000  827  485  636  785  504  340 
   J    K    L    M    N    O    P    Q    R    S 
 148  403  441  901  346  277  724   55  775 1993 
   T    U    V    W    X    Y    Z 
 852  169  388  542   14   51   77

barplot(table(letters),ylim=c(0,2000),
	main="First Letter of Magic Cards")

[Figure: First Letter of Magic Cards]

The letter S is outrageously popular, thanks in large part to names that start with “Sky”, “Soul”, “Sword”, and “Sylvan”. I think it’s more interesting that the rarely used letter Æ (pronounced ‘ash’) is twice as popular as X. There is an editorial policy where the word “aether” is always rendered as Æther, so it shows up on a number of cards that refer to ephemeral magic.

This also makes it easy to determine how many unique Magic cards have ever been printed:

length(names)
14308

It’s actually a bit less than that because there are twelve token cards included in the data and thirty-three double-faced cards, for which each side is counted.

Webscraping Images

Now let’s webscrape some images from mtgimage.com to see how we can work with images in R itself. We will need the RCurl and jpeg packages installed and loaded in order to do this. RCurl will give us some webscraping functions while jpeg converts .jpeg image files into a form R can read.

install.packages('RCurl')
library(RCurl)
install.packages('jpeg')
library(jpeg)

By using regular expressions we’ll look at just the creatures with the Beast subtype from our extracted JSON file, and get their ID codes.

codes <- MRD[grep("Beast", MRD$subtype),]$ID
codes
 "48436" "46556" "46123" "48603" "46115" "48600" "48083"

Since mtgimage.com is designed to be easy to reference, it is trivial to use this information to create a bunch of URLs to call on later. The paste() function drops each multiverse ID into the middle of some text. The only thing we have to do is tell it that the separator value will be nothing (by default it is a space).

Once we have the URLs we can download each image and convert it:

urls <- paste("http://mtgimage.com/multiverseid/",codes,"-crop.jpg", sep="")

jpglist <- list()
for (i in 1:7) {
	jpglist[[i]] <- readJPEG(getURLContent(urls[i]))
}

Now we have a bunch of raster images (i.e. bitmaps) and the built-in rasterImage() function to visualize them with. The simplest way to show them is to call a new plot and put the image on it.

plot.new()
rasterImage(jpglist[[1]],0,0,1,1)

Why would you want to do this with R?

To be perfectly honest it isn’t something that is called for often with statistical software but it is a great chance to get practice with how things are referenced in R.

Extracting from a JSON File

In our last post we learned how to deconstruct unfamiliar data in order to get an idea of what information is in it. Extracting the data into a workable form is the next step. Fortunately I’ve written a function to do exactly that. The annotations are contained in the code.

# Converts an extracted section of AllSets.json into a dataframe. Input should
# be in the form of AllSets$[Set Code].

extract.cards <- function(x) {
	
	# Grab the set code, block name, and the type of set.
	code <- x$code
	exp <- x$type
	if(class(x$block) == "NULL") {
		blk <- "NA"
	} else 
	blk <- x$block

	# Now we switch to just the cards.
	x <- x$cards
	
	# Create empty vectors of the correct size
	# of values for quicker loops later on.
	len <- rep("",length(x))
	type <- len
	subtype <- len
	name <- len
	cmc <- len
	pwr <- len
	tns <- len
	rare <- len
	color <- len
	ID <- len
	layout <- len

	## Each card in a given set has the same set code, 
	## block name, and expansion type. We just repeat
	## the value many times.
	set <- rep(code,length(x))
	block <- rep(blk,length(x))
	exp.type <- rep(exp,length(x))

	# A big loop wherein we extract information from each card.
	for(i in 1:length(x)) {
		card <- x[[i]]

		# Every card has exactly one rarity and exactly one name. The
		# layout type will let us filter certain unusual card types.
		name[i] <- card$name
		rare[i] <- card$rarity
		layout[i] <- card$layout
	
		# Slightly harder. Cards can have several types, subtypes, 
		# and colors. We use paste() to collapse together these
		# multiple values.
		type[i] <- paste(card$types[1:length(card$types)],collapse=" ")
		subtype[i] <- paste(card$subtypes[1:length(card$subtypes)],collapse=" ")
		color[i] <- paste(card$colors[1:length(card$colors)],collapse=" ")

		# Most complicated. These values can sometimes be blank.
		## ID (a few promotional cards lack ID codes)
		if(class(card$multiverseid) == "NULL") {
			ID[i] <- "NA"
		} else
		ID[i] <- card$multiverseid

		## Total Cost
		## Cards without a listed CMC actually have a value of
		## zero.
		if(class(card$cmc) == "NULL") {
			cmc[i] <- 0
		} else
		cmc[i] <- card$cmc

		## Power and Toughness
		if(class(card$power) == "NULL") {
			pwr[i] <- "NA"
		} else

		pwr[i] <- card$power

		if(class(card$toughness) == "NULL") {
			tns[i] <- "NA"
		} else
		tns[i] <- card$toughness
	} # End of the loop!

	## Clean up work so we can read stuff more easily! ##

	# Make power, toughness, and cmc into numerics. This will make a few
	# things into NAs but that's okay.
	pwr <- as.numeric(pwr)
	tns <- as.numeric(tns)
	cmc <- as.numeric(cmc)
	
	# Convert a few things to factors to save space.
	rare <- as.factor(rare)
	type <- as.factor(type)
	subtype <- as.factor(subtype)
	set <- as.factor(set)
	block <- as.factor(block)
	exp.type <- as.factor(exp.type)
	layout <- as.factor(layout)

	# Simplify the names for each color then convert to 
	# a factor and make empty values read as C for colorless.
	color <- gsub("White","W",color)
	color <- gsub("Blue","U",color)
	color <- gsub("Black","B",color)
	color <- gsub("Red","R",color)
	color <- gsub("Green","G",color)
	color <- as.factor(color)
	levels(color)[1] <- "C"

	# Everything goes into a dataframe.
	data.frame(
		name, ID, type, subtype, cmc, color, 
		power=pwr, toughness=tns, rare, set, 
		block, exp.type, layout,
		stringsAsFactors=F)
}

Code updated 8/2/2014 to add layout information.

Now let’s grab all of the data from the MRD set and look at a random selection of cards.

MRD <- extract.cards(AllSets$MRD)

MRD[sample(nrow(MRD), 3), ]
                  name    ID        type subtype cmc color power toughness   rare set    block  exp.type
298 Wanderguard Sentry 48439    Creature   Drone   5     U     3         3 Common MRD Mirrodin expansion
203            Regress 49061     Instant           3     U    NA        NA Common MRD Mirrodin expansion
41   Contaminated Bond 49444 Enchantment    Aura   2     B    NA        NA Common MRD Mirrodin expansion

Where is the Fangren Hunter we met last time? Regular expressions let us search for a particular string. We combine that with the subsetting notation to track down a name.

MRD[grep("Fangren Hunter",MRD$name),]
             name    ID     type subtype cmc color power toughness  mana   rare set    block  exp.type
66 Fangren Hunter 46115 Creature   Beast   5     G     4         4 3 G G Common MRD Mirrodin expansion

Awesome! We’ve taken information in an unfamiliar format and turned it into a convenient dataframe. Obviously most JSON files don’t contain information about Magic cards but extraction is hardly difficult.

Next we’ll use webscraping to extract images from the internet and put them into a format that R can work with.

Working With JSON Files

I am a big fan of Magic: The Gathering, and data regarding the game is my preferred way to experiment with new techniques since I don’t know or care enough about baseball to look at that data. Thanks to Robert Shultz from mtgjson.com an up-to-date and easy-to-read file with information about Magic cards is readily available for us to check out.

This also gives us a chance to learn how to deal with data in a foreign format.

The JSON format is a popular and incredibly lightweight way to store data. In R it can be read with the rjson package by Alex Couture-Beil. Once you’ve installed the rjson package and downloaded the AllSets.json file to your working directory we’ll load them into R.

library(rjson)
AllSets <- fromJSON(file="AllSets.json")

Notice that the JSON file loads in just a moment despite the fact that it describes more than twenty thousand objects, each with at least eleven characteristics. Much faster than reading a .csv file, for instance.

Exploring a foreign data format can be tricky. In particular we now have to deal with the fact that JSON files that are loaded into R become an unwieldy list of lists. There is no convenient way to summarize them. It is also very difficult to work with lists if you’re doing analysis or just searching for information. We would much prefer to have a dataframe. In fact the data here is so large and unfamiliar that we have to learn a new trick just to see what’s going on. If you just ask for the str() of the AllSets data you get more lines of data than R can show, not an improvement. Fortunately we can tell str() to not show every nested level of the data.

Here are the first few lines, showing just level 1.

str(AllSets, max.level=1)
List of 136
 $ LEA:List of 8
 $ LEB:List of 8
 $ ARN:List of 8

There are 136 sets included in the file (as of this post), from Limited Edition Alpha (1993) all the way through to Magic 2015 (2014). Each of them is referred to via a three letter code. If you read down the list a bit you’ll find a set abbreviated as MRD. Now let’s specify that we want to see the structure of MRD with that same safety valve to keep us from getting flooded with information.

str(AllSets$MRD, max.level=1)
List of 8
 $ name       : chr "Mirrodin"
 $ code       : chr "MRD"
 $ releaseDate: chr "2003-10-02"
 $ border     : chr "black"
 $ type       : chr "expansion"
 $ block      : chr "Mirrodin"
 $ booster    : chr [1:15] "rare" "uncommon" "uncommon" "uncommon" ...
 $ cards      :List of 306
  .. [list output truncated]

We can see a few easily understood things about the set: its name is Mirrodin, there are 306 cards in it, and the set was released on October 2nd, 2003. Additionally the Mirrodin set is part of the Mirrodin block (the storyline that was covered that year). The remaining information is fairly technical but should make sense to anyone familiar with the game.

Let’s look at the 66th card of the set in a compact form. The unlist() function turns the list of characteristics into a named vector instead. Let’s also get rid of element 15 since it’s very large and not very important.

unlist(AllSets$MRD$cards[[66]][-15])
            layout               type              types             colors 
          "normal" "Creature — Beast"         "Creature"            "Green" 
      multiverseid               name           subtypes                cmc 
           "46115"   "Fangren Hunter"            "Beast"                "5" 
            rarity             artist              power          toughness 
          "Common"    "Darrell Riche"                "4"                "4" 
          manaCost               text             number          imageName 
       "{3}{G}{G}"          "Trample"              "119"   "fangren hunter" 

The card is the creature “Fangren Hunter”. Another site by Robert Shultz, mtgimage.com, provides us with an image of the card itself.

In our next post we’ll take a look at how to extract all this data from the JSON format into a dataframe. Exploring that will give us a chance to practice again with subsetting notation and regular expressions.