Weird Facts from MTG JSON: Converted Mana Cost

Welcome back to my irregular series Weird Facts from MTG JSON, using data made available by Robert Shultz on his site.

In Magic: The Gathering every card requires a resource called “mana” in some amount in order to be used. The converted mana cost (abbreviated cmc) can be zero or any natural number. Now we might wonder how those costs are distributed among the more than 13 thousand unique cards in the game.

Naively there are a lot of valid possibilities, but anyone familiar with the game will know that the vast majority of cards cost less than five mana. I expect they would be hard pressed to say with confidence which cost is the most common, though. Let’s check it out!

First the data itself:

library(rjson)
source("CardExtract.R")
## Read the data into R.
AllSets <- fromJSON(file="AllSets.json")

## Filter all of it through the extraction function,
## bind it together and fix the row names.
l <- lapply(AllSets,extract.cards)
AllCards <- do.call(rbind.data.frame,l)
## Drop the joke sets (the Un-sets) once AllCards exists.
AllCards <- AllCards[AllCards$exp.type != "un",]
rownames(AllCards) <- 1:length(AllCards$cmc)
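If the bind-then-filter step feels opaque, here is a minimal sketch of the same pattern on a toy list. The set codes and cards are made up for illustration; the real list comes from lapply(AllSets,extract.cards).

```r
# Toy stand-in for the per-set data frames returned by extract.cards().
# The set codes and cards here are invented for illustration.
per.set <- list(
	AAA = data.frame(name = c("Card One", "Card Two"), cmc = c(1, 3),
		exp.type = "expansion", stringsAsFactors = FALSE),
	UUU = data.frame(name = "Joke Card", cmc = 2,
		exp.type = "un", stringsAsFactors = FALSE)
)

# Stack the list into one data frame, then drop the joke-set rows.
toy <- do.call(rbind.data.frame, per.set)
toy <- toy[toy$exp.type != "un", ]

# The stacked row names are built from the list names, so reset them.
rownames(toy) <- seq_len(nrow(toy))
toy
```

The order matters: the filter can only run once the combined data frame exists.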

Now we take the three pieces of information we need: the name, the cost, and the layout for each card. We get rid of everything except “ordinary” cards and then squeeze out everything unique that remains so reprints don’t affect our results.

name <- AllCards$name
cmc <- AllCards$cmc
lay <- AllCards$layout
dat <- data.frame(name,cmc,lay)
dat <- dat[dat$lay == 'normal' | dat$lay == 'leveler',]
dat <- unique(dat)
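To see why unique() takes care of reprints, here is a small sketch on invented cards. The names and values are assumptions chosen only to show the behaviour.

```r
# Made-up cards: one reprint (same name, cost, layout) and one token.
toy <- data.frame(
	name = c("Bear", "Bear", "Leveler Guy", "Goblin Token"),
	cmc  = c(2, 2, 3, 0),
	lay  = c("normal", "normal", "leveler", "token"),
	stringsAsFactors = FALSE
)

# %in% is an equivalent, slightly shorter way to write the two == tests.
toy <- toy[toy$lay %in% c("normal", "leveler"), ]

# unique() on a data frame drops rows that duplicate an earlier row
# in every column, which is how reprints disappear.
toy <- unique(toy)
nrow(toy)
```

Only two rows survive: the token is filtered out by layout and the second “Bear” is dropped as a duplicate.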

Here’s where something very interesting happens so bear with me for a second. Let’s plot the counts for each unique cmc.

plot(table(dat$cmc),
	ylab='count',xlab='cmc',
	main='Costs of Magic Cards')

[Figure: Costs of Magic Cards (count against cmc)]

If you read Wednesday’s post where we recreated the London Blitz this should look very familiar. That’s right, the costs of Magic cards follow a Poisson distribution. We can add some points that show the appropriate Poisson distribution.

m <- mean(dat$cmc)
l <- length(dat$cmc)
comb <- 0:16
plot(table(dat$cmc),
	ylab='count',xlab='cmc',
	main='Costs of Magic Cards \n Poisson In Red')
## Expected counts under a Poisson with the same mean.
points(comb,dpois(comb,m)*l,col='red')

[Figure: Costs of Magic Cards, Poisson in red]

It’s a ridiculously good fit, almost perfect. What is going on exactly? I mean, I doubt they did this deliberately. When we last saw the Poisson distribution, and in almost every explanation of it you will ever find, the Poisson is the result of pure randomness. Are the design and development teams working blindfolded?


(Image from “Look At Me I’m The DCI” provided by mtgimage.com)
(Art by Mark Rosewater, Head Designer of Magic: The Gathering)

Nope. The secret is to think about where exactly the Poisson distribution comes from and forget its association with randomness. It requires that only zero and the natural numbers be valid possibilities and that a well defined average exists. Although Magic has never had a card which cost more than 16 mana, there’s no actual limitation in place; high costs are just progressively less likely.

I expect that what we are seeing here is that the developers have found that keeping the mean cost of cards at around 3.25, in any given set, makes for good gameplay in the Limited format. Because they also want to keep their options open in terms of costs (a creature that costs more than 8 mana is a special occasion), the inevitable result is a Poisson distribution.
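For anyone who wants to see how expected Poisson counts are computed without the card data, here is a rough sketch on simulated costs. The simulated vector stands in for dat$cmc and is an assumption, not the real data.

```r
set.seed(1)
# Simulated costs standing in for dat$cmc (an assumption, not real data).
sim <- rpois(1000, lambda = 3.25)

m <- mean(sim)      # estimated Poisson rate
n <- length(sim)    # number of simulated cards
k <- 0:16           # the costs we want expected counts for

# dpois() gives probabilities; multiplying by n turns them into counts.
expected <- dpois(k, lambda = m) * n

# Observed counts on the same 0:16 grid, with zeros for missing costs.
observed <- table(factor(sim, levels = k))

# Side by side: cost, observed count, expected count.
round(cbind(cmc = k, observed = as.vector(observed), expected), 1)
```

The only input the Poisson needs is the mean, which is why a single number like 3.25 pins down the whole curve.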

Update – extract.cards

The extract.cards() function we made a while back has been updated to include layout information about each card in anticipation of a major update to mtgjson that will bring in tokens from all of the game’s history.

For now let us use the updated function to refine our estimate of how many cards exist.

First let’s summarize the information that is in layout. We will get rid of the joke sets early on since they needlessly confuse many things.

## Read the data into R.
AllSets <- fromJSON(file="AllSets.json")
source("CardExtract.R")

## Filter all of it through the extraction function bind it together and fix the row names.
l <- lapply(AllSets,extract.cards)
AllCards <- do.call(rbind.data.frame,l)
## Drop the joke sets (the Un-sets) once AllCards exists.
AllCards <- AllCards[AllCards$exp.type != "un",]
rownames(AllCards) <- 1:length(AllCards$cmc)

## Layout information
summary(AllCards$layout)
      normal     vanguard        token        split         flip        plane 
       24333          116           13          100           42           74 
     leveler       scheme double-faced   phenomenon 
          30           45           66            8 

Robert currently defines 10 distinct card layouts, each of which represents some trait of the card. Normal cards are laid out like we’ve seen before and levelers are pretty much the same, just modified to allow a unique ability.

Some of these are not “true” cards. The vanguards, planes, schemes, and phenomena come from rare variants. Tokens are just representations of things that cards make. We will simply exclude these from our count.

The tricky part is the split cards, flip cards, and double-faced cards. Notice that there are even numbers of all of them; that’s not coincidental. Each card has two parts that are recorded independently. So there are really 50 split cards, 21 flip cards, and 33 double-faced cards.

Now we can make a more accurate determination of how many unique cards there are.

s <- length(AllCards$layout[AllCards$layout == 'split'])/2
f <- length(AllCards$layout[AllCards$layout == 'flip'])/2
d <- length(AllCards$layout[AllCards$layout == 'double-faced'])/2

simple <- with(AllCards,AllCards[layout == "normal"| layout == "leveler",])
name <- simple$name
length(unique(simple$name)) + s + f + d
13983
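The counting logic above can be tried on a toy data frame. The cards below are invented purely to show the arithmetic.

```r
# Invented cards: a reprinted normal card, one split card recorded as
# two halves, and one double-faced card recorded as two faces.
toy <- data.frame(
	name = c("Bear", "Bear", "Fire", "Ice", "Day Side", "Night Side"),
	layout = c("normal", "normal", "split", "split",
		"double-faced", "double-faced"),
	stringsAsFactors = FALSE
)

# Each two-part card contributes two rows, so halve the row count.
two.part <- c("split", "flip", "double-faced")
halves <- sum(toy$layout %in% two.part) / 2

# One-part cards are counted by unique name so reprints collapse.
simple <- toy[toy$layout %in% c("normal", "leveler"), ]
length(unique(simple$name)) + halves
```

This gives 3: one normal card plus two two-part cards. Note that halving the row count treats every two-part row as a distinct printing, so a reprinted split, flip, or double-faced card would be counted more than once.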

Weird Facts from MTG JSON: First Letters

A short little post today that will hopefully be part of an intermittent series. We’ll be interrogating the JSON file we extracted a while back.

The question we want to answer today is: What is the most common first letter in the names of cards? Naively we might think that they’re all close to equally common and with a bit more thought we’d realize that certain letters like Q, X, Y, and Z are liable to be much less common than others.

First let’s extract the name of every card that isn’t from a joke set (called the Un-sets).

a <- AllCards[AllCards$exp.type != "un",]$name

Some cards get reprinted multiple times and, while this probably wouldn’t throw off our results too dramatically, we should get rid of those repeated names in the interest of fairness. The unique() function does that for us.

names <- unique(a)

How to get the first letter from each name, though? For that we have to return to the string-handling capabilities that R has. There is a large family of functions for working with strings of text. The substring() function asks us to provide it a vector (we have one of those!) then tell it where to start reading (the first letter) and where to stop reading (also the first letter).

letters <- substring(names,1,1)

Just like that we’ve got a vector of first letters for every unique name. To count up how many of each we just use the familiar table() function.

table(letters)
letters
   Æ    A    B    C    D    E    F    G    H    I 
  28  748  799 1000  827  485  636  785  504  340 
   J    K    L    M    N    O    P    Q    R    S 
 148  403  441  901  346  277  724   55  775 1993 
   T    U    V    W    X    Y    Z 
 852  169  388  542   14   51   77

barplot(table(letters),ylim=c(0,2000),
	main="First Letter of Magic Cards")

[Figure: First Letter of Magic Cards]

The letter S is outrageously popular, thanks in large part to names that start with “Sky”, “Soul”, “Sword”, and “Sylvan”. I think it’s more interesting that the rarely used letter Æ (pronounced ‘ash’) is twice as popular as X. There is an editorial policy where the word “aether” is always rendered as Æther, so it shows up on a number of cards that refer to ephemeral magic.

This also makes it easy to determine how many unique Magic cards have ever been printed:

length(names)
14308

It’s actually a bit less than that because there are twelve token cards included in the data and thirty-three double-faced cards, for which each side is counted.
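The substring() and table() steps can be tried on a handful of names without loading the whole file. This is a small sketch with a hand-picked vector; the variable names are my own and were chosen to avoid masking R’s built-in letters constant and names() function.

```r
# A handful of card names, picked only for illustration.
card.names <- c("Sylvan Library", "Skyknight Legionnaire", "Soul Warden",
	"Giant Growth", "Grizzly Bears")

# Start and stop at position 1 to keep just the first character.
first <- substring(card.names, 1, 1)

# Count each letter and put the most common first.
counts <- table(first)
sort(counts, decreasing = TRUE)
```

Here S comes out on top with three names against two for G.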

Extracting from a JSON File

In our last post we learned about how to deconstruct unfamiliar data in order to get an idea of what information is in it. Extracting the data into a workable form is the next step. Fortunately I’ve written a function to do exactly that. The annotations are contained in the code.

# Converts an extracted section of AllSets.json into a dataframe. Input should
# be in the form of AllSets$[Set Code].

extract.cards <- function(x) {
	
	# Grab the set code, block name, and the type of set.
	code <- x$code
	exp <- x$type
	if(class(x$block) == "NULL") {
		blk <- "NA"
	} else 
	blk <- x$block

	# Now we switch to just the cards.
	x <- x$cards
	
	# Create empty vectors of the correct size
	# of values for quicker loops later on.
	len <-rep("",length(x))
	type <- len
	subtype <- len
	name <- len
	cmc <- len
	pwr <- len
	tns <- len
	rare <- len
	color <- len
	ID <- len
	layout <- len

	## Each card in a given set has the same set code, 
	## block name, and expansion type. We just repeat
	## the value many times.
	set <- rep(code,length(x))
	block <- rep(blk,length(x))
	exp.type <- rep(exp,length(x))

	# A big loop wherein we extract information from each card.
	for(i in seq_along(x)) {
		card <- x[[i]]

		# Every card has exactly one rarity and exactly one name. The
		# layout type will let us filter certain unusual card types.
		name[i] <- card$name
		rare[i] <- card$rarity
		layout[i] <- card$layout
	
		# Slightly harder. Cards can have several types, subtypes, 
		# and colors. We use paste() to collapse together these
		# multiple values.
		type[i] <- paste(card$types,collapse=" ")
		subtype[i] <- paste(card$subtypes,collapse=" ")
		color[i] <- paste(card$colors,collapse=" ")

		# Most complicated. These values can sometimes be blank.
		## ID (a few promotional cards lack ID codes)
		if(class(card$multiverseid) == "NULL") {
			ID[i] <- "NA"
		} else
		ID[i] <- card$multiverseid

		## Total Cost
		## Cards without a listed CMC actually have a value of
		## zero.
		if(class(card$cmc) == "NULL") {
			cmc[i] <- 0
		} else
		cmc[i] <- card$cmc

		## Power and Toughness
		if(class(card$power) == "NULL") {
			pwr[i] <- "NA"
		} else

		pwr[i] <- card$power

		if(class(card$toughness) == "NULL") {
			tns[i] <- "NA"
		} else
		tns[i] <- card$toughness
	} # End of the loop!

	## Clean up work so we can read stuff more easily! ##

	# Make power, toughness, and cmc into numerics. This will make a few
	# things into NAs but that's okay.
	pwr <- as.numeric(pwr)
	tns <- as.numeric(tns)
	cmc <- as.numeric(cmc)
	
	# Convert a few things to factors to save space.
	rare <- as.factor(rare)
	type <- as.factor(type)
	subtype <- as.factor(subtype)
	set <- as.factor(set)
	block <- as.factor(block)
	exp.type <- as.factor(exp.type)
	layout <- as.factor(layout)

	# Simplify the names for each color then convert to 
	# a factor and make empty values read as C for colorless.
	color <- gsub("White","W",color)
	color <- gsub("Blue","U",color)
	color <- gsub("Black","B",color)
	color <- gsub("Red","R",color)
	color <- gsub("Green","G",color)
	color <- as.factor(color)
	levels(color)[levels(color) == ""] <- "C"

	# Everything goes into a dataframe.
	data.frame(
		name, ID, type, subtype, cmc, color, 
		power=pwr, toughness=tns, rare, set, 
		block, exp.type, layout,
		stringsAsFactors=FALSE)
}

Code updated 8/2/2014 to add layout information.
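The repeated “is this field missing?” checks in the function follow one pattern, which can be sketched as a tiny helper. The helper name or.default is hypothetical and not part of the function above; the card is made up.

```r
# Hypothetical helper (not part of extract.cards): return the value
# if it is present, otherwise fall back to a default.
or.default <- function(value, default) {
	if (is.null(value)) default else value
}

# Made-up card with no cmc, power, or toughness fields, like a land.
card <- list(name = "Some Land", rarity = "Common")

# Asking a list for a field it lacks returns NULL, which is what
# the helper tests for.
cmc  <- or.default(card$cmc, 0)
pwr  <- or.default(card$power, NA)
name <- or.default(card$name, NA)
```

A helper like this would shorten the loop, at the cost of hiding the per-field comments that explain why each default was chosen.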

Now let’s grab all of the data from the MRD set and look at a random selection of cards.

MRD <- extract.cards(AllSets$MRD)

MRD[sample(nrow(MRD), 3), ]
                  name    ID        type subtype cmc color power toughness   rare set    block  exp.type
298 Wanderguard Sentry 48439    Creature   Drone   5     U     3         3 Common MRD Mirrodin expansion
203            Regress 49061     Instant           3     U    NA        NA Common MRD Mirrodin expansion
41   Contaminated Bond 49444 Enchantment    Aura   2     B    NA        NA Common MRD Mirrodin expansion

Where is the Fangren Hunter we met last time? Regular expressions let us search for a particular string. We combine that with the subsetting notation to track down a name.

MRD[grep("Fangren Hunter",MRD$name),]
             name    ID     type subtype cmc color power toughness  mana   rare set    block  exp.type
66 Fangren Hunter 46115 Creature   Beast   5     G     4         4 3 G G Common MRD Mirrodin expansion

Awesome! We’ve taken information in an unfamiliar format and turned it into a convenient dataframe. Obviously most JSON files don’t contain information about Magic cards but extraction is hardly difficult.
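One thing to keep in mind with grep() is that it matches anywhere in the string, so a short pattern can return more rows than expected. Here is a small sketch on a toy data frame; the rows are for illustration only.

```r
# A toy data frame with two similarly named cards.
toy <- data.frame(
	name = c("Fangren Hunter", "Fangren Pathcutter", "Regress"),
	cmc = c(5, 6, 3),
	stringsAsFactors = FALSE
)

# grep() returns the indices of the names that contain the pattern,
# so a partial pattern picks up both Fangren cards.
toy[grep("Fangren", toy$name), ]

# Anchoring the pattern with ^ and $ demands a whole-name match.
toy[grep("^Fangren Hunter$", toy$name), ]

# For a literal, exact name a plain comparison also works.
toy[toy$name == "Fangren Hunter", ]
```

The last two lines return the same single row; the plain comparison is the safer choice when a name contains characters that mean something special in a regular expression.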

Next we’ll use webscraping to extract images from the internet and put them into a format that R can work with.

Working With JSON Files

I am a big fan of Magic: The Gathering, and data regarding the game is my preferred way of experimenting with new techniques, since I don’t know or care enough about baseball to look at that data. Thanks to Robert Shultz from mtgjson.com, an up-to-date and easy-to-read file with information about Magic cards is readily available for us to check out.

This also gives us a chance to learn about how to deal with data in a foreign format.

The JSON format is a popular and incredibly lightweight way to store data. In R it can be read with the rjson package by Alex Couture-Beil. Once you’ve installed the rjson package and downloaded the AllSets.json file to your working directory we’ll load them into R.

library(rjson)
AllSets <- fromJSON(file="AllSets.json")

Notice that the JSON file loads in just a moment despite the fact that it describes more than twenty thousand objects, each with at least eleven characteristics. Much faster than reading a .csv file, for instance.

Exploring a foreign data format can be tricky. In particular we now have to deal with the fact that JSON files that are loaded into R become an unwieldy list of lists. There is no convenient way to summarize them. It is also very difficult to work with lists if you’re doing analysis or just searching for information. We would much prefer to have a dataframe. In fact the data here is so large and unfamiliar that we have to learn a new trick just to see what’s going on. If you just ask for the str() of the AllSets data you get more lines of data than R can show, not an improvement. Fortunately we can tell str() to not show every nested level of the data.

Here are the first few lines, showing just level 1.

str(AllSets, max.level=1)
List of 136
 $ LEA:List of 8
 $ LEB:List of 8
 $ ARN:List of 8

There are 136 sets included in the file (as of this post), from Limited Edition Alpha (1993) all the way through to Magic 2015 (2014). Each of them is referred to via a three letter code. If you read down the list a bit you’ll find a set abbreviated as MRD. Now let’s specify that we want to see the structure of MRD with that same safety valve to keep us from getting flooded with information.

str(AllSets$MRD, max.level=1)
List of 8
 $ name       : chr "Mirrodin"
 $ code       : chr "MRD"
 $ releaseDate: chr "2003-10-02"
 $ border     : chr "black"
 $ type       : chr "expansion"
 $ block      : chr "Mirrodin"
 $ booster    : chr [1:15] "rare" "uncommon" "uncommon" "uncommon" ...
 $ cards      :List of 306
  .. [list output truncated]

We can see a few easily understood things about the set: its name is Mirrodin, there are 306 cards in it, and the set was released on October 2nd 2003. Additionally the Mirrodin set is part of the Mirrodin block (the storyline that was covered that year). The remaining information is fairly technical but should make sense to anyone familiar with the game.
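The same kind of exploration can be practised on a tiny list of lists before tackling the real file. The structure below is invented; it only mimics the shape of the parsed JSON.

```r
# A made-up two-set structure shaped like the parsed JSON.
toy.sets <- list(
	AAA = list(name = "Set A", code = "AAA",
		cards = list(list(name = "Card One"), list(name = "Card Two"))),
	BBB = list(name = "Set B", code = "BBB",
		cards = list(list(name = "Card Three")))
)

# Only the top level: one line per set.
str(toy.sets, max.level = 1)

# The names of the top-level entries are the set codes.
names(toy.sets)

# Number of cards in each set.
n.cards <- sapply(toy.sets, function(s) length(s$cards))
n.cards
```

Combining names() and sapply() like this is one way to get a summary out of a structure that summary() itself can’t digest.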

Let’s look at the 66th card of the set in a compact form. The unlist() function turns the list of characteristics into a named vector instead. Let’s also get rid of element 15 since it’s very large and not very important.

unlist(AllSets$MRD$cards[[66]][-15])
            layout               type              types             colors 
          "normal" "Creature — Beast"         "Creature"            "Green" 
      multiverseid               name           subtypes                cmc 
           "46115"   "Fangren Hunter"            "Beast"                "5" 
            rarity             artist              power          toughness 
          "Common"    "Darrell Riche"                "4"                "4" 
          manaCost               text             number          imageName 
       "{3}{G}{G}"          "Trample"              "119"   "fangren hunter" 

The card is the creature “Fangren Hunter”. Another site by Robert Shultz, mtgimage.com, provides us with an image of the card itself.
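It is worth noticing that every value in the unlist() output is quoted, even the numbers. A small sketch shows why; the card below is a cut-down, partly invented version of the real entry, and the bulky fifth field is a made-up stand-in.

```r
# A cut-down card with a bulky made-up field in position 5.
card <- list(name = "Fangren Hunter", cmc = 5,
	colors = list("Green"), types = list("Creature"),
	bulky = list("a long entry", "another long entry"))

# Negative indexing drops an element by position; here the 5th.
flat <- unlist(card[-5])
flat

# A vector can hold only one type, so the number 5 became the
# string "5" alongside the text fields.
class(flat)
```

This is why the extraction function converts cmc, power, and toughness back with as.numeric() after collecting them.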

In our next post we’ll take a look at how to extract all this data from the JSON format into a dataframe. Exploring that will give us a chance to practice again with subsetting notation and regular expressions.