Thomas and the Lie Detector

This post is partially based on Eliezer Yudkowsky’s popular introduction to Bayes’ Theorem, which I do recommend if you want a more rigorous explanation, but it is meant to illustrate a few different points along the way.

Imagine that you are a cyberpunk detective living in a dystopian version of Barcelona where one person in four is a murderer. There is so much murder going on that your grizzled old chief decides to arrest people en masse and make you sort it out. In order to ease this process you’ve been provided with a state-of-the-art lie detector, the Decto-1000. Unlike the competing brand, this one is confirmed by Science!™ to work.

You bring in your first subject and begin the interrogation.

“Are you a murderer?”
“No.”

The machine flashes LIE on its fancy screen.

Interrogation done. Throw her in the slammer and move on to the next one, right?

(No, wrong, obviously. I’d never ask a question like that when the answer is yes.)

You’re probably aware that you sometimes miss lies when people tell them. It stands to reason that the machine probably doesn’t get all of them either. Let’s go look at the box that the Decto-1000 came in.

On the cover it says “Catches 90% of Lies!”

Okay, so she’s probably a murderer. Good enough. You live in a dystopia anyway.

Not so fast. That claim is pretty easy to manipulate. I, for example, have the power to catch far more than 90% of lies. Indeed my powers allow me to perform this feat across time and space without even meeting the potential liar.

Think of a lie or a falsehood, then highlight this next section.

You thought of a lie.

With this method I can catch 100% of lies! Of course, I will also wrongly say that every truth I get is a lie, so my technique is clearly a poor one and, yes, something of a strawman. The point is that the claim on the front of the box is not nearly enough information. They are providing only a measure of “sensitivity” when we also need a measure of “specificity,” the machine’s ability to tell the difference between truths and lies. To be a bit more technical: sensitivity measures how often a real lie gets flagged as a lie (the true positive rate), while specificity measures how often a real truth gets passed as a truth (the true negative rate).
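If it helps to see those two measures as plain arithmetic, here’s a quick sketch in Python, using my “everything is a lie” technique as the example (the batch of 100 lies and 100 truths is made up purely for illustration):

```python
def sensitivity(true_positives, false_negatives):
    """Of all the actual lies, what fraction got flagged as lies?"""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Of all the actual truths, what fraction got passed as truths?"""
    return true_negatives / (true_negatives + false_positives)

# Feed my "everything is a lie" technique 100 lies and 100 truths:
print(sensitivity(true_positives=100, false_negatives=0))   # 1.0 -- catches 100% of the lies
print(specificity(true_negatives=0, false_positives=100))   # 0.0 -- passes 0% of the truths
```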

Since sensitivity and specificity are a bit too esoteric and technical to be put on a flashy advertisement, you’ll have to go digging through the manual to find this information. After going through some technicolor images of human brains being scanned by fMRI, explanations of microexpressions, and an analysis of the logic behind voice stress analysis (all very impressive, I assure you), we find the listing for sensitivity and specificity.

Conveniently, both are exactly 90%. This means that when given a lie it calls it a lie 90% of the time, and when given a truth it calls it a truth 90% of the time. As a result it is not wrong to say that the machine is correct 90% of the time. However, we will see that this is a bit misleading, and accepting it without consideration can lead us to make serious mistakes.

To clean up the language in this next section it is important to note that I’ll be assuming everyone denies being a murderer. In reality there are some people who genuinely confess and some people who falsely confess, and this probably applies in dystopian Barcelona as well. Fortunately, it doesn’t change our results if they do; it only makes the language much more confusing.

 

Back in the holding cells are 1000 people who were grabbed off the street to be interrogated. Recall that we know 250 of them are actually murderers and 750 are not.

Before we look at the results, try to imagine what they will be.

Will the machine say that 250 of them are murderers? Will it say that 750 are innocent?

. . .

. . .

When we test all of them we get the following results: the machine calls 300 of them murderers and 700 of them innocent.

The first thing you can see here is that if we believe the machine unconditionally we’re a good distance off from what the numbers really should be. That’s not surprising since the machine isn’t perfect.

Of the 250 murderers the machine will wrongly tell us to let 25 of them go (10%) and correctly tell us to lock up 225 of them (90%).

Of the 750 non-murderers the machine will wrongly tell us to lock up 75 of them (10%) and correctly tell us to let 675 of them go (90%).
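If you’d rather let a computer do the bookkeeping, here’s a quick sketch of that same breakdown in Python (a toy calculation of the numbers above, nothing more):

```python
total = 1000
murderers = 250                      # our prior: one person in four
innocents = total - murderers        # 750

sensitivity = 0.90                   # a murderer's denial gets flagged as a lie 90% of the time
specificity = 0.90                   # an innocent's denial gets passed as the truth 90% of the time

caught_murderers = sensitivity * murderers          # 225 correctly flagged
missed_murderers = murderers - caught_murderers     # 25 wrongly cleared
accused_innocents = (1 - specificity) * innocents   # 75 wrongly flagged
cleared_innocents = innocents - accused_innocents   # 675 correctly cleared

print(f"called murderers: {caught_murderers + accused_innocents:.0f}")  # 300
print(f"called innocent:  {missed_murderers + cleared_innocents:.0f}")  # 700
```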

We’re seeing that the machine is right 90% of the time, just like the advertisements say (900 of the 1000 answers are correct), yet I’ve already warned you not to think that way. This is where I part ways with Eliezer Yudkowsky. Knowing that the machine is right 90% of the time is a useful piece of information even though we may misuse that information.

Let me ask you a question: When the first woman you interviewed denied being a murderer and the machine said she was lying, what was the probability that the machine was wrong?

Go look at the breakdown up top.

Do you see it? The machine is wrong about lies 25% of the time, not 10% like we might have thought. Of the 300 people it accuses of murder, 225 are correctly accused and 75 are wrongly accused, and 75 out of 300 is 25%.

The opposite happens when the machine detects a truth. It is wrong about truths only about 4% of the time. Of the 700 people it clears, 675 are correctly let go and 25 are wrongly let go, and 25 out of 700 is roughly 4%.
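Those two numbers are exactly what Bayes’ Theorem gives you. Here’s a quick sketch of the same calculation done with probabilities instead of head counts:

```python
p_murderer = 0.25                       # prior: one detainee in four is a murderer
p_innocent = 1 - p_murderer

p_flag_given_murderer = 0.90            # sensitivity: a murderer's denial reads LIE
p_flag_given_innocent = 0.10            # 1 - specificity: an innocent's denial reads LIE

# Total probability that any given denial reads LIE.
p_flag = p_flag_given_murderer * p_murderer + p_flag_given_innocent * p_innocent  # 0.30

# Bayes' Theorem: probability of guilt given a LIE reading.
p_murderer_given_flag = p_flag_given_murderer * p_murderer / p_flag   # 0.75
print(f"machine says LIE, chance it's wrong: {1 - p_murderer_given_flag:.0%}")   # 25%

# And given a TRUTH reading.
p_clear = 1 - p_flag                                                   # 0.70
p_murderer_given_clear = (1 - p_flag_given_murderer) * p_murderer / p_clear
print(f"machine says TRUTH, chance it's wrong: {p_murderer_given_clear:.0%}")    # about 4%
```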

It is very simple to understand what is happening here: even in our horrible dystopia most people aren’t murderers. Because the innocent so heavily outnumber the guilty, even a 10% false positive rate produces a lot of wrongly accused people in absolute terms compared to the number of correctly accused murderers.

Now I’ll give you two situations so we can see how and why these differences matter.

The grizzled old police veteran who runs your precinct (he has a robot leg and an eyepatch) comes in and looks over everything we’ve worked out. He’s a savvy person and he tells you that you should just put everyone the machine says is a murderer in prison. After all, he points out, this will get 90% of the murderers off the streets.

Is he right?

Yes, he is. This is similar to what is known as the long-run interpretation of statistics. In the long run the machine is right 90% of the time. Now, perhaps a 10% error rate is unacceptably high, but the logic behind the order is perfectly sound.

Next you are called into court to testify about the lie detector results for the woman you interviewed at the start of the article, the one the machine flagged as a murderer. The lawyer for the prosecution tells the jury that the machine only has a 10% chance of being wrong, so they must convict her. When the lawyer for the defense stands up he asks you if this is true.

You must say no, that the prosecution is misunderstanding the statistics. The machine is right 90% of the time, but on each individual test it either has a 75% chance of being right (when it reports a lie) or a 96% chance of being right (when it reports a truth). On any given test the probability of being correct is never 90%. In the case of the accused woman there is a 25% chance that the machine was wrong.

This difference is why you may see people write weird things like “just because this method is right 95% of the time doesn’t mean it’s right with 95% probability” without any explanation when talking about statistics. There are different ways of talking about probability.

 

Actually, that’s what made me write this post in the first place. For some reason no one, anywhere, ever, feels the need to explain this very simple difference. The next post will explore the issue of “prior distributions,” which were a critical piece of information in this post. Which is to say: “How did I know one in four people were murderers in the first place?”

Floor and Ceiling Effects

How would you measure the weight of a dime using only a bathroom scale?

If you put the dime on the scale nothing will happen. The scale isn’t sensitive enough to measure the weight of the dime at all, so it says zero just as if nothing were on it. This is a floor effect; we cannot measure something below the scale’s sensitivity. Now, in the case of the dime it’s not very important, since we’re not going to mistakenly conclude that dimes have no weight, but there are cases where it might matter.

For example: At the time of this writing physicists say that neutrinos have a very low mass but won’t specify what it is. Originally they assumed that neutrinos were massless, but because of floor effects physicists knew they couldn’t prove this by directly measuring the mass: a reading of zero might just mean the mass is below what their instruments can detect. In the end they were able to measure other properties of the particles to determine that neutrinos must have mass, even if they can’t yet say exactly what it is.

A ceiling effect is the opposite of a floor effect: we cannot measure something above the scale’s limit. Using a bathroom scale to weigh an elephant is just as problematic as trying to weigh a dime. If the scale survives it will say that the elephant weighs a few hundred pounds, because that is the greatest weight the scale can measure. Again, we’re not going to make the mistake of thinking that elephants only weigh as much as people, but there are cases where it can be relevant.

Tests are the most common things that are vulnerable to ceiling effects.

Psychologists like to use matching tasks to determine information about working memory, the information that you can hold in mind. These tasks are very simple: you’re shown an image and have to identify it later. Unfortunately, people are nearly perfect at them even with a long delay between showing the image and identifying it. This is a useful bit of information in and of itself, but it makes comparisons very tricky. Think of it this way: how different are two people who both get a perfect score on a series of matching tasks?

We have almost no idea. People with typical working memory, extraordinary working memory, and slightly impaired working memory will all score at about 100% on the task. All of them are indistinguishable.

Still, matching tasks are nice because they’re easy to apply many times to lots of people in order to get a bunch of useful data, so psychologists would like to keep them in their toolset. In order to make matching tasks useful they have to be made arbitrarily more difficult by doing things like having people recite long meaningless sentences during the delay. By forcing down the accuracy of the subjects it becomes possible to get a proper measurement.

Even if you’ve never taken a psychological test, you might be familiar with tests taking something simple and making it harder. Tests like the SAT, GCSE, or TOEFL can feel unfairly hard. While I’m not a test writer for any of these companies, it’s likely that various “gotcha” questions are in them because the test makers have little choice. Standardized tests are not given to a random selection of the population; they are given to the section of the population that is most likely to know the answers. The SAT tests high school students on information expected to be taught to high school students. Even worse, people who expect to fail may not take the test at all, and many people will prepare just to get a better score.

All of these factors push the typical ability of a student taking the test to a level far higher than for any other kind of test. If the SAT were “fair,” almost everyone taking it would score highly, making the test completely useless. To combat this without requiring that high school students know more things, additional difficulties are added to the test. The result is questions that are genuinely confusing.

Of course, this means that standardized tests are measuring things they don’t advertise and arguably corrupting their own utility.

 

Floor and ceiling effects are important on a philosophical level as well, since they’re a very accessible example of how critical it is to understand the limits of our knowledge. If we didn’t understand that our measuring devices are limited (and what those limits are) we would lead ourselves into absurd beliefs. There is an anthropological fable about a culture that counts “one, two, three, many” that serves as a useful parallel. The word “many” is vague, but it is clearly better than “four” in this case since it acknowledges uncertainty, and uncertainty is what we have. Knowing that we don’t know something is useful; it’s the only reason we look for new ways to learn.

Mission Statement

For some reason people don’t seem to put much time into explaining things with the intent that people gain a sense of understanding. When I look up something online I tend to find either explanations I don’t trust or ones that assume I’m already an expert.

I want to change that.

Go look at the Wikipedia page on the Central Limit Theorem; I’ll wait. It is a very complete explanation of the concept, but all the parts that a passing reader is interested in are buried in jargon and calculus that most people have no way of knowing beforehand. The closest thing you get to a clear explanation of what the CLT means is: “In probability theory, the central limit theorem (CLT) states that, given certain conditions, the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed.”

That’s not wrong. It’s worse: it’s intimidating. I bet that if you spent a little while thinking about the words and looking up the jargon you’d get an idea of what it means, but I don’t feel like you should have to. Here’s a quick explanation stripped of the jargon: “The central limit theorem says that the average of a big random sample will usually be pretty close to the average of whatever it is a sample of.”

That’s it.

It’s pretty vague, I’ll admit, but if all you want is an understanding of the CLT it’s plenty. The specifics are important (actually the details are responsible for the whole field of statistics) but they’re not always necessary.
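If you’d rather see it happen than take my word for it, here’s a quick simulation (the die-rolling population is purely my own toy example, nothing from the Wikipedia article):

```python
import random

# A population that looks nothing like a bell curve: the faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
population_mean = sum(population) / len(population)   # 3.5

# Take lots of big random samples and record each sample's average.
sample_means = []
for _ in range(10_000):
    sample = [random.choice(population) for _ in range(100)]
    sample_means.append(sum(sample) / len(sample))

# Nearly all of those averages land close to 3.5, and if you histogram them
# they pile up in the familiar bell shape the CLT promises.
close = sum(1 for m in sample_means if abs(m - population_mean) < 0.3)
print(f"population average: {population_mean}")
print(f"{close / len(sample_means):.0%} of the sample averages are within 0.3 of it")
```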

So the purpose of this blog is to take complicated subjects and pull the important ideas out of the piles of overwhelming detail.