Are the F.B.I.’s Probabilities About DNA Matches Crazy?

Jason Felch and Maura Dolan of the Los Angeles Times recently wrote a fascinating piece about a controversy that has arisen regarding the use of DNA in identifying criminal suspects. The article starts like this:

State crime lab analyst Kathryn Troyer was running tests on Arizona’s DNA database when she stumbled across two felons with remarkably similar genetic profiles.

The men matched at 9 of the 13 locations on chromosomes, or loci, commonly used to distinguish people.

The [Federal Bureau of Investigation] estimated the odds of unrelated people sharing those genetic markers to be as remote as 1 in 113 billion. But the mug shots of the two felons suggested that they were not related: One was black, the other white.

In the years after her 2001 discovery, Troyer found dozens of similar matches — each seeming to defy impossible odds.

As word spread, these findings by a little-known lab worker raised questions about the accuracy of the F.B.I.’s DNA statistics and ignited a legal fight over whether the nation’s genetic databases ought to be opened to wider scrutiny.

Later, a systematic search of the 65,000 felons in the Arizona database revealed that there were 122 pairs that matched at 9 of 13 loci. Twenty pairs matched at 10 loci.

When I heard about this, I wondered if the F.B.I. is totally off its rocker when it comes to the probabilities it gives about DNA matches. Is it possible that the F.B.I. is right about the statistics it cites, and that there could be 122 nine-out-of-13 matches in Arizona’s database?

Perhaps surprisingly, the answer turns out to be yes. Let’s say that the chance of any two individuals matching at any one locus is 7.5 percent. In reality, the frequency of a match varies from locus to locus, but I think 7.5 percent is pretty reasonable. For instance, with a 7.5 percent chance of matching at each locus, the chance that any 2 random people would match at all 13 loci is about 1 in 400 trillion. If you choose exactly 9 loci for 2 random people, the chance that they will match all 9 is 1 in 13 billion. Those are the sorts of numbers the F.B.I. tosses around, I think.
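The arithmetic behind these two figures can be checked in a few lines. This is only a sketch: it assumes, as the paragraph above does, a single 7.5 percent match rate at every locus and independence between loci, and the variable names are illustrative.

```python
# Assumption from the post: every locus matches with the same probability,
# independently of the others. Real match rates vary from locus to locus.
p_locus = 0.075

p_all_13 = p_locus ** 13   # two random people match at all 13 loci
p_fixed_9 = p_locus ** 9   # two random people match at 9 pre-chosen loci

print(f"all 13 loci:   about 1 in {1 / p_all_13:,.0f}")
print(f"9 chosen loci: about 1 in {1 / p_fixed_9:,.0f}")
```

Under those assumptions the first figure comes out a little over 400 trillion and the second a little over 13 billion, in line with the rounded numbers quoted above.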

So under these same assumptions, how many pairs would we expect to find matching on at least 9 of 13 loci in the Arizona database? Remarkably, about 100. If you start with 65,000 people and do a pairwise match of all of them, you are actually making over 2 billion separate comparisons (65,000 * 64,999/2). And if you aren’t just looking for a match on 9 specific loci, but rather on any 9 of 13 loci, then for each of those pairs of people there are over 700 different combinations that are being searched.

So all told, you end up doing about 1.4 trillion searches! If 1 in 13 billion searches yields a positive match as noted above, this leads to roughly 100 expected matches on 9 of 13 loci in a database the size of Arizona’s. (The way I did the calculations, I am allowing for 2 individuals to match on different sets of loci; so to get 100 different pairs of people who match, I need a match rate of slightly higher than 7.5 percent per locus.)
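The count above can be reproduced with a short script. It is a sketch under the post's own assumptions (a uniform 7.5 percent per-locus match rate, independent loci, unrelated people). The second calculation is not from the post: it is a standard binomial alternative that counts each pair once if it matches on 9 or more loci, which is one way to read the caveat in the parenthesis.

```python
from math import comb

p_locus = 0.075            # assumed uniform per-locus match rate (from the post)
n_people = 65_000
n_loci, n_match = 13, 9

pairs = comb(n_people, 2)              # distinct pairs of people
locus_sets = comb(n_loci, n_match)     # ways to choose 9 loci out of 13
searches = pairs * locus_sets

# The post's count: every (pair, set of 9 loci) match is counted separately,
# so a pair matching on 10 loci is counted more than once.
expected_set_matches = searches * p_locus ** n_match

# Binomial alternative: each pair is counted once if it matches on 9 or more loci.
p_pair = sum(
    comb(n_loci, k) * p_locus ** k * (1 - p_locus) ** (n_loci - k)
    for k in range(n_match, n_loci + 1)
)
expected_pairs = pairs * p_pair

print(f"{pairs:,} pairs x {locus_sets} locus sets = {searches:,} searches")
print(f"expected (pair, locus set) matches: {expected_set_matches:.0f}")
print(f"expected distinct pairs matching on 9+ loci: {expected_pairs:.0f}")
```

Without rounding, the number of searches is closer to 1.5 trillion than 1.4 trillion, which puts the first expectation slightly above 100. The distinct-pair count comes out somewhat lower, consistent with the remark that a per-locus rate a bit above 7.5 percent is needed to reach 100 different pairs.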

What I find interesting about this article and these calculations is that they show how the same set of basic statistical relationships can appear much more or less convincing depending on how they are portrayed. When we hear that there are 122 matches out of 65,000 people, it seems like DNA fingerprinting is not nearly as good as we think, but that is largely because we aren’t thinking about the fact that 65,000 people imply 2 billion pairs of people.

Note, however, that if we start with DNA from a crime scene and then go search the Arizona database for matches, we aren’t doing 2 billion searches, we are doing “only” 46 million (65,000 people times 715 different combos of 9 loci), so we will have a false positive rate of “only” 1 in 279.
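The single-sample case works the same way, just with far fewer comparisons. The sketch below assumes the same 1-in-13-billion chance per set of 9 loci used earlier in the post. It also shows the result with the unrounded value of 0.075 to the 9th power, which is an addition for comparison only.

```python
from math import comb

n_people = 65_000
locus_sets = comb(13, 9)            # 715 ways to choose 9 of 13 loci
searches = n_people * locus_sets    # one crime-scene sample against the database

p_rounded = 1 / 13e9                # the post's rounded per-search match chance
p_exact = 0.075 ** 9                # the same chance without rounding

# Expected coincidental hits, expressed as "1 in N" database searches
one_in_rounded = 1 / (searches * p_rounded)
one_in_exact = 1 / (searches * p_exact)

print(f"{searches:,} searches")
print(f"coincidental hit: about 1 in {one_in_rounded:.0f} (rounded inputs)")
print(f"coincidental hit: about 1 in {one_in_exact:.0f} (unrounded inputs)")
```

Either way the answer is in the high 200s, so the 1-in-279 figure depends only mildly on the rounding.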

The bottom line is that DNA testing is not perfect, but it is still a million (or maybe a thousand?) times better than anything else we have to catch criminals and (just as importantly, especially in Illinois) exonerate the innocent.

(Thanks to Dimitris Batzilis for cranking out these numbers.)


Justin

Did the "systematic search" of the database that revealed 122 matched pairs really perform all 1.4 trillion searches that Levitt calculates? If not (and that seems like a ton of searches), then we might still be worried here if a more simplified search is finding this many matches.

Patterico

"I think it's the reverse; because figures in billions have been bandied around when it now seems there's a 1 in 580 chance of a match (65000/112, is this right?) that's rather a surprise."

To the person who wrote the above:

Didja read the post you're commenting on?

It's not 122 out of 65,000.

It's 122 out of 1,400,000,000,000.

That's kinda different.

Sylvester

@36 It's not important that we base our statistics on a completely random sample anyway. If these statistics are used for a crime, for example, then the suspects could be limited to those in the vicinity of the crime (the city or the state). In that area there will be more relatives than in a random sampling of the world's population. In other words, the chance that a person in the vicinity of the crime has the same DNA as the sample DNA is necessarily higher than that of a person picked randomly from the world.

A Scientist

This blog post and the original article are very interesting, as they discuss a topic that I've wondered about for a long time (i.e. the probability of a random DNA match). As high-information content assays have developed, people have struggled with the concept of "false discovery rate" and "multiplicity correction."

I am surprised that these terms are not mentioned in the original article, Steven's post, or in any of the follow-up comments, as that is the core issue being raised here. If you search the web for these terms, you'll find they are often applied to gene/microarray data, where "false positive" matches occur simply because of the large number of "questions" you are asking of your data.

Self-professed psychics take advantage of the false discovery rate (FDR) phenomenon all the time. For example, suppose I tell a single individual that I am going to guess the month and date of their birth. Because an individual can be born on 1 of 366 such days (remember the leap year), I have a 1-in-366 probability of correctly guessing their birthdate. In this example, I have made 1 guess, so if I were to get the guess right, some people might believe me to be psychic. The skeptic would ask me to do it again. If I guess the birthdays of two randomly chosen people, then my odds of getting both right are 1-in-133,956 (i.e. 366*366).

However, suppose that I'm standing in front of a crowd of 5,000 people, and I state that I know there is someone in the audience born on one of the following dates: Jan 21, Feb 29. Then, I ask them to stand. Would you be surprised if two or more people stood up? If they did, would you consider me psychic? How many guesses did I make in this instance? Some people would say just two, when in reality, I made 10,000 guesses. The large number of guesses is not so obvious because I did it in a parallel way.
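To put numbers on the crowd example, here is a rough sketch. It keeps the simplification used above that all 366 calendar dates are equally likely, which overstates how common Feb 29 birthdays are, and the variable names are illustrative.

```python
crowd = 5_000
dates_named = 2
p_date = 1 / 366      # simplification: every calendar date equally likely

guesses = crowd * dates_named           # one implicit guess per person per date
p_stand = dates_named * p_date          # chance a given person was born on either date
expected_standing = crowd * p_stand

# Binomial probability that two or more people stand up
p_none = (1 - p_stand) ** crowd
p_one = crowd * p_stand * (1 - p_stand) ** (crowd - 1)
p_two_or_more = 1 - p_none - p_one

print(f"implicit guesses: {guesses:,}")
print(f"expected number standing: {expected_standing:.1f}")
print(f"chance that two or more stand: {p_two_or_more:.6f}")
```

With this simplification, a couple of dozen people would be expected to stand, so two or more standing is close to certain rather than remarkable.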

The bad news is that even smart scientists get fooled by this phenomenon more often than I like to admit. However, the good news is that there are mathematical approaches to deal with it.


Maths illiterate

"When we hear that there are 112 matches out of 65,000 people, it seems like DNA fingerprinting is not nearly as good as we think - but that is largely because we aren't thinking about the fact that 65,000 people imply 2 billion pairs of people"

I think it's the reverse; because figures in billions have been bandied around when it now seems there's a 1 in 580 chance of a match (65000/112, is this right?) that's rather a surprise. However, is matching on 9 loci enough to prove identity and convict? Or is it just useful for developing lines of enquiry for investigators?

samwyse

#28, the computer that I'm using right now runs at 2.8 GHz. That means that it can perform over one billion arithmetic operations per second. It seems reasonable that it could perform 1 million genetic comparisons per second. At that rate, 1.4 trillion pair-wise comparisons would take 16 days and 5 hours.
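The run-time estimate is simple enough to verify directly. The one-million-comparisons-per-second rate is the guess made above, not a measured figure.

```python
comparisons = 1_400_000_000_000    # 1.4 trillion, as in the post
rate_per_second = 1_000_000        # assumed throughput, not a benchmark

total_seconds = comparisons // rate_per_second
days, remainder = divmod(total_seconds, 86_400)   # 86,400 seconds per day
hours = remainder / 3_600

print(f"{days} days and {hours:.1f} hours")
```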

#24 and #26, your so-called "criminal class" covers anyone processed by the court system. This isn't just thieves, rapists and killers, it also includes white collar criminals (remember, Arizona was home to the S&L crisis), drunk drivers, and sex offenders (which in some states includes public urination). And don't forget suspects who were later released; do you think that their DNA records get thrown away? If you think there's a common genetic thread connecting Martha Stewart to Willie Horton, let's hear it. Me, I find it a bit doubtful.

Kevin

From a statistical view the math is not right, in the sense that a felon database is not random, while the statistical odds of a match assume a random database. A felon database has the requirement that you have to be a felon, and there are relatives in the database. So the report mixes felon databases with statistical match databases. The birthday problem hits it. We have known about the 'matches' in the felon databases for a long time. Obviously it isn't germane to really calculating the odds or it would have been front page news 8 years ago.

Platinum

#24, I was thinking the same thing. The sample was from the convicts and not very random. At the far extreme one could even think that this might show a genetic trend towards crime.

JJ

Ad Jim #25:

Again, as I wrote earlier: EXCLUSION, i.e. proving that the samples are NOT identical, can be performed with 100% certainty. Thus, it does not matter, whether none of the tested loci match or whether 999 out of 1000 match and one does not. The one mismatch would be enough to show that the samples are not of identical origin.
An exception would be paternity/maternity testing, since de-novo germline mutations, especially in microsatellites affecting repeat length, occur at a finite frequency (very rough ballpark 1 in 1000). There, a single mismatch would not be enough to exclude a line of descent. However, somatic DNA (i.e. body fluids or tissue like skin or saliva) left behind at a crime scene is not affected by this special case of clonal origin.
Thus, the innocence project rests on solid foundations. There is NO statistical ambiguity when it comes to exclusion (proof of non-identity of samples). The argument is very much like a mathematical proof. You only need to show a single counterexample to prove that a theorem is not correct, but even if you show a quadrillion cases in which the outcome matches the predictions of the theorem, this still does not constitute mathematical proof.


Dave

One of the solutions could be to require a match on 9 out of 9 loci, not 9 out of 13. That will decrease false positive matches significantly.

Jim

Out of curiosity, does anyone know how many loci are allowed to match when exonerating someone based on DNA evidence? We could play the margin of error game on that side as well.

#20 Ryan - no, Levitt is not assuming that the loci are interchangeable. Of the 13 loci, there are 715 ways to match any 9 (i.e., the first 9, the last 9, the first 8 plus the last, etc.).

#19 Kevin - you said:
"The 122 matches out of 65,000 doesn't reflect the likelihood of a specific sample having a match in the other 64,900 samples."
This is exactly right, and is the point that the last two paragraphs of the post address.

Bob

As has been previously stated, seldom is the DNA evidence the only evidence against a defendant. It is possible, although very unlikely, for a guilty individual to not provide a perfect match: somatic mutations may result in an individual having two or more DNA sequences, so that one sequence gets left at the crime scene and another is revealed in the blood test. Although never admitted by defense lawyers, sloppy lab work can sometimes miss a match. The reciprocal possibility is always promoted by the defense (witness the OJ Simpson trial), but bad lab work can result in both types of errors.

Paternity testing with DNA analysis is similar to its use in forensics but far less stringent. Society accepts assigning a non-father paternity responsibility more readily than convicting an innocent person of a crime.

John Lloyd Scharf

Obviously, even one failure to match of 13 should exclude someone as guilty. That is the MAIN benefit of DNA - to exclude suspects so you can concentrate on the others.

Byron

Unless I'm missing something, the FBI's numbers are still way off. You show that it's not unreasonable that these matches would exist if the odds are 1 in 13 billion, but according to the quoted article they estimated the odds "to be as remote as 1 in 113 billion." I'm not good enough with numbers to crunch how many matches that should yield, but a much lower total seems likely.
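For what it's worth, the question can be answered by plugging the quoted figure into the post's own method. This is a rough sketch that treats 1 in 113 billion as if it applied to every set of 9 loci, which is an assumption: the quoted estimate refers to the specific markers those two men shared, and real frequencies differ from locus to locus.

```python
from math import comb

# Same search count as the post's method: pairs of people x sets of 9 loci
searches = comb(65_000, 2) * comb(13, 9)

expected_at_13_billion = searches / 13e9     # the post's illustrative odds
expected_at_113_billion = searches / 113e9   # the odds quoted in the article

print(f"at 1 in 13 billion:  about {expected_at_13_billion:.0f} matches")
print(f"at 1 in 113 billion: about {expected_at_113_billion:.0f} matches")
```

On that reading, the more remote odds would still predict some matches across the whole database, but on the order of a dozen rather than a hundred.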

HardyW

This was all hashed out in the 80s by RC Lewontin (see eg http://www.sciencemag.org/cgi/content/abstract/254/5039/1745 ). Prosecutors calmed down a bit, then they took off again, even despite getting their heads handed to them by OJ's team, now they're off again claiming probabilities of uniqueness greater than the number of humans who have ever lived on the planet (setting aside that probability of successful id can never exceed probability of lab error). Prosecutors = innumerate

mmoore42

Regarding "But the mug shots of the two felons suggested that they were not related: One was black, the other white."

How is that relevant? Setting aside the fact that distinguishing race is difficult if not impossible from a photograph, they could still be related. I don't think that anyone is "purely" white or black in this country. I am certain that I have white cousins 3 generations back from the slavery era.

Dan

We don't need to prove guilt beyond a shadow of a doubt. We need to prove it beyond a reasonable doubt.

Kevin

The point that doesn't seem to be covered is that you are not looking for a potential match between any pair of samples; you are looking for a match with a specific sample. For example, in a room of 23 people there is a ~50% probability that (any) 2 people share a birthday, but only a ~6% chance that a specific person shares a birthday with someone else in the room.

The 122 matches out of 65,000 doesn't reflect the likelihood of a specific sample having a match in the other 64,900 samples.
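The two birthday figures can be checked with a short calculation. This sketch assumes 365 equally likely birthdays and ignores leap years.

```python
from math import prod

room = 23
days = 365   # assumption: 365 equally likely birthdays, no leap years

# Chance that at least one pair among the 23 people shares a birthday
p_all_distinct = prod((days - i) / days for i in range(room))
p_any_pair = 1 - p_all_distinct

# Chance that one specific person shares a birthday with any of the other 22
p_specific = 1 - ((days - 1) / days) ** (room - 1)

print(f"any two people share a birthday: {p_any_pair:.1%}")
print(f"a specific person has a match:   {p_specific:.1%}")
```

The gap between the two results is the same gap as between searching all pairs in a database and searching for one particular sample.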

jonathan

Great post, but I think it misses a point about the law. Imagine that you had DNA evidence that said x murdered the victim, but the witnesses say the perpetrator was white and x is black. You wouldn't get a conviction of x. Heck, you wouldn't indict the guy.

The point is that people - and people in the criminal process - assume DNA excludes everyone but the one guy, that it's absolute, when it isn't. DNA should be seen in a context of proof: you have witnesses or at least circumstantial evidence that ties the defendant to the scene or at least to the victim so the DNA evidence completes the proof to "beyond a reasonable doubt." If you have other evidence, then the sheer possibility that yes there's likely a match somewhere, maybe in this state, maybe in this county, isn't important. In other words, the real role of DNA is to exclude people and it's useful as a proof when it fails to exclude the defendant.

Even CSI ties the defendant to the crime scene or to the victims.


BlackPolitical Analysis

This makes a lot more sense. I kept thinking, "But, there are only 6.4 billion people on earth!" Thanks.
http://blackpoliticalanalysis.com