Attack of the Super Crunchers: Adventures in Data Mining

There’s been plenty written about the success that can be generated by an effective algorithm. Google and scores of other businesses thrive in large part because they are masters of the algorithmic mindset, gathering and analyzing data in ways previously thought impossible. As consumers, we’ve become accustomed to reaping the benefit of this revolution: using Google to find practically anything, heading to the discount airfare site that guarantees the “absolute lowest” rates, clicking through Amazon’s personal book recommendations.

Ian Ayres, Yale Law School professor, Forbes columnist, and data fanatic, has now written a book on data mining, Super Crunchers: Why Thinking-By-Numbers Is the New Way to Be Smart. (Full disclosure: Levitt is a friend and collaborator of Ayres, and he blurbed the book; Ayres also discusses Freakonomics and other research by Levitt in the book.)

Ayres writes about “a new breed of number crunchers … who have analyzed large datasets to discover empirical correlations between seemingly unrelated things.” These include hospitals predicting physician cleanliness based on infection rates and credit card companies examining a customer’s charge history to determine whether he or she will get divorced. Besides their usefulness in consumer transactions, Ayres argues, regression and data analysis can predict outcomes far better than we can, and have already had a huge impact on human behavior. Below are a few excerpts.

On determining the presence of racial discrimination in auto loan rates:

While most consumers now know that the sales price of a car can be negotiated, many do not know that auto lenders, such as Ford Motor Credit or GMAC, often give dealers the option of marking up a borrower’s interest rate. When a car buyer works with the dealer to arrange financing, the dealer normally sends the customer’s credit information to a potential lender. The lender then responds with a private message to the dealer that offers a “buy rate” — the interest rate at which the lender is willing to lend. Lenders will often pay a dealer — sometimes thousands of dollars — if the dealer can get the consumer to sign a loan with an inflated interest rate …

In a series of cases that I worked on, African-American borrowers challenged the lenders’ markup policies because they disproportionately harmed minorities. [Vanderbilt economist Mark] Cohen and I found that on average white borrowers paid what amounted to about a $300 markup on their loans, while black borrowers paid almost $700 in markup profits. Moreover, the distribution of markups was highly skewed. Over half of white borrowers paid no markup at all, because they qualified for loans where markups were not allowed. Yet 10 percent of GMAC borrowers paid more than $1,000 in markups and 10 percent of the Nissan customers paid more than a $1,600 markup. These high markup borrowers were disproportionately black. African-Americans were only 8.5 percent of GMAC borrowers, but paid 19.9 percent of the markup profits….

These studies were only possible because lenders now keep detailed electronic records of every transaction. The one variable they don’t keep track of is the borrower’s race. Once again, though, technology came to the rescue. Fourteen states … will, for a fee, make public the information from their driver’s license database — information that includes the name, race and Social Security number of the driver.
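
The mechanics are simple enough to sketch. Below is a toy version, in Python, of that merge-and-compare step: lender transaction records joined to a state driver's license extract, then markups compared across groups. Every record, column name, and dollar figure here is invented for illustration and is not the litigation data.

    # A minimal sketch of the merge-and-compare analysis described above.
    # All names, columns, and figures are hypothetical.
    import pandas as pd

    # Lender transaction records (no race field)
    loans = pd.DataFrame({
        "ssn":    ["111", "222", "333", "444"],
        "markup": [0.0, 1250.0, 300.0, 0.0],   # dealer markup in dollars
    })

    # State driver's license extract (includes race)
    dmv = pd.DataFrame({
        "ssn":  ["111", "222", "333", "444"],
        "race": ["white", "black", "white", "black"],
    })

    # Attach race to each loan, then compare markups across groups
    merged = loans.merge(dmv, on="ssn", how="inner")
    by_race = merged.groupby("race")["markup"].agg(["mean", "median"])
    share_of_markup = merged.groupby("race")["markup"].sum() / merged["markup"].sum()

    print(by_race)
    print(share_of_markup)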

On the campaign of Don Berwick, a pediatrician and president of the Institute for Healthcare Improvement, to change hospital practices to follow the results of data analysis (a topic that Dubner and Levitt addressed here):

In December 2004, [Berwick] brazenly announced a plan to save 100,000 lives over the next year and a half. The “100,000 Lives Campaign” challenged hospitals to implement six changes in care to prevent avoidable deaths. He wasn’t looking for subtle or sophisticated changes. He wasn’t calling for increased precision in surgical operations. No … he wanted hospitals to change some of their basic procedures. For example, a lot of people after surgery develop lung infections while they’re on ventilators. Randomized studies showed that simply elevating the head of the hospital bed and frequently cleaning the patient’s mouth substantially reduces the chance of infection. Again and again, Berwick simply looked at how people were actually dying and then tried to find out whether there was large-scale statistical evidence showing interventions that might reduce these particular risks ….

Berwick’s most surprising suggestion, however, is the one with the oldest pedigree. He noticed that thousands of ICU patients die each year from infections after a central line catheter is placed in their chests. About half of all intensive care patients have central line catheters, and ICU infections are deadly (carrying mortality rates of up to 20 percent). He then looked to see if there was any statistical evidence of ways to reduce the chance of infection. He found a 2004 article in Critical Care Medicine that showed that systematic hand-washing (combined with a bundle of improved hygienic procedures such as cleaning the patient’s skin with an antiseptic called chlorhexidine) could reduce the risk of infection from central-line catheters by more than 90 percent. Berwick estimated that if all hospitals just implemented this one bundle of procedures, they might be able to save as many as 25,000 lives per year.

On predicting the success of law review articles (measured in subsequent mentions from other articles):

As a law professor, my primary publishing job is to write law review articles. I don’t get paid for them, but a central measure of an article’s success is the number of times it has been cited by other professors. So with the help of a full-time number-crunching assistant named Fred Vars, I went out and analyzed what caused a law review article to be cited more or less. Fred and I collected citation information on all the articles published for fifteen years in the top three law reviews. Our central statistical formula had more than fifty variables. Like Epagogix [a group that created an algorithm intended to predict whether a movie will be successful based on characteristics of its script], Fred and I found that seemingly incongruous things mattered a lot. Articles with shorter titles and fewer footnotes were cited significantly more, whereas articles that included an equation or an appendix were cited a lot less. Longer articles were cited more, but the regression formula predicted that citations per page peak for articles that were a whopping fifty-three pages long….

Law review editors who want to maximize their citation rates should also avoid publishing criminal and labor law articles, and focus instead on constitutional law. And they should think about publishing more women. White women were cited 57 percent more often than white men, and minority women were cited more than twice as often.
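
Here is a toy version of the kind of regression Ayres and Vars describe, in Python, run on simulated data. The real study used more than fifty variables; the simulated coefficients below are tuned only so that the fitted quadratic peaks near the article length the excerpt mentions.

    # Sketch: citations-per-page regressed on article traits, with a quadratic
    # term in length so the fitted curve can peak at an intermediate length.
    # Data are simulated, not the actual law review dataset.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    pages     = rng.integers(10, 120, n)
    footnotes = rng.integers(50, 600, n)
    has_eq    = rng.integers(0, 2, n)

    # Simulated outcome: built so citations-per-page peaks near ~53 pages
    cites_per_page = (2.0 + 0.08 * pages - 0.00075 * pages**2
                      - 0.002 * footnotes - 0.3 * has_eq
                      + rng.normal(0, 0.5, n))

    X = sm.add_constant(np.column_stack([pages, pages**2, footnotes, has_eq]))
    fit = sm.OLS(cites_per_page, X).fit()

    b_pages, b_pages2 = fit.params[1], fit.params[2]
    peak_length = -b_pages / (2 * b_pages2)   # vertex of the fitted quadratic
    print(f"estimated peak citations-per-page at ~{peak_length:.0f} pages")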


Mario Ruiz

Hi Melissa,

One posting for me. You are nice.

I am in the technology industry. I remember when the weather could only be predicted a few hours ahead. I am not that old (45). As computers gained power, weather prediction expanded further into the future. Today, at weather.com, we can see the next 10 days for any town with a reasonable degree of certainty.

The basic premise is to have:

1. The computer power.
2. A theory behind it.
3. The tools to do it.

For example, the financial market does not have a theory behind it that explains its behavior. We certainly have the software and the hardware. The underlying algorithm that you talk about is the result of applying the theory, when one exists.

Not everything can be explained with number crunching. To tell you the truth, I hope we can never explain human behavior with data crunching.

Mario Ruiz
@ http://www.oursheet.com

Albert

Isn't all this just trivial?

I enjoyed Freakonomics a lot. But the deluge of me-too books is getting silly. I mean, shouldn't economics professors really be publishing books about how economies work? And shouldn't law professors really be publishing books about how laws work?

von humboldt fleischer

Off Topic...
Has anyone else noted the sudden profusion of advertisements for a hip hop album by Young Joc titled "Hustlenomics"? On the official website, the slogan reads, "Get Your Hustlenomics On", an overt allusion!

Daniel Cecil

Great, soon human behavior will be more predictable than computer software.

DrNova

Thanks for the informative blog. No doubt the "way of the algorithm" is justified when lives are saved through behavior modification in hospitals.

An unfortunate outcome in pragmatic America, however, will likely be an ADDICTION to this "way" as "the only way."

Especially disturbing is the thought that the "way of the algorithm" will lead to uniform articles in medical journals or law journals, based on the "algorithm" that will "get" the author the most fame, renown, attention, or other desirable appeal to human vanity.

I guess the coming wave of academic papers by students in top-tier universities will be churned out in the "way of the algorithm," dealing yet more death to spontaneous human genius, checked by human discipline, in delivery of the word.

Let's see the algorithm predicting how much less fruit the trees will bear, when addiction to algorithms determines performance.

The "new new thing" is always the "saving grace"--the same kind of "saving grace" that television was to General Sarnoff at RCA in the earliest days. There can be no doubt that the "way of the algorithm" will descend to its lucrative outcome as a tool for commercial and political propaganda, following in TV's footsteps.

Is this not the "way of all flesh?"

God bless The New York Times for this platform for free speech.

(ISAIAH 55, JOHN 21:17)

Blue Sun

One critical problem with attempting to use data-mining to build a real-world picture is making sure that your algorithm considers all of the relevant factors and places appropriate weighting on them.

In the lending example, did the data-miners factor in the borrowers' incomes, family size and stability, past loan history and general credit history, make of car, model and price of car, geographical region, or dozens of other factors that might have affected the decisions of the dealers and lenders?

I once read a study that found that a disproportionate percentage of trash incinerators were located in Black neighborhoods. For weeks, leaders in the Black community were expressing their outrage. When others checked the data, however, they found that after correlating by median neighborhood income, a poor White neighborhood was just as likely to have an incinerator as a poor Black one. It turned out that the deciding factor was not race but poverty and the resulting powerlessness against the local government.
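
A minimal illustration of that omitted-variable problem, using simulated data in Python (every number below is made up; siting here is driven by income alone, while race and income are correlated):

    # Regressing siting on race alone "finds" a race effect; adding the income
    # control shrinks it sharply, because income is the real driver here.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 2000
    black_share = rng.uniform(0, 1, n)                      # neighborhood racial composition
    income = 60 - 25 * black_share + rng.normal(0, 10, n)   # income correlated with race (simulated)
    incinerator = (income + rng.normal(0, 10, n) < 45).astype(float)  # siting depends on income only

    naive = sm.OLS(incinerator, sm.add_constant(black_share)).fit()
    controlled = sm.OLS(incinerator, sm.add_constant(np.column_stack([black_share, income]))).fit()

    print("race coefficient, no income control: ", round(naive.params[1], 3))
    print("race coefficient, with income control:", round(controlled.params[1], 3))  # much closer to zero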

We must always be careful not to read what we expect to see from incomplete correlations, or to confuse correlation with causation.

Bill Conerly

Also worth looking at is Competing on Analytics, http://www.amazon.com/Competing-Analytics-New-Science-Winning/dp/1422103323/ref=pd_bbs_sr_1/105-1572616-3315623?ie=UTF8&s=books&qid=1187891816&sr=8-1

Jarret

Interesting article. Also interesting comments. I think that using algorithms for such things as deciding which paper to publish really only works until the algorithm is commonly adopted. Once it becomes ubiquitous, it becomes analogous to any sort of arb trading when everyone uses the same model: efficacy vanishes. If every paper were predictably the same, those papers would bore people, and the predictive ability of the algorithm would be sharply reduced going forward. I guess it's called being human. Thank God.

ils vont...

As humans, I think we are constantly seeking truths to some degree, but a continually growing understanding of finite data could lead to truths that we as a culture might not like or might not be ready for. But maybe these things have to be exposed. Hopefully they do not provoke adverse reactions.

http://www.ilsvont.com

Paul

Mario,

Unfortunately, it looks like DARPA has filed a patent on predicting behaviour based on observations of past behaviour.

http://www.newscientist.com/blog/invention/2007/08/strategy-predicting-software.html

Susan parker

It seems to me that one of the best predictors of citations would be the ranking (prestige) of the journal in which the paper is published. By limiting the analysis to the top three journals (not sure what determines this), what might be the most powerful predictor cannot be tested.

Tim

I thought a .05% confidence means that only 1 in 20 correlations will be due to chance. If data mining looks at 100 variables with a .05% confidence, we'll probably get a few correlations just by chance, right?

Jim Cropcho

An interesting example of mining data via intersecting datasets currently in the news is the discovery that an individual's specific voter preferences can be determined using public records. Read more about it on C|NET or

http://www.ThePublicBallot.org

PS> I'm totally buying this book!

Raul

Wonderful. Why make statistical inferences about a population when you can get pretty darned close to a census? Humanity, however, still gets my vote of confidence for a remarkable ability to code garbage. Algorithms are merely an analogy for a 'real world' system, and an analogy (by definition) always breaks down. Not to mention our skills at including tacit assumptions and drawing the conclusion first, only later designing the query to support it. It's an exciting new capability in the human experience, though. We just need to be careful what we place our faith in.

In another vein, it's fascinating what can emerge as an apparently worthwhile endeavor with so many algorithms running about. We try our best to manage our credit scores. Web presentations are more valuable the higher their Google ranking. There are 'gold farmers.' What's a gold farmer, anyway? We do most anything to avoid getting 'dinged' in our online auctions. These things are relatively new to the human experience. And we're always probing the algorithms, trying to figure out how they really work and how they change over time. We sink a lot of time into interacting with the algorithm so that we get the desired result out the other end. It seems so contrived when you consider our basic needs. Maybe they can invent a computer that emits cheese puffs if the user increases their score. I'd buy one.

Ajay

Hi All,

As someone who has been using data mining for the past four years, in areas as diverse as who will buy more auto insurance, or credit cards, or how many sales of a particular product will happen, or how big India's GDP will be in the next ten years...

I can reasonably say that data mining is based on past behaviour and the premise that the future will repeat itself. Regression procedures focus on statistical significance while ignoring the effects of unnamed variables.

However, as you guys would have read in the Freakonomics book, correlation is not always causation,

and most data mining techniques simply fail to understand or incorporate the dynamics inherent in human behaviour and the consequent shifts. For example, if everyone started writing 53-page articles on constitutional law, the most-cited articles could shift to something else.

That's why you have a subprime mortgage crisis, presidents who win despite losing the popular vote, and terrorists who survive after we spend trillions, including billions in NSA funding.

Dynamic artificial intelligence/neural networks may be the way forward, but the computing power and mainstream software support they require are still some time away.

Once that arrives... we could probably predict which articles are more likely to be quoted on this blog :))

Michael

The higher citation rates for women, especially minority women, suggest that articles by women, especially minority women, have to be better than articles by men to get published, or that women writing on the law are more capable than their male counterparts (and is that due to barriers in employment that let in less capable men while barring less capable women?). Rich data point to many interesting things...

Dave

Is there an algorithm to predict the flood of lawsuits that will cite the physician's findings?

"Ladies and Gentlemen of the Jury, Uncle John would still be here if only the doctors and nurses had heeded his advice and WASHED THEIR HANDS!"

Ben

When a statistician hears the term data mining, it conjures up the image of someone trying to find data that proves a point he wants to make rather than observing the points made in the data. Data mining is a bad thing in research.

With regard to Tim's comment about a 0.05% confidence, I can only assume he meant "level of significance" rather than level of confidence. The level of confidence describes the width of the confidence interval that protects against a Type I error (rejecting a true null hypothesis). A significance level of 0.05% would be more like 5 out of 10,000, or 1 out of 2,000. Setting the significance level that low (or high, depending on your perspective) invites a Type II error (accepting a false null hypothesis).
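
To put Tim's scenario in concrete terms, assuming independent tests at the conventional 5% level (not 0.05%), a quick sketch in Python:

    # Testing many variables at a 5% significance level practically guarantees
    # some "significant" correlations that are due to chance alone.
    alpha = 0.05     # conventional 5% significance level
    k = 100          # number of variables tested

    expected_false_positives = alpha * k
    prob_at_least_one = 1 - (1 - alpha) ** k   # assumes independent tests

    print(f"expected false positives: {expected_false_positives:.0f}")       # 5
    print(f"P(at least one false positive): {prob_at_least_one:.3f}")        # ~0.994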

The important point in the comments so far is that these are probabilistic interpretations of the data, and if you start with a poor theory behind the model you can only come up with a valid result quite by accident.

Catherine Alberte

Funny that all the comments ignored the way data mining was used to identify a solvable problem that can save 100,000 lives a year. This is not theory: I recently attended a conference where a major NYC hospital told us how they reduced their central line infections to almost none by implementing the changes that the article in Critical Care Medicine called for.

Jeff Williams

Did anyone else take issue with Ayres' claim on Page 203 that a poll where one candidate is leading 51% to 49% with a 2% margin of error was NOT a statistical dead heat? I was troubled by this claim, and by the fact that he seemed to gloss over what the margin of error means: that there is uncertainty in the poll results relative to the "true" probability.
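
One back-of-the-envelope way to frame the question, assuming a simple random sample, a two-candidate race, and a 95% margin of error (a sketch, not Ayres' own calculation):

    # Given a 51-49 poll with a +/-2% margin of error, how likely is it that
    # the leader is actually ahead?
    from math import erf, sqrt

    lead_share = 0.51
    moe = 0.02                          # 95% margin of error on each candidate's share
    se_share = moe / 1.96               # implied standard error of one share
    se_diff = 2 * se_share              # the lead is 2p - 1, so its SE is doubled
    z = (2 * lead_share - 1) / se_diff  # observed lead in standard errors

    prob_leader_ahead = 0.5 * (1 + erf(z / sqrt(2)))   # normal CDF
    print(f"P(leader is truly ahead) ~ {prob_leader_ahead:.2f}")   # roughly 0.84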