Attack of the Super Crunchers: Adventures in Data Mining

There’s been plenty written about the success that can be generated by an effective algorithm. Google and scores of other businesses thrive in large part because they are masters of the algorithmic mindset, gathering and analyzing data in ways previously thought impossible. As consumers, we’ve become accustomed to reaping the benefit of this revolution: using Google to find practically anything, heading to the discount airfare site that guarantees the “absolute lowest” rates, clicking through Amazon’s personal book recommendations.

Ian Ayres, Yale Law School professor, Forbes columnist, and data fanatic, has now written a book on data mining, Super Crunchers: Why Thinking-By-Numbers Is the New Way to Be Smart. (Full disclosure: Levitt is a friend and collaborator of Ayres, and he blurbed the book; Ayres also discusses Freakonomics and other research by Levitt in the book.)

Ayres writes about “a new breed of number crunchers … who have analyzed large datasets to discover empirical correlations between seemingly unrelated things.” These include hospitals predicting physician cleanliness based on infection rates and credit card companies examining a customer’s charge history to determine whether he or she will get divorced. Besides their usefulness in consumer transactions, Ayres argues, regression and data analysis can predict outcomes far better than we can, and have already had a huge impact on human behavior. Below are a few excerpts.

On determining the presence of racial discrimination in auto loan rates:

While most consumers now know that the sales price of a car can be negotiated, many do not know that auto lenders, such as Ford Motor Credit or GMAC, often give dealers the option of marking up a borrower’s interest rate. When a car buyer works with the dealer to arrange financing, the dealer normally sends the customer’s credit information to a potential lender. The lender then responds with a private message to the dealer that offers a “buy rate” — the interest rate at which the lender is willing to lend. Lenders will often pay a dealer — sometimes thousands of dollars — if the dealer can get the consumer to sign a loan with an inflated interest rate …

In a series of cases that I worked on, African-American borrowers challenged the lenders’ markup policies because they disproportionately harmed minorities. [Vanderbilt economist Mark] Cohen and I found that on average white borrowers paid what amounted to about a $300 markup on their loans, while black borrowers paid almost $700 in markup profits. Moreover, the distribution of markups was highly skewed. Over half of white borrowers paid no markup at all, because they qualified for loans where markups were not allowed. Yet 10 percent of GMAC borrowers paid more than $1,000 in markups and 10 percent of the Nissan customers paid more than a $1,600 markup. These high markup borrowers were disproportionately black. African-Americans were only 8.5 percent of GMAC borrowers, but paid 19.9 percent of the markup profits….

These studies were only possible because lenders now keep detailed electronic records of every transaction. The one variable they don’t keep track of is the borrower’s race. Once again, though, technology came to the rescue. Fourteen states … will, for a fee, make public the information from their driver’s license database — information that includes the name, race and Social Security number of the driver.
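
The mechanics are simple enough to sketch. Below is a toy version, in Python, of that merge-and-compare step: lender transaction records joined to a state driver's license extract, then markups compared across groups. Every record, column name, and dollar figure here is invented for illustration and is not the litigation data.

    # A minimal sketch of the merge-and-compare analysis described above.
    # All names, columns, and figures are hypothetical.
    import pandas as pd

    # Lender transaction records (no race field)
    loans = pd.DataFrame({
        "ssn":    ["111", "222", "333", "444"],
        "markup": [0.0, 1250.0, 300.0, 0.0],   # dealer markup in dollars
    })

    # State driver's license extract (includes race)
    dmv = pd.DataFrame({
        "ssn":  ["111", "222", "333", "444"],
        "race": ["white", "black", "white", "black"],
    })

    # Attach race to each loan, then compare markups across groups
    merged = loans.merge(dmv, on="ssn", how="inner")
    by_race = merged.groupby("race")["markup"].agg(["mean", "median"])
    share_of_markup = merged.groupby("race")["markup"].sum() / merged["markup"].sum()

    print(by_race)
    print(share_of_markup)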

On the campaign of Don Berwick, a pediatrician and president of the Institute for Healthcare Improvement, to change hospital practices to follow the results of data analysis (a topic that Dubner and Levitt addressed here):

In December 2004, [Berwick] brazenly announced a plan to save 100,000 lives over the next year and a half. The “100,000 Lives Campaign” challenged hospitals to implement six changes in care to prevent avoidable deaths. He wasn’t looking for subtle or sophisticated changes. He wasn’t calling for increased precision in surgical operations. No … he wanted hospitals to change some of their basic procedures. For example, a lot of people after surgery develop lung infections while they’re on ventilators. Randomized studies showed that simply elevating the head of the hospital bed and frequently cleaning the patient’s mouth substantially reduces the chance of infection. Again and again, Berwick simply looked at how people were actually dying and then tried to find out whether there was large-scale statistical evidence showing interventions that might reduce these particular risks ….

Berwick’s most surprising suggestion, however, is the one with the oldest pedigree. He noticed that thousands of ICU patients die each year from infections after a central line catheter is placed in their chests. About half of all intensive care patients have central line catheters, and ICU infections are deadly (carrying mortality rates of up to 20 percent). He then looked to see if there was any statistical evidence of ways to reduce the chance of infection. He found a 2004 article in Critical Care Medicine that showed that systematic hand-washing (combined with a bundle of improved hygienic procedures such as cleaning the patient’s skin with an antiseptic called chlorhexidine) could reduce the risk of infection from central-line catheters by more than 90 percent. Berwick estimated that if all hospitals just implemented this one bundle of procedures, they might be able to save as many as 25,000 lives per year.

On predicting the success of law review articles (measured in subsequent mentions from other articles):

As a law professor, my primary publishing job is to write law review articles. I don’t get paid for them, but a central measure of an article’s success is the number of times it has been cited by other professors. So with the help of a full-time number-crunching assistant named Fred Vars, I went out and analyzed what caused a law review article to be cited more or less. Fred and I collected citation information on all the articles published for fifteen years in the top three law reviews. Our central statistical formula had more than fifty variables. Like Epagogix [a group that created an algorithm intended to predict whether a movie will be successful based on characteristics of its script], Fred and I found that seemingly incongruous things mattered a lot. Articles with shorter titles and fewer footnotes were cited significantly more, whereas articles that included an equation or an appendix were cited a lot less. Longer articles were cited more, but the regression formula predicted that citations per page peak for articles that were a whopping fifty-three pages long….

Law review editors who want to maximize their citation rates should also avoid publishing criminal and labor law articles, and focus instead on constitutional law. And they should think about publishing more women. White women were cited 57 percent more often than white men, and minority women were cited more than twice as often.
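
Here is a toy version of the kind of regression Ayres and Vars describe, in Python, run on simulated data. The real study used more than fifty variables; the simulated coefficients below are tuned only so that the fitted quadratic peaks near the article length the excerpt mentions.

    # Sketch: citations-per-page regressed on article traits, with a quadratic
    # term in length so the fitted curve can peak at an intermediate length.
    # Data are simulated, not the actual law review dataset.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    pages     = rng.integers(10, 120, n)
    footnotes = rng.integers(50, 600, n)
    has_eq    = rng.integers(0, 2, n)

    # Simulated outcome: built so citations-per-page peaks near ~53 pages
    cites_per_page = (2.0 + 0.08 * pages - 0.00075 * pages**2
                      - 0.002 * footnotes - 0.3 * has_eq
                      + rng.normal(0, 0.5, n))

    X = sm.add_constant(np.column_stack([pages, pages**2, footnotes, has_eq]))
    fit = sm.OLS(cites_per_page, X).fit()

    b_pages, b_pages2 = fit.params[1], fit.params[2]
    peak_length = -b_pages / (2 * b_pages2)   # vertex of the fitted quadratic
    print(f"estimated peak citations-per-page at ~{peak_length:.0f} pages")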


Mario Ruiz

Hi Melissa,

One posting for me. You are nice.

I am in the technology industry. I remember when the weather could only be predicted a few hours ahead. I am not that old (45). As computers gained power, weather prediction expanded further into the future. Today, at weather.com, we can see the next 10 days for any town with a reasonable degree of certainty.

The basic premise is to have:

1. The computer power.
2. A theory behind it.
3. The tools to do it.

For example, the financial market does not have a theory behind it that explains its behavior. We certainly have the software and the hardware. The underlying algorithm that you talk about is the result of applying the theory, when one exists.

Not everything can be explained with number crunching. To tell you the truth, I hope we can never explain human behavior with data crunching.

Mario Ruiz
@ http://www.oursheet.com

Albert

Isn't all this just trivial?

I enjoyed Freakonomics a lot. But the deluge of me-too books is getting silly. I mean, shouldn't economics professors really be publishing books about how economies work? And shouldn't law professors really be publishing books about how laws work?

von humboldt fleischer

Off Topic...
Has anyone else noted the sudden profusion of advertisements for a hip hop album by Young Joc titled "Hustlenomics"? On the official website, the slogan reads, "Get Your Hustlenomics On", an overt allusion!

Daniel Cecil

Great, soon human behavior will be more predictable than computer software.

DrNova

Thanks for the informative blog. No doubt the "way of the algorithm" is justified when lives are saved through behavior modification in hospitals.

An unfortunate outcome in pragmatic America, however, will likely be an ADDICTION to this "way" as "the only way."

Especially disturbing is the thought that the "way of the algorithm" will lead to uniform articles in medical journals or law journals, based on the "algorithm" that will "get" the author the most fame, renown, attention, or other desirable appeal to human vanity.

I guess the coming wave of academic papers by students in top-tier universities will be churned out in the "way of the algorithm," dealing yet more death to spontaneous human genius, checked by human discipline, in delivery of the word.

Let's see the algorithm predicting how much less fruit the trees will bear, when addiction to algorithms determines performance.

The "new new thing" is always the "saving grace"--the same kind of "saving grace" that television was to General Sarnoff at RCA in the earliest days. There can be no doubt that the "way of the algorithm" will descend to its lucrative outcome as a tool for commercial and political propaganda, following in TV's footsteps.

Is this not the "way of all flesh?"

God bless The New York Times for this platform for free speech.

(ISAIAH 55, JOHN 21:17)

Blue Sun

One critical problem with attempting to use data-mining to build a real-world picture is making sure that your algorithm considers all of the relevant factors and places appropriate weighting on them.

In the lending example, did the data-miners factor in the borrowers' incomes, family size and stability, past loan history and general credit history, make of car, model and price of car, geographical region, or dozens of other factors that might have affected the decisions of the dealers and lenders?

I once read a study that found that a disproportionate percentage of trash incinerators were located in Black neighborhoods. For weeks, leaders in the Black community were expressing their outrage. When others checked the data, however, they found that after correlating by median neighborhood income, a poor White neighborhood was just as likely to have an incinerator as a poor Black one. It turned out that the deciding factor was not race but poverty and the resulting powerlessness against the local government.
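
A minimal illustration of that omitted-variable problem, using simulated data in Python (every number below is made up; siting here is driven by income alone, while race and income are correlated):

    # Regressing siting on race alone "finds" a race effect; adding the income
    # control shrinks it sharply, because income is the real driver here.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 2000
    black_share = rng.uniform(0, 1, n)                      # neighborhood racial composition
    income = 60 - 25 * black_share + rng.normal(0, 10, n)   # income correlated with race (simulated)
    incinerator = (income + rng.normal(0, 10, n) < 45).astype(float)  # siting depends on income only

    naive = sm.OLS(incinerator, sm.add_constant(black_share)).fit()
    controlled = sm.OLS(incinerator, sm.add_constant(np.column_stack([black_share, income]))).fit()

    print("race coefficient, no income control: ", round(naive.params[1], 3))
    print("race coefficient, with income control:", round(controlled.params[1], 3))  # much closer to zero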

We must always be careful not to read what we expect to see from incomplete correlations, or to confuse correlation with causation.

Bill Conerly

Also worth looking at is Competing on Analytics, http://www.amazon.com/Competing-Analytics-New-Science-Winning/dp/1422103323/ref=pd_bbs_sr_1/105-1572616-3315623?ie=UTF8&s=books&qid=1187891816&sr=8-1

Jarret

Interesting article. Also interesting comments. I think that using algorithms for such things as deciding which paper to publish really only works until the algorithm is commonly adopted. Once it becomes ubiquitous, it becomes analogous to any sort of arb trading when everyone uses the same model: efficacy vanishes. If every paper were predictably the same, those papers would bore people, and the predictive ability of the algorithm would be sharply reduced going forward. I guess it's called being human. Thank God.

ils vont...

As humans, I think we are constantly seeking truths to some degree, but a continually growing understanding of finite data could lead to truths that we as a culture might not like or might not be ready for. But maybe these things have to be exposed. Hopefully they do not provoke adverse reactions.

http://www.ilsvont.com

Paul

Mario,

Unfortunately, it looks like DARPA has filed a patent on predicting behaviour based on observations of past behaviour.

http://www.newscientist.com/blog/invention/2007/08/strategy-predicting-software.html

Susan parker

It seems to me that one of the best predictors of citations would be the ranking (prestige) of the journal in which the paper is published. By limiting the analysis to the top three journals (not sure what determines this), what might be the most powerful predictor cannot be tested.

Tim

I thought a .05% confidence means that only 1 in 20 correlations will be due to chance. If data mining looks at 100 variables with a .05% confidence, we'll probably get a few correlations just by chance, right?

Jim Cropcho

An interesting example of mining data via intersecting datasets currently in the news is the discovery that an individual's specific voter preferences can be determined using public records. Read more about it on C|NET or

http://www.ThePublicBallot.org

PS> I'm totally buying this book!

Raul

Wonderful. Why make statistical inferences about a population when you can get pretty darned close to a census? Humanity, however, still gets my vote of confidence for a remarkable ability to code garbage. Algorithms are merely an analogy for a 'real world' system, and an analogy (by definition) always breaks down. Not to mention our skills at including tacit assumptions and drawing the conclusion first, only later designing the query to support it. It's an exciting new capability in the human experience, though. We just need to be careful what we place our faith in.

In another vein, it's fascinating what can emerge as an apparently worthwhile endeavor with so many algorithms running about. We try our best to manage our credit scores. Web presentations are more valuable the higher their Google ranking. There are 'gold farmers.' What's a gold farmer, anyway? We do most anything to avoid getting 'dinged' in our online auctions. These things are relatively new to the human experience. And we're always probing the algorithms, trying to figure out how they really work and how they change over time. We sink a lot of time into interacting with the algorithm so that we get the desired result out the other end. It seems so contrived when you consider our basic needs. Maybe they can invent a computer that emits cheese puffs if the user increases their score. I'd buy one.

Ajay

Hi All,

As someone who has been using data mining for the past four years, in areas as diverse as who will buy more auto insurance, or credit cards, or how many sales of a particular product will happen, or how big India's GDP will be in the next ten years...

I can reasonably say that data mining is based on past behaviour and the premise that the future will repeat itself. Regression procedures focus on statistical significance while ignoring the effects of unnamed variables.

However, as you guys would have read in the Freakonomics book, correlation is not always causation,

and most data mining techniques simply fail to understand or incorporate the dynamics inherent in human behaviour and the consequent shifts. For example, if everyone started writing 53-page articles on constitutional law, the most-cited articles could shift to something else.

That's why you have a subprime mortgage crisis, presidents who win despite losing the popular vote, and terrorists who survive after we spend trillions, including billions in NSA funding.

Dynamic artificial intelligence/neural networks may be the way forward, but the computing power and mainstream software support they require are still some time away.

Once that arrives... we could probably predict which articles are more likely to be quoted on this blog :))

Michael

The higher citation rates for women, especially minority women, suggest that articles by women, especially minority women, have to be better than articles by men to get published, or that women writing on the law are more capable than their male counterparts (and is that due to barriers in employment that let in less capable men while barring less capable women?). Rich data point to many interesting things...

Dave

Is there an algorithm to predict the flood of lawsuits that will cite the physician's findings?

"Ladies and Gentlemen of the Jury, Uncle John would still be here if only the doctors and nurses had heeded his advice and WASHED THEIR HANDS!"

Ben

When a statistician hears the term data mining, it conjures up the image of someone trying to find data that proves a point he wants to make rather than observing the points made in the data. Data mining is a bad thing in research.

With regard to Tim's comment about a 0.05% confidence, I can only assume he meant "level of significance" rather than level of confidence. The level of confidence describes the width of the confidence interval that protects against a Type I error (rejecting a true null hypothesis). A significance level of 0.05% would be more like 5 out of 10,000, or 1 out of 2,000. Setting the significance level that low (or high, depending on your perspective) invites a Type II error (accepting a false null hypothesis).
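
To put Tim's scenario in concrete terms, assuming independent tests at the conventional 5% level (not 0.05%), a quick sketch in Python:

    # Testing many variables at a 5% significance level practically guarantees
    # some "significant" correlations that are due to chance alone.
    alpha = 0.05     # conventional 5% significance level
    k = 100          # number of variables tested

    expected_false_positives = alpha * k
    prob_at_least_one = 1 - (1 - alpha) ** k   # assumes independent tests

    print(f"expected false positives: {expected_false_positives:.0f}")       # 5
    print(f"P(at least one false positive): {prob_at_least_one:.3f}")        # ~0.994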

The important point in the comments so far is that these are probabilistic interpretations of the data, and if you start with a poor theory behind the model you can only come up with a valid result quite by accident.

Catherine Alberte

Funny that all the comments ignored the way data mining was used to identify a solvable problem that can save 100,000 lives a year. This is not theory: I recently attended a conference where a major NYC hospital told us how they reduced their central line infections to almost none by implementing the changes that the article in Critical Care Medicine called for.

Jeff Williams

Did anyone else take issue with Ayres' claim on Page 203 that a poll where one candidate is leading 51% to 49% with a 2% margin of error was NOT a statistical dead heat? I was troubled by this claim, and by the fact that he seemed to gloss over what the margin of error means: that there is uncertainty in the poll results relative to the "true" probability.
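
One back-of-the-envelope way to frame the question, assuming a simple random sample, a two-candidate race, and a 95% margin of error (a sketch, not Ayres' own calculation):

    # Given a 51-49 poll with a +/-2% margin of error, how likely is it that
    # the leader is actually ahead?
    from math import erf, sqrt

    lead_share = 0.51
    moe = 0.02                          # 95% margin of error on each candidate's share
    se_share = moe / 1.96               # implied standard error of one share
    se_diff = 2 * se_share              # the lead is 2p - 1, so its SE is doubled
    z = (2 * lead_share - 1) / se_diff  # observed lead in standard errors

    prob_leader_ahead = 0.5 * (1 + erf(z / sqrt(2)))   # normal CDF
    print(f"P(leader is truly ahead) ~ {prob_leader_ahead:.2f}")   # roughly 0.84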