Free Super-Crunching Software

I probably have an unhealthy attraction to the powers of Excel. I taught my daughter how to use it when she was 7. When I teach corporate finance, I try to make sure that my law students come away from the course knowing how to crunch in Excel.

It would be embarrassing to teach students how to use Microsoft Word in a law-school course; but one of the goals of my corporate finance course is to make sure that they can comfortably manipulate its numerical cousin.

A middle-school math teacher recently told me that there are some things you can do on a graphing calculator that you just can’t do in Excel. I’m pretty sure (like 99 percent sure) that this is not true. In fact, Microsoft has expanded the functionality of Excel so that it’s starting to invade the domain of statistical packages.

The just-published (shameless plug) paperback edition of Super Crunchers has a new chapter that describes several different free tools that make it easier and easier to crunch numbers.

1. Microsoft has a new data-mining add-in that lets you run all kinds of cool statistical procedures inside Excel. Taking a page from the Google playbook, Microsoft is just giving this add-in away (but it only works if you’ve purchased the Office 2007 version of Excel).

2. Google (taking a page from its own playbook) is giving away its Website Optimizer, which will let you run randomized experiments on your own web page.

Any webmaster who is not running randomized trials on different page content is making a serious mistake.

Here’s an explanatory video. I’ve used the Website Optimizer myself and it is a joy to use.

3. I’ve created and assembled links to a bunch of cool “prediction tools” that let you plug in a few numbers and predict how long you’ll live, predict your due date (if you’re pregnant), rate the quality of a book title, or even predict political or sporting contests.

One of the cool things about these tools is that they provide feedback on the precision of predictions that is easy to digest. When you see the results of an experiment like this one below, you have a pretty clear idea of not only the winner, but of how confident you should be in the results.

[Screenshot of experiment results from Google's Website Optimizer]

(As with all other statistical tests, you should not just blindly accept the p-values in the printout, but these graphics are still a huge leap forward.)
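For readers who want to see what sits behind a report like that, here is a minimal sketch of one common way to compare two page versions: a two-proportion z-test plus a confidence interval for the difference in conversion rates. This is an illustration under stated assumptions, not a description of how Website Optimizer computes its results; the function names and the visitor counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both pages convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Approximate 95% interval for (rate B - rate A), unpooled standard error."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


# Hypothetical counts: page A converts 100 of 1,000 visitors, page B 130 of 1,000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
low, high = diff_confidence_interval(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}, 95% CI for lift = ({low:.3f}, {high:.3f})")
```

The interval is arguably the more useful output: it shows not only which page is ahead but how large or small the true lift could plausibly be.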

A fourth freebie is the open-source statistical package called “R.” While most members of the Freakonomics crowd tend to use Stata as their statistical package of choice (and businesses tend to run SAS or SPSS), R is the Linux of statistical software. It lets you do an awful lot for free.

Of course, having mastered the commands of Stata and SAS, I have poor incentives to learn the commands of a new (GNU) software package. And R is probably not kept up to speed on the cutting-edge empirical methods as quickly as the traditional packages. (I should disclose that SPSS and SAS have paid me handsomely to give Super Crunching talks, so I may not be the most objective observer.)

But then again, R has plenty of power to run the vast majority of statistical techniques. There is still a huge discrepancy between the techniques that are used by academics and those used in business.

In fact, here’s a Super Crunchers bleg: Can anyone identify an instance where a business has run an instrumental-variables regression?

The I.V. approach has been around for decades and is a standard (if misused) technique in hundreds, if not thousands, of academic articles. But provocatively, I’d almost bet that it has never yet been used by a corporation to help make a business decision. We’ll send some Freakonomics schwag to the first person who can prove me wrong.
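For anyone who has not met the technique, here is a rough sketch of an instrumental-variables estimate computed by two-stage least squares, using only the Python standard library. Everything in it is an assumption made for illustration: the data are simulated, the true effect is set to 2, and the function names are hypothetical. It covers the simplest case of one endogenous regressor and one instrument.

```python
import random


def mean(values):
    return sum(values) / len(values)


def ols_slope_intercept(x, y):
    """Simple one-regressor OLS of y on x; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def two_stage_least_squares(y, x, z):
    """IV estimate of the effect of x on y, using z as the instrument."""
    # Stage 1: regress the endogenous regressor x on the instrument z.
    b1, a1 = ols_slope_intercept(z, x)
    x_hat = [a1 + b1 * zi for zi in z]
    # Stage 2: regress the outcome y on the fitted values from stage 1.
    slope, _ = ols_slope_intercept(x_hat, y)
    return slope


def simulate(n=20000, true_effect=2.0, seed=0):
    """Simulated data in which an unobserved confounder biases plain OLS."""
    rng = random.Random(seed)
    y, x, z = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)    # unobserved confounder, affects both x and y
        zi = rng.gauss(0, 1)   # instrument: moves x, unrelated to u
        xi = zi + u + rng.gauss(0, 1)
        yi = true_effect * xi + 3.0 * u + rng.gauss(0, 1)
        y.append(yi)
        x.append(xi)
        z.append(zi)
    return y, x, z


y, x, z = simulate()
naive_estimate, _ = ols_slope_intercept(x, y)
iv_estimate = two_stage_least_squares(y, x, z)
print(f"naive OLS: {naive_estimate:.2f}, IV (2SLS): {iv_estimate:.2f}")
```

In this setup the naive regression overstates the effect because the confounder pushes x and y in the same direction, while the IV estimate should land near the true value. One caveat: the standard errors from a hand-run second stage are not valid as reported, so real work should rely on a package's dedicated IV routine.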


Tal Galili

Thank you for this post, especially for the comment.

As a statistics student, I have been looking for a good account of R's place in the current market, and it seems the people responding here have shared a solid perspective on the subject.

Thanks!

Tal.

Stephan Kolassa

Excel is actually very unreliable when it comes to statistics (unless you install add-ons). Bruce D. McCullough has written extensively on this in the American Statistician (1998/1999), and no - things haven't gotten better: see his paper in Foresight (2006). And I wouldn't bet on Excel 2007 to have solved all these problems.

Stephan Kolassa

"And R is probably not kept up to speed on the cutting-edge empirical methods as quickly as the traditional packages."

People proposing new statistical techniques have an interest in helping other people use those methods (so the authors get cited). Consequently, most authors write an R package implementing the technique at the same time as publishing the method, which is really easy in R. This is why SAS et al. are way behind.

Tobias Verbeke

R is in use in heavily regulated industries. Companies that work out a validation process for R (such as in big pharma) use the following official document of the R Foundation as a basis for the vendor assessment part of the exercise.

http://www.r-project.org/doc/R-FDA.pdf

For QA, by the way, R has concepts and tools that leave many other statistical packages behind. After a demonstration, one client recently exclaimed, 'this is even better than our internal system'...

Have you ever heard that? :-)

Christian Robert, Universite Paris Dauphine

"R is probably not kept up to speed on the cutting-edge empirical methods as quickly as the traditional packages"..! As already pointed out by several comments, this does not make sense. R is used in most statistical courses nowadays, so it is a "traditional package" (this is the first language we teach our students in Paris Dauphine), and academic statisticians increasingly write R code in conjunction with their papers, as I can see for myself in the Journal of the Royal Statistical Society. Plus, R being open-source, the incorporation of new statistical techniques happens seamlessly, as opposed to "traditional packages," which need to wait for upgrades and new releases. SAS is certainly way behind R in this respect.

Ajay

A good book for SAS users who want to learn R (and one that should help people like Ian especially) is R for SAS and SPSS Users, at http://www.rforsasandspssusers.com/

It's especially aimed at people who are moving over to R, or thinking about it, but have never had the time to do so.

Ajay

Patrick Burns

I think you have it upside down regarding cutting edge methods. A good portion of new statistical methods are created in R. The statistical packages have to try to keep up with R, not the other way around.

Dylan Beaudette

Quoth the author:

"And R is probably not kept up to speed on the cutting-edge empirical methods as quickly as the traditional packages."

I would contend that the opposite is more likely. Some of the leading minds in statistics and numerical methods are the main contributors to the R software and documentation. The traffic on R-help mailing list supports this statement on a daily basis.

Ironman

Ian - thank you for linking via your prediction tools page again! Political Calculations is seeing a spike in site traffic today. Given today's stock market carnage, our immediate reaction was that the market declines were driving the spike (given our analytical focus on the worst the stock market has ever done and some of our forecasting of market distress), but we're pleased that it's mainly being driven by people wanting to know if their marriages will last instead.

What can we say? It put a smile on our faces too! Best wishes....

shaman

I work in a market research consultancy and we use SAS, SPSS, Statistica, Q, Latent Gold as well as Excel, and have used any number of packages in our past.

I can attest to having used IV regression, as part of two-stage least squares, in loyalty/retention and satisfaction models. The extent to which the results of those models were used by the end-client varied, of course, as the client didn't always like what they were told!

Scott

Add another comment for R being much more cutting-edge. While I don't have the link in front of me, I have previously used an article that estimated SAS/SPSS and other commercial packages to be 10 years behind in adopting new methods, as a way to talk down clients who thought I might not be using the best tools. In business situations, though, you don't often need the new algorithms; the tried-and-true techniques typically do the trick.

The one case where I find SAS or Matlab more useful is with huge datasets (many GB): R's default behavior is to load full datasets into memory, and I have found it tedious to write code to get around this (depending on what you are trying to implement).

Carlos López

In Mexico, most of the people at the UNAM doing research in statistics use R as a working tool, mainly because buying the commercial stuff is so expensive, and because R gives you flexibility that no commercial software can provide, which, by the way, is the definition of "cutting-edge".

Yorick

Considering the dizzying speed of the R release cycle, it's more likely to be SAS and SPSS that lag in terms of cutting-edge empirical methods.

Dave

R has gained a lot of steam lately in private industry. Good evidence for this is the recent announcement by Spotfire that it is integrating R and SPLUS into its visual analytics platform.

Jason

To paraphrase Greenspun's 10th rule of programming:

Any sufficiently complicated Excel worksheet contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

Chewxy

I've raised this question to my math and stat friends (who also happen to be champions of free software) - why use R when EViews or Stata gives you beautiful results with a click of a button? (yes, R is free, but I value my time more :D)

As for IV regression, my startup (pressyo.com, launching October 10th) uses it in the business process. Then again, we are regressing to find causes of market fluctuations of sorts. It's more of a background thing than research commissioned to find causes.

Patrick

Quoth Damon:

"1) R is open source and I just don’t think financial companies want to risk their analyses accurateness on open source software that they have to justify to various regulatory bodies. They prefer other risks."

Most companies base their entire information infrastructure on open-source operating systems and web servers, such as Linux and Apache, because they are the most powerful and most customizable. R is quickly following a similar path, because having the source code lets you verify the integrity of your entire operation. In fact, a number of fields now require the use of R, as well as the inclusion of your source code in the publication, for the sake of transparency.

Tom

@Matt B

The screenshots in Ian's post are from http://google.com/websiteoptimizer

Brian

Anybody this far down the comments is likely to love a newish spreadsheet program called Resolver One, which is free for non-commercial and open source use (there are also very expensive versions). It is programmable in Python, and talks intelligently to databases, which basically means that you can do anything with it -- imagine a spreadsheet that is also a clean IDE. It is a real treat.

Nylund

I know a lot of econometricians and many of them are using R these days and every day more and more people seem to be switching to it. One econometrician I respect told me that R was much slower than SAS. He explained why but it went over my head.

In general, I hear of more and more people, especially those developing new theoretical estimation techniques, doing their work on R. Plus, poor grad students like free software (not that there aren't ways to get the costly ones).

The people I know working on pretty great non-parametric and semi-parametric estimation techniques are all using R.