# Statistical Slumps

My dad was a box salesman and once a year he’d take me on an overnight sales trip. My daughter and I continued the tradition recently with a “dad and daughter” trip to Boston. If you are driving on I-84 to Boston, we can recommend The Traveler Restaurant (where you get to choose three free books with your yummy meal), and in Boston, we loved kayaking at Charles River Canoe and Kayak.

But it was during a trip to the Boston Science Museum that I had an idea about calculating statistical slumps. The museum has an excellent example of a Galton Box, an apparatus in which balls are dropped from the top of a tall board and bounce off a grid of evenly spaced pegs on the way down. If the pegs are spaced properly, a ball that strikes a peg has a 50-50 chance of bouncing to the right or to the left.

The a-ha moment of the demonstration is to see that the balls end up in the bins at the bottom in piles that approximate the perfect bell-curve shape of the normal distribution.

The Galton Box (also known as the Quincunx) is related to Pascal’s Triangle because the triangle tells you the number of ways to reach a particular bin.

For example, if a Galton box had four rows of pegs and five bins, there would be six (equally likely) routes to reach the middle bin. But notice that there is always just one way to reach each outermost bin.
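The route counts can be checked with binomial coefficients. The sketch below is only an illustration, assuming every peg sends the ball left or right with equal probability; the function names `routes_to_bins` and `bin_probabilities` are made up for this example.

```python
from math import comb


def routes_to_bins(rows):
    """Count the left/right paths that end in each bin of a Galton box.

    A box with `rows` rows of pegs has `rows + 1` bins, and the counts
    are the entries of the matching row of Pascal's Triangle.
    """
    return [comb(rows, k) for k in range(rows + 1)]


def bin_probabilities(rows):
    """Chance of landing in each bin, assuming 50-50 bounces at every peg."""
    total_paths = 2 ** rows
    return [count / total_paths for count in routes_to_bins(rows)]


print(routes_to_bins(4))     # [1, 4, 6, 4, 1]
print(bin_probabilities(4))  # [0.0625, 0.25, 0.375, 0.25, 0.0625]
```

With more rows the same counts trace out the bell shape that the balls pile up into.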

It occurred to me that it would be pretty easy to derive a statistical standard for determining when an athlete was having a “statistically significant slump.” For example, Alex Rodriguez recently went through a homerless drought of 72 at-bats. Over his career, A-Rod has averaged one homer for every 14.2 at bats — suggesting there is about a 93 percent chance that he will not homer on any individual at bat. It would be crazy to say that he was in a home-run slump after failing to homer after just a few at bats. But the question is how many homer-less at bats is enough to be a statistically significant drought?

The answer is 42. There is less than a 5 percent chance that Rodriguez would go homerless 42 times in a row — so we can reject the hypothesis (at a 5 percent level of statistical significance) that he is going homer-less merely as a matter of chance. You can calculate your own drought statistics for any sporting event (for example, how many losses does Tiger have to have before he’s having a statistically significant drought?) just by using the following formula:

Athlete is having a statistically significant drought if:

Total consecutive number of bad events > log(.05)/log(probability of single bad event)

You can copy and paste the right-hand side of this inequality into Google, plugging in the probability of a single bad event (yes, Google is a calculator):

For A-Rod going homer-less, you would Google: log(.05)/log(.93).

If you want to know his statistical drought number for a 1 percent level of significance, you would Google: log(.01)/log(.93).

If you think Tiger Woods has a 25 percent chance of winning any individual tournament, then he would be experiencing a statistically significant drought after: log(.05)/log(.75) = 10.4 consecutive losses.
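The formula follows from treating each attempt as independent: if a single bad event has probability q, a run of n bad events in a row has probability q^n, and q^n < .05 exactly when n > log(.05)/log(q) (the inequality flips because log(q) is negative, and any base of logarithm gives the same ratio). Here is a minimal Python sketch of the same calculation; the function name `drought_threshold` and its parameters are illustrative and the independence assumption is built in.

```python
import math


def drought_threshold(p_bad, alpha=0.05):
    """Smallest whole number of consecutive bad events n with p_bad**n < alpha.

    Assumes every attempt is independent and has the same chance of a bad
    event, which is the assumption behind the formula in the text.
    """
    if not (0 < p_bad < 1 and 0 < alpha < 1):
        raise ValueError("p_bad and alpha must lie strictly between 0 and 1")
    bound = math.log(alpha) / math.log(p_bad)
    # The run must be strictly longer than the bound, so round up past it.
    return math.floor(bound) + 1


print(drought_threshold(0.93))        # 42 (A-Rod, 5 percent level)
print(drought_threshold(0.93, 0.01))  # 64 (A-Rod, 1 percent level)
print(drought_threshold(0.75))        # 11 (Tiger, 5 percent level)
```

Because a streak is a whole number of events, the 10.4 in the Tiger example rounds up to 11 consecutive losses.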

The revolution of statistics in sports reporting has to date been almost exclusively an increase in descriptive statistics. But these examples show how it might be possible for reporters to usefully include some tests of statistical significance in their reporting. Even now, it would be possible to test whether reporters start using the term “drought” only after a player experiences a statistically significant number of bad events. It might be fun to do a study to back out the implicit level of statistical significance that reporters require before they use “slump” or “drought.” I’d predict that this implicit level varies with how much they like the athlete — so that they would start using the term more quickly with regard to Rodriguez than, say, Jeter.

Calculating the magic numbers for statistically significant droughts is also related to the civil rights problem of the “inexorable zero.” In the landmark 1977 employment discrimination case International Brotherhood of Teamsters v. United States, the United States Supreme Court was concerned because: “Between July 2, 1965, and January 1, 1969, [out of] hundreds of line drivers [hired] systemwide . . . [n]one was a Negro.” Footnote 23 of the opinion introduced a new phrase into the civil rights lexicon: “[T]he company’s inability to rebut the inference of discrimination came not from a misuse of statistics but from ‘the inexorable zero.’”

The same formula for calculating statistical droughts can be used to calculate when zero hires becomes statistically inexorable. In fact, in this old post from the Balkinization blog, I calculate when we should start to feel BOGSAT anxiety from a “bunch of guys sitting around a table.” For me, it often kicks in at five.

#### AaronS

Very interesting! However, I wonder if the Bell Curve machine is largely the function of the beads being dumped IN THE CENTER?

Consider that if I had a bucket of sand that I was pouring out, the largest amount would be right under the bucket...with smaller amounts sliding off to the left and the right--another Bell Curve.

I don't think anyone would call this a function of probability, but rather, perhaps, a function of gravity, physics, etc.

Might that be what is "informing" the machine? And if so, that just might be important, too.

#### Mike

I'm not sure how you made the jump from the combinatorics of Pascal's Triangle to the simple formula with logs in it. Can you post your derivation? I'm interested and too busy at work right now to try to derive it myself.

#### Kevin C

By this method, a .333 hitter would be in a slump after just two games without a hit.

log(.05)/log(.667) = 7.4 at bats which is about two games worth

#### bill ricker

Three Cheers for Traveler Book & Food !!

#### Omkar

This is very sloppy statistics! It completely ignores selection bias. Essentially Ayres' argument goes like this:

1. Athlete seems to be having a slump
2. Take the slump and test to see if its success probability is different from the historical success probability.

2 is fine ONLY if we choose the testing period independently of the data. But we are only looking at the athlete because there is a long run of failures!

To see why this is a problem, think about flipping a fair coin 10000 times. You will almost certainly see runs that are 8 long. Suppose we look through and find an 8-unit run. "Aha!" we think, "This run is longer than 4.3, the statistical significance threshold. So we conclude at level 0.05 that this run must have come from an unfair coin." But of course this is nonsense! With a fair coin, you are almost certain to see runs that long. The test is simply broken.
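The coin example is easy to try out with a small simulation. This is a rough sketch rather than a proof: it assumes independent fair flips from Python's `random` module, and the helper names `longest_run` and `share_with_long_run`, the trial count, and the seed are all arbitrary choices made for illustration.

```python
import random


def longest_run(flips, target):
    """Length of the longest unbroken run of `target` in a sequence."""
    best = current = 0
    for flip in flips:
        current = current + 1 if flip == target else 0
        best = max(best, current)
    return best


def share_with_long_run(n_flips=10_000, run_length=8, trials=100, seed=0):
    """Fraction of simulated fair-coin sequences with a run of heads
    at least `run_length` long."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]  # True = heads
        if longest_run(flips, True) >= run_length:
            hits += 1
    return hits / trials


print(share_with_long_run())
```

If the argument above is right, the printed share should sit at or very near 1, even though any single pre-chosen stretch of 8 flips comes up all heads well under 5 percent of the time.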

The essence is that Ayres ignores selection / multiple testing issues completely, giving wildly optimistic significance levels. His implicit independence assumption is also dangerous, but arguably a necessary simplification of the problem (a Markov chain approach would be better, but hard to present). But the selection effect is impossible to ignore and invalidates his whole analysis.

#### Dave

Fascinating article. Seems to me that resampling statistics would be great for this kind of problem too.

#### Felix

How do you obtain that formula?

#### Jerry Tsai

Your idea has merit. Certainly, you must pre-specify the degree of statistical significance achieved to label a drought, otherwise such declarations would be made arbitrarily.

The biggest danger-- in terms of reporting-- is the multiple hypothesis test you would be conducting. If you look often enough, a drought that has only a 5% chance of occurring will EVENTUALLY be observed. (By definition, if we check 20 times whether a 5% event has happened, we expect it to happen about once, after all.) Unless the threshold for the degree of significance is set high enough, you will observe dozens, hundreds, thousands, etc. of "droughts" across many players across a season.

Still, having a rule of thumb for when a drought gets called a drought might be worthwhile to impartially alert the audience as to how significant a streak is. But the bar should be set very high to counteract the multiple tests. I'd advocate 1% or (even better) lower. Certainly not 5%-- that would virtually guarantee that a player on every team would be in a "drought" sometime during a season.
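To put rough numbers on the multiple-testing worry, here is a small sketch. It assumes the checks are independent of one another, which overlapping streaks in a real season are not, and it uses the Bonferroni correction as one simple way to raise the bar; that choice and the function names are illustrative additions, not something proposed in the post.

```python
def chance_of_false_alarm(alpha, checks):
    """Chance that at least one of `checks` independent tests at level
    `alpha` flags a "drought" when nothing but luck is at work."""
    return 1 - (1 - alpha) ** checks


def per_check_level(overall_alpha, checks):
    """Bonferroni-style level for each check so that the chance of any
    false alarm across all checks stays at or below `overall_alpha`."""
    return overall_alpha / checks


print(round(chance_of_false_alarm(0.05, 20), 3))  # 0.642
print(per_check_level(0.05, 20))                  # 0.0025
```

Under these assumptions, twenty looks at the 5% level already produce a false "drought" more often than not.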

#### Ted

No offense, but Duh? Isn't this very basic probability that you learn in high school? I don't see what Pascal's Triangle or the Galton Box have to do with anything. Cool idea anyway, though. I don't mean to be discouraging. :)

#### SKG

So 1 out of 20 stretches of 43 at bats could be expected to have 0 home runs.

Given over 500 at bats per season, that implies better than a 50/50 chance in a given year he'd have such a streak.

In fact, the odds are somewhat higher than that, because you don't have a series of 12 43-at-bat groups. An unlucky run could start anywhere.

I'd have to go dig up my probability book to come up with the exact odds. Anyone have the formula off the top of their head?

#### Audrey

Good illustration also on the negative binomial distribution, which can be used to derive the formula.

#### Harimau

I feel rather stupid for asking this, but how did you derive this?

Total consecutive number of bad events > log(.05)/log(probability of single bad event)

#### eric

This makes no sense at all given the fact that we only talk about a "drought" for people who are having a bad run.

Imagine that there are 10 baseball players that we care about and each of them has 10 statistics that we care about (at bats without a homer, at bats without a hit, etc.). While it is true this "drought" formula works when we talk about a random player, it doesn't work at all when you allow the media to selectively discuss players and stats.

#### Dale

Interesting. Also interesting is that the matter gets more complex when one removes the assumption that the repeated events are statistically independent.

For binary outcomes (i.e. Win/Lose), where the probability of winning next time depends on whether or not one won this time, the contingent probability is trivial - you're always losing, so there's only one probability to worry about.

But what if Tiger's chances of winning this week are determined by *what place* he finished in last week (or even if he played)?

I like your model. But I wouldn't set out to put the bookies out of business just yet...

#### Greg

This analysis illustrates a common mistake. If I pick a run of 42 A-Rod at-bats ahead of time (I vow to start keeping track on September 1, for example), and he doesn't homer in those 42 at-bats, then I can reject (at 5% significance) the null hypothesis that his universal probability of 1 homer per 14.2 at-bats is governing his batting during that stretch.

However, if I simply wait until a drought of 42 homerless at-bats occurs, that doesn't allow me to reject the hypothesis that sheer chance is governing his batting.

For example, suppose A-Rod maintains a rate of 1 homer per 14.2 at-bats throughout his career, each at-bat being completely due to chance. Suppose also that he has 500 at-bats per season. Then he will experience a homerless drought of at least 42 at-bats in about 5 out of 6 seasons! Pure chance can and will cause droughts of that length.
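The chance of at least one such drought somewhere in a season can be computed exactly by tracking the length of the current homerless run from one at-bat to the next. The following is a sketch under the same assumptions as the example (independent at-bats, constant homer rate); the function name `prob_drought` is hypothetical, and the result can be compared against the figure quoted above.

```python
def prob_drought(n_trials, run_length, p_fail):
    """Probability of at least one run of `run_length` consecutive failures
    somewhere in `n_trials` independent trials.

    state[k] holds the probability that the current run of failures has
    length k and no full-length run has happened yet.
    """
    state = [0.0] * run_length
    state[0] = 1.0
    reached = 0.0
    for _ in range(n_trials):
        new_state = [0.0] * run_length
        # A success resets the current run to zero.
        new_state[0] = sum(state) * (1 - p_fail)
        # A failure extends the current run by one.
        for k in range(run_length - 1):
            new_state[k + 1] = state[k] * p_fail
        # A failure on a run one short of the target completes the drought.
        reached += state[run_length - 1] * p_fail
        state = new_state
    return reached


p_no_homer = 1 - 1 / 14.2
print(prob_drought(42, 42, p_no_homer))   # one pre-chosen stretch of 42
print(prob_drought(500, 42, p_no_homer))  # anywhere in a 500 at-bat season
```

The first number is the single-stretch probability behind the 5 percent test; the second is much larger because the drought is allowed to start at any at-bat.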

So the mathematical lesson is that we should only use the type of analysis from the article if we decide on the set of observations beforehand; we get erroneous conclusions if we simply wait for some observations to catch our eye and then try to analyze them.

The baseball question, I guess, is: when we call it a "drought", are we trying to indicate that something has changed for the worse in A-Rod's technique, or do we just mean that the vagaries of chance have been against A-Rod recently?

#### zbicyclist

I don't buy this logic. Remember, a baseball regular will have over 500 at bats in a season. All of them (except the last few) can be the start of either a slump OR of a hot hand.

If we are using the 5% level, we are going to be endlessly bombarded by lists of players who are in slumps or having hot hands even when there is absolutely nothing going on that is predictive of the next at-bat (because, in many sports, there's only weak evidence for slump or hot hand effects).

Isn't sports coverage inane enough now?

#### Clifton Ealy

This is wrong.

It is true that .93^42 is just under 5%. However, in an average season A-Rod gets 622 at bats. The chance of having 42 consecutive at bats without a home run purely due to chance over 622 at bats is about 90%.

For an explanation that avoids the words "discrete time Markov chain" try here: http://www.bumblebeagle.org/horsehide/hitstreaks.html

That site also includes an applet so that you can play around with different streak lengths.

#### Jon Leonard

The problem is that there are very many possibly significant statistics. If you look for 20 possible properties in a genuinely random environment, on average you'll find one (and it won't mean anything).

A higher standard for significance or some way to do a controlled retest might help, of course. But just publishing the significance level along with the factoid of the moment is probably counterproductive.

#### Colin Wyers

Except there are over 8000 different 72-at-bat stretches in A-Rod's career, if looked at continuously. That means that, even if the odds of a 72-at-bat homerless streak were .05 percent, we should see roughly four such streaks in his career by chance alone.

#### Yuval

I think that "the answer is 42" pretty much sums up this world.