Our Daily Bleg: What to Do About Too Much Data?

A reader named Evan Schumacher wrote in with an interesting bleg. (Read about blegs here and send your own here.)

Tucked inside his bleg is the part that tickled me the most: a website Evan created to tell him whether it’s worth it to watch a basketball game he’d recorded. Anyway, I’ll give my answer below, after his bleg.

I was wondering if too much data is ever a bad thing? I ask because I thought one of the rules of life that I’ve learned is that it’s best to have as much data as possible.

Whether it be hard numbers or smart people around, and at least when you are starting, you want as much information as you can get. The smart guys are the ones who know how to analyze it.

However, in my personal life I was having a problem with too much data. I watch all of the Warriors basketball games on DVR. However, nothing is worse than watching for 1.5 hours and in the end your team gets blown out. However, I never want to see the score before I watch because that ruins the game. To solve the problem I created a little website to warn me if the games are bad (www.shouldiwatch.com), but it won’t tell me anything about the outcome (who won or the score) if the game was relatively close. Trust me, as a Warriors fan this is a huge time saver. It’s a stupid little example, but it makes me wonder if there are other cases when you are doing research when you need to turn away from some information.

Anyway, it’s a little bit backwards to think of, but I thought it might be interesting to explore.
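For the curious, here is a minimal sketch of the kind of filter Evan describes. It is not the code behind his site: the `FinalScore` type, the `should_i_watch` function, and the 15-point `BLOWOUT_MARGIN` are all hypothetical, and treating any lopsided final margin as a "bad" game is an assumption. The only property taken from his letter is that the verdict never reveals the winner or the score.

```python
from dataclasses import dataclass

# Hypothetical cutoff, in points. The letter does not say what counts as a blowout.
BLOWOUT_MARGIN = 15


@dataclass(frozen=True)
class FinalScore:
    """Final points for each side (illustrative type, not from the original site)."""
    home: int
    away: int


def should_i_watch(score: FinalScore, blowout_margin: int = BLOWOUT_MARGIN) -> str:
    """Return a spoiler-free verdict that never mentions the winner or the points."""
    margin = abs(score.home - score.away)
    if margin > blowout_margin:
        return "Skip it"
    return "Worth watching"


if __name__ == "__main__":
    print(should_i_watch(FinalScore(home=101, away=98)))
    print(should_i_watch(FinalScore(home=88, away=120)))
```

The filter deliberately throws information away: it reads the full score but reports a single bit, which is what keeps a close game unspoiled.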

His question sounds as if it is directed more at quants than writers, but as a writer I’ll say that I face this dilemma daily. Right now, in the middle of our writing SuperFreakonomics, I’m facing a number of short sections that require a bunch of historical reading and research. But the key thing is that these sections remain short — they are not the donut in this case, but the donut hole, and if they start getting swollen they will turn the book into a flabby monster.

The problem is that the reading and research is so much fun that it is really hard to limit yourself. Especially in this age of Google (and Google Books) and Amazon and even Wikipedia (yes, I was an early detractor but have come around on certain subjects), I am constantly trying to take a little sip from a firehose, and it’s nearly impossible. Reading too much inevitably turns into wanting to write too much; in this case, shorter will be better, but it takes a lot of effort and a long time to get the right three paragraphs (as opposed to a much easier but, to my mind, less effective 12 paragraphs).

The problem is that the more I’ve read — and the more data I’ve consumed, to get back to Evan’s question — the better those three paragraphs will be in the end. It reminds me of making maple syrup, which we did every winter as kids. You’d run around collecting all this sap, gallons and gallons of it from the trees you’d tapped, and then stay up all night boiling it down on an open fire — all to produce one little jar of syrup.

Was it worth the effort? Some people would say yes, others no. But in any case, it sure tasted good.


DK1

I think that website is a great idea, but I wonder if it would work better if fans (who already saw the game) could give their vote (either a thumbs up/down or an Amazon-esque star rating) which would capture more subjective criteria than just the final score.

Your team may have lost by only 8 points, but maybe there were some meaningless points in garbage time to make the score look closer than it really was. Conversely, maybe the final score wasn't compelling, but if a star player puts on a show (or if a player attacks someone in the stands) you still might want to watch.

charles

Mostly what you are collecting is noise. In Steve's case his role is different. He's actually the filter and I thank him for it. He is also working in a somewhat bounded domain. So, in that capacity, data is good, more is better and a little more is nice although probably not worth the effort. If you are investing, or otherwise an expert in an area where there are no experts, then the extra data will lead you down a fool's path.

Indeed the marginal utility of the next bit of data can be negative, leading to health problems (stress) and an increase in confidence (exposing you to large errors). This seems to be the domain of the original post...life in general.

Like food, some is good, more is better, more than that is bad, more than that is worse, more than that and you're dead.

I've told a number of folks that I believe the next set of great advances should come in the form of data aggregation and filtering.

Nuclear Mom

This is a fascinating problem. In the past (20 years ago) I was not concerned about the government accumulating data on us. Who would have time to pore over all the useless conversations, the useless credit card charges, the useless trips, the endless printouts and factoids?

But thanks to truly impressive algorithm development and processing speed, faces can be picked out of crowds, words (bomb, terror) can be picked out of conversations, allowing the data pile to be reduced to a more manageable and targeted level.

Beyond the Big Brother implications, there are two competing goals when reducing or filtering information overload:

1. Can you define your algorithm clearly to produce the desired output? As a non-sports fan, I was intrigued by the problem posed in this bleg. You want to know whether a game is worth watching, but you don't want to know the winner or the score. What criteria are you using to decide if a game is worth watching?

2. How much important information gets discarded by the above algorithm? Stephen addresses this problem -- he is more informed and a better writer (and maybe person) for having consumed reams of information, even though it may not be reflected in his 3 distilled paragraphs. Would you inadvertently throw out a game that was a blowout loss but had some striking features -- a record set by a member of the opposite team that it would have been neat to see?

A fascinating balancing act.

Marc Resnick

There is another consequence of too much information. As long as information stays within manageable levels, it usually increases the quality of our decision making. But if it hits that threshold, we are more likely to ignore it and switch to a more intuitive and subjective decision making process.

So you can decrease someone's decision making quality by facing them with an intimidating mass of information.

Jake

I love the concept: Reviews for sporting events. I would love this. I usually just want to watch the games though. Example: I am a Vikings fan. If they lose terribly, I might still want to watch the game, so that I know where they need to improve. If they win big, I'd also want to watch it. If it's close I definitely want to watch it. And with DVR, I can cut about 1.5 hours out of a 3 hour game. That's good enough for me.

Colin Gray

Keeping it brief is not a new problem:

"I am sorry to write such a long letter. I didn't have time to write a short one."

Variously attributed to Mark Twain, Voltaire, Proust, Pliny the Younger, T.S. Eliot, Abraham Lincoln...

-Roy Blount, Jr. expounds beautifully

The issue is that you can ONLY get a jar of syrup from all those sap-collecting saps but, by creating a "short section" limit to your fascinating research, you constrain yourself as a writer. Let an editor decide or perhaps your book should be two volumes.

misterb

@charles(#2),

I'm in the business you describe (finding faces in crowds, so to speak). On the one hand, we still can't process all the information we gather; on the other hand, your personal privacy has never come under greater threat. Without question, a modern-day Stalin could be far more effective at wiping out dissent and independent thought. We can't turn back Moore's law, but we can demand that our privacy be protected.

The basis of the marginal utility equation for information has to be related to the cost of supplying it. Just as a tip about yesterday's stock market is useless, information has a time value. Just like electrical current, it has a cost of transmission. Clearly, if the cost of transmission exceeds the time value, the information should not only be ignored, but never transmitted. That's why Evan Schumacher's solution is so clever.

thomas

“A wealth of information creates a poverty of attention.”
- economist Herbert Simon

charles

Misterb...you got the wrong guy...you meant Nuclear Mom...however, I take issue with the marginal utility and needing to take into account the cost of transmission. As I stated, when MU goes negative, why would I need the slight added cost of transmission to make me toss the next bit out the door? Unless you're paying me to get it. Interesting idea.

AaronS

Sherlock Holmes thought that too much information cluttered the mind. He described the mind as something of a room that, if too many things are brought in, you cannot put your hand on what you want.

And so, according to Watson, Holmes was quite ignorant of many seemingly important things (e.g., astronomy, philosophy, etc., if I recall). This was done on purpose so that Holmes might fill his mind with only those things that pertained to his passion for solving human puzzles.

DJH

It's not that it's best to have as much information as possible. Rather, it's best to have as much truthful information as possible.

The Internet is an information addict's paradise, but let's be honest, it's choking on fluff, clutter, mistakes, misinformation and actual, out-and-out falsehood. Sites like Wikipedia are especially large repositories of information -- some of it accurate, perhaps a lot of it -- but it has been, and still can be, gamed by people who are, for whatever reason, motivated to manipulate it.

I too love to accumulate information, but must expend a great deal of time filtering it for veracity. The more information I get, the greater the effort I must make to filter out the detritus and lies. At some point one reaches a point of diminishing returns, where one acquires so much information that veracity-filtration either becomes impossible, or is just too unreliable.

Jeffrey

Too much data is a great thing in many ways. However, a lot more data means a lot more work and massive specialization of human capital. The result is that actual human beings end up being less well-rounded--which arguably has strong drawbacks.

For example, massive amounts of data mean that Dr. Wolfers can write about the economics of happiness. This can be great for his career. Toiling over such data might prevent him from spending time with friends and family, though. You see the trade-off.

atanas entchev

Data is only relevant in the context of information (I would offend this forum if I were to expound on the differences between the two). So the real question is "Can you have too much information?" My answer is -- Of course! You need THE RIGHT amount of information in any setting, and too much information can be just as bad as too little.

Danny

Here's a fairly rigorous attempt at an answer:
On Feature Selection: Learning with Exponentially many Irrelevant Features as Training Examples, Andrew Y. Ng. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.

"... in the presence of many irrelevant features, the main source of error in wrapper model feature selection is from overfitting hold-out or cross-validation data."

In other words, if you have more irrelevant data, you're likely to find more statistical anomalies that may cause errors. If you're careful, you can design algorithms that are quite robust to irrelevant data, though, as described in the paper above. In which case, if you have a suspicion that a piece of data might be relevant, then it will usually help a model to include it.
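A rough illustration of that first point, and not a reproduction of the method in the paper: the sketch below generates pure-noise features and a pure-noise target, then reports the strongest correlation found by chance as the number of features grows. The function names, the sample size of 20, and the feature counts are all assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_samples, feature_counts, seed=0):
    """Map each feature count k to the largest |correlation| between the
    target and any of the first k features. Every feature is random noise,
    so any correlation found is a statistical anomaly."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    correlations = []
    for _ in range(max(feature_counts)):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        correlations.append(abs(pearson(feature, target)))
    return {k: max(correlations[:k]) for k in feature_counts}


if __name__ == "__main__":
    for count, strongest in best_spurious_correlation(20, [10, 100, 1000]).items():
        print(f"{count:>5} irrelevant features: strongest |r| = {strongest:.2f}")
```

Because each larger feature set contains the smaller ones, the strongest chance correlation can only stay the same or grow as features are added, which is why a selection procedure that trusts it will tend to overfit.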

Eric M. Jones

I used to know a stunningly attractive young woman with a filthy-rich and quite generous father. When she got bored daddy would buy her a business, or whatever she wanted. She would complain that when she gained premenstrual bloat, it was only in her bustline.

So yes, I guess you can have too much of anything. And there's too much sand at the beach, and too much water in the sea, and sooooo many stars in the sky....

So be grateful for what you have, and if having too much data is your only complaint, come here and I'll give you something to complain about....

gp

Such a difficult question. There were once problems with number crunching, but SAS and SPSS solved that. SQL databases now make maintaining massive amounts of data even easier than analyzing them.

Can there be an equivalent for article reading?

MikeM

Speaking of too much information, I just read (in comment above) about an opinion on this subject attributed to a fictional character! (Sherlock Holmes)

Magnus Falk

This is an old problem for programmers. We often log obsessively in applications to be able to figure out where the bugs are later on when we get bug reports. The problem is that you instead drown in information and are unable to find what you need. Jeff Atwood of Coding Horror wrote about this recently: http://www.codinghorror.com/blog/archives/001192.html

Steve in Pennsylvania

Many politicians and lawyers are convinced there can be too much data to be useful.
When cornered and forced to provide details about some potentially incriminating event, some of the smartest responses go like this:
"Sure, let me give you everything I have on that topic. Here's the 150,000 pages of documents I've been saving. Let me know if you need anything more."