Authors: Ian Ayres
The Lott saga also underscores why it is so important to have independent verification of results. It's so easy for people of good faith to make mistakes. Moreover, once a researcher has spent endless hours producing an interesting result, he or she becomes invested in defending it. I include myself in this tendency. It's easy, and true, to charge that intuitivists and experientialists are subject to cognitive biases. Yet the Lott saga shows that empiricists are, too. Lott's adamant defense of his thesis in the face of such overwhelming evidence underscores this fact. Numbers don't have emotions or preferences, but the number crunchers who interpret them do.
My contretemps with Lott suggests the usefulness of setting up a formalized system of empirical devil's advocacy akin to the role of an Advocatus Diaboli in the Roman Catholic Church. For over 500 years, the canonization process followed a formal procedure in which one person (a postulator) presents the case in favor and another (the promoter of the faith) presents the case against. According to Prospero Lambertini (later Pope Benedict XIV [1740–58]):
It is [the promoter of the faith's duty] to critically examine the life of, and the miracles attributed to, the individual up for sainthood or blessedness. Because his presentation of facts must include everything unfavorable to the candidate, the promoter of the faith is popularly known as the devil's advocate. His duty requires him to prepare in writing all possible arguments, even at times seemingly slight, against the raising of any one to the honours of the altar.
Corporate boards could create devil's advocate positions whose job it is to poke holes in pet projects. These professional “No” men could be an antidote to overconfidence bias, without risking their jobs. The Lott story shows that institutionalized counterpunching may also be appropriate for Super Crunchers to make sure that their predictions are robust.
Among academic crunchers, this devil's advocacy is a two-way street. Donohue and I have crunched numbers testing the robustness of Lott's “More Guns” thesis. Lott again and again has recrunched numbers that my coauthors and I have run. Lott challenged the robustness of an article that Levitt and I wrote showing that the hidden transmitter LoJack has a big impact on reducing crime. And Lott has also recrunched numbers to challenge a Donohue and Levitt article showing that the legalization of abortion reduced crime. To my mind, none of Lott's counterpunching crunches has been persuasive. Nonetheless, the real point is that it's not for us or Lott to decide. By opening up the number crunching to contestation, we're more likely to get it right. We keep each other honest.
Contestation and counterpunching are especially important for Super Crunching, because the method leads to centralized decision making. When you are putting all your eggs in a single decisional basket, it's important to try to make sure that the decision is accurate. The carpenter's creed to “measure twice, cut once” applies. Outside of the academy, however, the useful Lott/Ayres/Donohue/Levitt contestation is often lacking. We're used to governmental or corporate committee reports giving the supposedly definitive results of some empirical study. Yet agencies and committees usually don't have empirical checks and balances. Particularly when the underlying data is proprietary or confidential (and this is still often the case with regard to both business and governmental data), it becomes impossible for outsiders like Lott or me to counterpunch. It thus becomes all the more important that these closed Super Crunchers make themselves answerable to a loyal opposition within their own organizations. Indeed, I predict that data-quality firms will appear to provide confidential second opinions, just as the big four accounting firms audit your books. Decision makers shouldn't rely on the word of just one number cruncher.
Most of this book has been chock full of examples where Super Crunchers get it right. We might or might not always like the impact of their predictions on us as consumers, employees, or citizens, but the predictions have tended to be more accurate than those of humans unaided by the power of data mining. Still, the Lott saga underscores the fact that number crunchers are not infallible oracles. We, of course, can and do get it wrong. The world suffers when it relies on bad numbers.
The onslaught of data-based decision making if not monitored (internally or externally) may unleash a wave of mistaken statistical analysis. Some databases do not easily yield up definitive answers. In the policy arena, there are still lively debates about whether (a) the death penalty, or (b) concealed handguns, or (c) abortions reduce crime. Some researchers have so comprehensively tortured the data that their datasets become like prisoners who will tell you anything you want to know. Statistical analysis casts a patina of scientific integrity over a study that can obscure the misuse of mistaken assumptions.
Even randomized studies, the gold standard of causal testing, may yield distorted predictions. Nobel Prize-winning econometrician James Heckman has appropriately railed against reliance on randomized results where there is a substantial decline in the number of subjects who complete the experiment. For example, at the moment I'm trying to set up a randomized study to test whether Weight Watchers plus a financial incentive to lose weight does better than Weight Watchers alone. The natural way to set this up is to find a bunch of people who are about to start Weight Watchers, flip a coin, and give half of them a financial incentive while making the other half the control group. The problem comes when we try to collect the results. The constitutional prohibition against slavery is a very good thing, but it means that we can't mandate that people continue participating in our study. There is almost always attrition as some people, after a while, quit responding to your phone calls. Even though the treatment and control groups were probabilistically identical in the beginning, they may be very different by the end. Indeed, in this example, I worry that people who fail to lose weight are more likely to quit the financial incentive group, leaving me at the end with a self-censored sample of people who have succeeded. That's not a very good test of whether the financial incentive causes more weight loss.
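The attrition worry can be made concrete with a small simulation. This is a hypothetical sketch with made-up numbers (not data from any actual study): even when the incentive has no true effect at all, selective dropout among the failures in the incentive group makes the surviving sample look like a success story.

```python
import random

random.seed(0)

def simulate(n=10_000, true_effect=0.0, dropout_if_fail=0.5):
    """Hypothetical attrition simulation: a financial incentive with NO
    true effect can look effective if subjects who fail to lose weight
    are more likely to drop out of the incentive (treatment) group."""
    control, treated = [], []
    for _ in range(n):
        # Weight change in pounds; negative means weight loss.
        control.append(random.gauss(0, 10))
        loss_t = random.gauss(-true_effect, 10)
        # Attrition: failures (weight gainers) in the treated group stop
        # returning phone calls with probability dropout_if_fail.
        if loss_t > 0 and random.random() < dropout_if_fail:
            continue  # lost to follow-up; never measured
        treated.append(loss_t)
    mean = lambda xs: sum(xs) / len(xs)
    return mean(control), mean(treated)

c, t = simulate()
# Even with true_effect = 0, the surviving treated sample shows
# apparent weight loss (a lower mean) relative to the control group.
```

The bias comes entirely from who is left in the sample at the end, which is exactly Heckman's complaint about trials with heavy attrition.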
One of the most controversial recent randomized studies concerned an even more basic question: are low-fat diets good for your health? In 2006, the Women's Health Initiative (WHI) reported the results of a $415 million federal study. The researchers randomly assigned nearly 49,000 women ages fifty to seventy-nine to follow a low-fat diet or not, and then followed their health for eight years.
The low-fat diet group “received an intensive behavioral modification program that consisted of eighteen group sessions in the first year and quarterly maintenance sessions thereafter.” These women did report eating 10.7 percent less fat at the end of the first year and 8.1 percent less fat at the end of year six. (They also reported eating on average each day an extra serving of vegetables or fruit.)
The shocking news was that, contrary to prior accepted wisdom, the low-fat diet did not improve the women's health. The women assigned to the low-fat diet weighed about the same and had the same rates of breast cancer, colon cancer, and heart disease as those whose diets were unchanged. (There was a slightly lower risk of breast cancer: 42 per 10,000 per year in the low-fat diet group, compared with 45 per 10,000 in the regular diet group, but the difference was statistically insignificant.)
Some researchers trumpeted the results as a triumph for evidence-based medicine. For them, this massive randomized trial conclusively refuted earlier studies which suggested that low-fat diets might reduce the incidence of breast or colon cancer. These earlier studies were based on indirect evidence: for example, finding that women who moved to the United States from countries where diets were low in fat acquired a higher risk of cancer. There were also some animal studies showing that a high-fat diet could lead to more mammary cancer.
So the Women's Health Initiative was a serious attempt to directly test a central and very pressing question. A sign of the researchers' diligence can be seen in the surprisingly low rate of attrition. After eight years, only 4.7 percent of the women in the low-fat diet group withdrew from participation or were lost to follow-up (compared with 4.0 percent of the women in the regular diet group).
Nonetheless, the study has been attacked. Even supporters of evidence-based medicine have argued that the study wasted hundreds of millions of dollars because it asked the wrong question. Some say that the recommended diet wasn't low fat enough. The dieters were told that 20 percent of their calories could come from fat. (Only 31 percent of them got their dietary fat that low.) Some critics think, especially because of compliance issues, that the researchers should have recommended 10 percent fat.
Other critics think the study is useless because it tested only for the impact of reducing total fat in the diet instead of testing the impact of reducing saturated fats, which raise cholesterol levels. Randomized studies can't tell you anything about treatments that you failed to test. So we just don't know whether reducing saturated and trans fats might still reduce the risk of heart disease. And we're not likely to get the answer soon. Dr. Michael Thun, who directs epidemiological research for the American Cancer Society, called the WHI study “the Rolls-Royce of studies,” not just because it was high quality, but also because it was so expensive. “We usually have only one shot,” he said, “at a very large-scale trial on a particular issue.”
Similar concerns have been raised about another WHI study, which tested the impact of calcium supplements. A seven-year randomized test of 36,000 women aged fifty to seventy-nine found that taking calcium supplements resulted in no significant reduction in the risk of hip fracture (but did increase the risk of kidney stones). Critics again worry that the study asked the wrong question of the wrong set of women. Proponents of calcium supplements want to know whether supplements might still help older women. Others said the researchers should have excluded women who were already getting plenty of calcium in their regular diet, so that the study would have tested the impact of calcium supplements when there is a pre-existing deficiency. And of course some wished they had tested a higher-dose supplement.
Still, even the limited nature of the results gives one pause. Dr. Ethel Siris, president of the National Osteoporosis Foundation, said the new study made her question the advice she had given women to take calcium supplements regardless of what is in their diet. “We didn't think it hurt, which is why doctors routinely gave it,” Siris said.
When she heard about the results of the calcium study, Siris's first reaction was to try to pick it apart. She changed her mind when she heard the unreasonable way that people were criticizing some of the WHI studies. Seeing the psychology of resistance in others helped her overcome it in herself. She didn't want to find herself “thinking there was something wrong with the design of this study because I don't like the results.”
Much is at stake here. The massive randomized WHI studies are changing physician practice with regard to a host of treatments. Some doctors have stopped recommending low-fat diets to their patients as a way to reduce their heart disease and cancer risk. Others, like Siris, have changed their minds about calcium supplements. Even the best studies need to be interpreted. Done well, Super Crunching is a boon to society. Done badly, database decision making can kill.
The rise of Super Crunching is a phenomenon that cannot be ignored. On net, it has improved our lives and will continue to do so. Having more information about “what causes what” is usually good. But the purpose of this chapter has been to point out exceptions to this general tendency. Much of the resistance that we've seen over and over in this book can be explained by self-interest. Traditional experts don't like the loss of control and status that often accompanies a shift toward Super Crunching. But some of the resistance is more visceral. Some people fear numbers. For these people, Super Crunching is their worst nightmare. To them, the spread of data-driven decision making is just the kind of thing they thought they could avoid by majoring in the humanities and then studying something nice and verbal, like law.
We should expect a Super Crunching backlash. The greater its impact, the greater the resistance, or at least pockets of resistance. Just as we have seen the rise of hormone-free milk and cruelty-free cosmetics, we should expect to see products that claim to be “data-mining free.” In a sense, we already do. In politics, there is a certain attraction to candidates who are straight shooters, who don't poll every position, who don't relentlessly stay on message to follow a focus group-approved script. In business, we find companies like Southwest Airlines that charge one price for any seat on a particular route. Southwest passengers don't need Farecast to counter-crunch future fares on their behalf, because Southwest doesn't play the now-you-see-it-now-you-don't pricing games (euphemistically called “revenue enhancement”) by which other airlines try to squeeze as much as they can from every individual passenger.
While price resistance is reasonable, a broader quest for a life untouched by Super Crunching is both infeasible and ill-advised. Instead of a Luddite rejection of this powerful new technology, it is better to become a knowledgeable participant in the revolution. Instead of sticking your head in the sands of innumeracy, I recommend filling your head with the basic tools of Super Crunching.
CHAPTER 8
The Future of Intuition (and Expertise)
Here's a fable that happens to be true. Once upon a time, I went for a hike with my daughter Anna, who was eight years old at the time. Anna is a talkative girl who is, much to my consternation, developing a fashion sense. She's also an intricate planner. She'll start thinking about the theme and details of her birthday party half a year in advance. Recently, she's taken to designing and fabricating elaborate board games for her family to play.
While we were hiking, I asked Anna how many times in her life she had climbed the Sleeping Giant trail. Anna replied, “Six times.” I then asked what was the standard deviation of her estimate. Anna replied, “Two times.” Then she paused and said, “Daddy, I want to revise my mean to eight.”
Something in Anna's reply gets at the heart of why “thinking-by-numbers is the new way to be smart.” To understand what was going on in that little mind of hers, we have to step back and learn something about our friend, the standard deviation.
You see, Anna knows that standard deviations are an incredibly intuitive measure of dispersion. She knows that standard deviations give us a way of toggling back and forth between numbers and our intuitions about the underlying variability of some random process. This all sounds horribly abstract and unhelpful, but one concrete fact is now deeply ingrained in Anna's psyche:
There's a 95 percent chance that a normally distributed variable will fall within two standard deviations (plus or minus) of its mean.
In our family, we call this the “Two Standard Deviation” rule (or 2SD for short). Understanding this simple rule is really at the heart of understanding variability. So what does it mean? Well, the average IQ score is 100 and the standard deviation is 15. So the 2SD rule tells us that 95 percent of people will have an IQ between 70 (which is 100 minus two standard deviations) and 130 (which is 100 plus two standard deviations). Using the 2SD rule gives us a simple way to translate a standard deviation number into an intuitive statement about variability. Because of the 2SD rule, we can think about variability in terms of something that we understand: probabilities and proportions. Most people (95 percent) have IQs between 70 and 130. If the distribution of IQs were less variable (say, a standard deviation of only 5), then the range of scores that just included 95 percent of the population would be much smaller. We'd be able to say 95 percent of people have IQs between 90 and 110. (In fact, later on, we'll learn how Larry Summers, the ousted president of Harvard, got into a world of trouble by suggesting that men and women have different IQ standard deviations.)
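For readers who like to check the arithmetic, the 2SD rule is easy to verify in a few lines of code. This is just an illustrative sketch using Python's standard library; the exact normal-distribution figure is 95.45 percent, which is where the rule's “95 percent” shorthand comes from.

```python
from statistics import NormalDist

# The 2SD rule: roughly 95 percent of a normal distribution falls
# within two standard deviations (plus or minus) of its mean.
iq = NormalDist(mu=100, sigma=15)
low, high = 100 - 2 * 15, 100 + 2 * 15    # 70 and 130
share = iq.cdf(high) - iq.cdf(low)        # fraction of people in [70, 130]
# share is about 0.954, i.e. roughly 95 percent

# A less variable population (sigma = 5) squeezes the same 95 percent
# into a much narrower band: 90 to 110.
narrow = NormalDist(mu=100, sigma=5)
share_narrow = narrow.cdf(110) - narrow.cdf(90)
```

Either way you set the standard deviation, two of them on each side of the mean capture the same 95 percent of the population; only the width of the band changes.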
We now know enough to figure out what was going on in Anna's eight-year-old mind during that fateful hike. You see, Anna can recite the 2SD rule in her sleep. She knows that standard deviations are our friends and that the first thing you always do whenever you have a standard deviation and a mean is to apply the 2SD rule.
Recall that after Anna said she had hiked Sleeping Giant six times, she said the standard deviation of her estimate was two. She got the number two as her estimate for the standard deviation by thinking about the 2SD rule. Anna asked herself what the 95 percent range of her confidence was and then tried to back out a number that was consistent with her intuitions. She used the 2SD rule to translate her intuitions into a number. (If you want a challenge, see if you can use the 2SD rule and just your intuition to derive a number for the standard deviation for adult male height. You'll find help at the bottom of the page.)
*4
But Anna wasn't done. The really amazing thing was that after a pause of a few seconds, she said, “Daddy, I want to revise my mean to eight.” During that pause, after she told me her estimate was six and the standard deviation was two, she was silently thinking more about the 2SD rule. The rule told her, of course, that there was a 95 percent chance that she had walked to the top of Sleeping Giant between two and ten times. And here's the important part: without any prompting she reflected on the truth of this range using nothing more than her experience, her memories. She realized that she had clearly walked it more than two times. Her numbers didn't fit her intuitions.
Anna could have resolved the contradiction by revising down her standard deviation estimate (“Daddy, I want to revise my standard deviation to one”). But she felt instead that it was more accurate to increase her estimate of the mean. By revising the mean up to eight, Anna was now saying there was a 95 percent chance that she had walked the trail between four and twelve times. Having been her companion on these walks, I can attest that she revised the right number.
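Anna's reasoning can be written down as a two-line calculation. Here is a minimal sketch of the 2SD arithmetic she ran in her head, before and after her revision:

```python
def two_sd_interval(mean, sd):
    """Apply the 2SD rule: the rough 95 percent range is mean +/- 2*sd."""
    return mean - 2 * sd, mean + 2 * sd

# Anna's first answer: mean 6, standard deviation 2.
first = two_sd_interval(6, 2)     # (2, 10): "between two and ten times"

# Her memory says two hikes is implausibly low, so she revises the mean
# upward rather than shrinking the standard deviation.
revised = two_sd_interval(8, 2)   # (4, 12): "between four and twelve times"
```

The code is trivial; the interesting step is the one no program can do for her: checking whether the implied range squares with her memories, and deciding which number to revise.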
I had never been more proud of my daughter. Yet the point of the story is not (only) to kvell about Anna's talents. (She is a smart kid, but she's no genius child. Notwithstanding my best attempts to twist her intellect, she's pretty normal.) No, the point of the story is to show how statistics and intuition can comfortably interact. Anna toggled back and forth between her memories as well as her knowledge of statistics to come up with a better estimate than either could have produced by themselves.
By estimating the 95 percent probability range, Anna actually produced a more accurate estimate of the mean. This is potentially a huge finding. Imagine what it could mean for the examination of trial witnesses, where lawyers often struggle to elicit estimates of when or how many times something occurred. You might even use it yourself when trying to jog your own or someone else's memory.
The (Wo)Man of the Future
For the rational study of the law the blackletter man may be the man of the present, but the man of the future is the man of statistics…
OLIVER WENDELL HOLMES, JR., “THE PATH OF THE LAW,” 1897
The rise of statistical thinking does not mean the end of intuition or expertise. Rather, Anna's revision underscores how intuition will be reinvented to coexist with statistical thinking. Increasingly, decision makers will switch back and forth between their intuitions and data-based decision making. Their intuitions will guide them to ask new questions of the data that non-intuitive number crunchers would miss. And databases will increasingly allow decision makers to test their intuitionsânot just once, but on an ongoing basis.
This dialectic is a two-way street. The best data miners will sit back and use their intuitions and experiential expertise to query whether their statistical analysis makes sense. Statistical results that diverge widely from intuition should be carefully interrogated. While there is now great conflict between dyed-in-the-wool intuitivists and the new breed of number crunchers, the future is likely to show that these tools are complements more than substitutes. Each form of decision making can pragmatically counterbalance the greatest weaknesses of the other.
Sometimes, instead of starting with a hypothesis, Super Crunchers stumble across a puzzling result, a number that shouldn't be there. That's what happened to Australian economist Justin Wolfers, when he was teaching a seminar at the University of Pennsylvania's Wharton School on information markets and sports betting. Wolfers wanted to show his students how accurate Las Vegas bookies were at predicting college basketball games. So Wolfers pulled data on over 44,000 games, almost every college basketball game over a sixteen-year period. He created a simple graph showing what the actual margin of victory was relative to the market's predicted point spread.
“The graph was bang on a normal bell curve,” he said. Almost exactly 50 percent (50.01 percent) of the time the favored team beat the point spread and almost exactly 50 percent of the time they came up short. “I wanted to show the class that this wasn't just true in general, but that it would hold true for different-size point spreads.” The graphs that Wolfers made for games where the point spread was less than six points, and for point spreads from six to twelve points, again showed that the Las Vegas line was extremely accurate. Indeed, the following graph for all games where the point spread was twelve or less shows just how accurate:
SOURCE: Justin Wolfers, “Point Shaving: Corruption in NCAA Basketball,” PowerPoint presentation, AEA Meetings (January 7, 2006)
Look how close the actual distribution of victory margins (the solid line) was to the theoretical normal bell curve. This picture gives you an idea of why they call it the “normal” distribution. Many real-world variables are approximately normal, taking the shape of your standard-issue bell curve. Almost nothing is perfectly normal. Still, many actual distributions are close enough to the normal distribution to provide a wickedly accurate approximation until you get several standard deviations into the tails.
*5
The problem was when Wolfers graphed games where the point spread was more than twelve points. When he crunched the numbers for his class, this is what he found:
SOURCE: Justin Wolfers, “Point Shaving: Corruption in NCAA Basketball,” PowerPoint presentation, AEA Meetings (January 7, 2006)
Instead of a 50-50 chance that the favored team would beat the point spread, Wolfers found there was only a 47 percent chance that the favored team would beat the spread (and hence a 53 percent chance that they would fail to cover the spread). This six percentage point difference might not sound like a lot, but when you're talking about millions of dollars bet on thousands of games (more than one fifth of college games have a point spread of more than twelve points), six percentage points is a big discrepancy. Something about the graph struck Wolfers as being very fishy and he started puzzling about it.
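How big a discrepancy is six percentage points across thousands of games? A back-of-the-envelope check makes the point. The game count here is my own rough estimate (roughly 9,000, read off the “more than one fifth of 44,000 games” figure), not Wolfers's exact number:

```python
from math import sqrt
from statistics import NormalDist

# Assumed sample size: about one fifth of 44,000 games had point
# spreads above twelve, so call it 9,000 (an estimate, not exact data).
n = 9_000
observed = 0.47   # share of favorites covering the spread
expected = 0.50   # what an unbiased betting line implies

# Normal approximation to the binomial: how many standard errors is
# the observed share away from a fair 50 percent?
se = sqrt(expected * (1 - expected) / n)
z = (observed - expected) / se
p_value = 2 * NormalDist().cdf(-abs(z))   # two-sided
# z comes out near -5.7: the odds that a gap this large arises from
# chance alone are well under one in a million.
```

In other words, with samples this large, a three-point shortfall from 50 percent isn't noise; it's the kind of anomaly that demands an explanation.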
Wolfers was a natural for this investigation. He once worked for a bookie back home in Australia. More importantly, he is a leader in the new style of Super Crunching. A 2007 New York Times article, aptly titled “The Future of Economics Isn't So Dismal,” singled him out as one of thirteen promising young economists, Super Crunchers all, who were remaking the field. Wolfers has a toothy smile and long, platinum blond hair often worn in a ponytail. To see him present a paper is to experience an endearing mixture of substance and flash. He is very much a rock star of Super Crunching.
So when Wolfers looked at the lopsided graph, it wasn't just that the favored teams failed to cover the spread too often that bothered him; it was that they failed by just a few points. It was the hump in the distribution just below the Vegas line that didn't seem right. Justin began to worry that, in a small proportion of high-point-spread games, players on the favored team were shaving points. Suddenly, it all made a lot of sense. When there was a large point spread, players could shave points without really hurting their team's chances of still winning the game. Justin didn't think that all the games were rigged. But the pattern in the graph is what you'd see if about 6 percent of all high-point-spread games were fixed.