Megan McArdle

How to lie (to yourself) with statistics

26 Feb 2008 06:18 am

William Briggs has a nice piece on how easy it is to delude yourself into thinking you've found a connection between two factors:

To show you how easy it is to mislead yourself with stepwise procedures, I did the following simulation. I generated 100 observations for y’s and 50 x’s (each of 100 observations of course). All of the observations were just made up numbers, each giving no information about the other. There are no relationships between the x’s and the y’s. The computer, then, should tell me that the best model is no model at all.

But here is what it found: the stepwise procedure gave me a best combination model with 7 out of the original 50 x’s. But only 4 of those x’s met the usual criterion for being kept in a model (explained below), so my final model is this one:

explan.   p-value   Pr(beta_x > 0 | data)
x7        0.0053    0.991
x21       0.046     0.976
x27       0.00045   0.996
x43       0.0063    0.996

In classical statistics, an explanatory variable is kept in the model if it has a p-value < 0.05. In Bayesian statistics, an explanatory variable is kept in the model when the probability of that variable (well, of its coefficient being non-zero) is larger than, say, 0.90. Don't worry if you don't understand what any of that means---just know this: this model would pass any test, classical or modern, as being good. The model even had an adjusted R2 of 0.26, which is considered excellent in many fields (like marketing or sociology; R2 is a number between 0 and 1, higher numbers are better).

Nobody, or very very few, would notice that this model is completely made up. The reason is that, in real life, each of these x’s would have a name attached to it. If, for example, y was the amount spent on travel in a year, then some x’s might be x7=“married or not”, x21=“number of kids”, and so on. It is just too easy to concoct a reasonable story after the fact to say, “Of course, x7 should be in the model: after all, married people take vacations differently than do single people.” You might even then go on to publish a paper in the Journal of Hospitality Trends showing “statistically significant” relationships between being married and travel money spent.

And you would be believed.

I wouldn’t believe you, however, until you showed me how your model performed on a set of new data, say from next year’s travel figures. But this is so rarely done that I have yet to run across an example of it. When was the last time anybody read an article in a sociological, psychological, etc., journal in which truly independent data is used to show how a previously built model performed well or failed? If any of my readers have seen this, please drop me a note: you will have made the equivalent of a cryptozoological find.

Incidentally, generating these spurious models is effortless. I didn’t go through 100s of simulations to find one that looked especially misleading. I did just one simulation. Using this stepwise procedure practically guarantees that you will find a “statistically significant” yet spurious model.
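The flavor of that experiment can be reproduced in a few lines. What follows is a minimal sketch, not Briggs's code: it swaps the stepwise procedure for a simpler one-variable-at-a-time correlation screen, and the function names, the seed, and the approximate critical value of 0.1966 (the two-sided 5% cutoff for a correlation with 100 observations) are all assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def screen_noise(n_obs=100, n_vars=50, r_crit=0.1966, seed=0):
    """Screen pure-noise predictors against a pure-noise y.

    Returns {variable index: r} for every predictor whose |r| exceeds
    r_crit. The default r_crit is an assumed approximation of the
    two-sided p < 0.05 cutoff for n_obs = 100.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = {}
    for j in range(1, n_vars + 1):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated to y by construction
        r = pearson_r(x, y)
        if abs(r) > r_crit:
            hits[j] = r
    return hits


def average_hits(n_runs=100):
    """Average number of 'significant' noise predictors over many simulated datasets."""
    return sum(len(screen_noise(seed=s)) for s in range(n_runs)) / n_runs


if __name__ == "__main__":
    print("spurious predictors in one run:", screen_noise())
    print("average per run:", average_hits())
```

Every predictor here is noise, so anything the screen keeps is spurious by construction. A real stepwise procedure searches over combinations of variables as well, which only gives it more chances to find something.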

This sort of thing is why we're barraged with studies showing that almost everything will kill you--no, wait! they'll make you live forever!

Comments (14)

This is one reason for the truth of my dictum: "All medical research is rubbish" is a better approximation to the truth than almost all medical research.

This is the more general case of the "birthday paradox"

(If you have 25 people in a room, there is a greater than 50% chance that two of them will have the same birthday)

In more vulgar terms: coincidence is more common than you'd think.
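The birthday figure is easy to check. This is a minimal sketch, assuming 365 equally likely birthdays and no leap years; the function name is made up for illustration.

```python
def p_shared_birthday(n_people, days=365):
    """Probability that at least two of n_people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n_people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (22, 23, 25):
        print(n, round(p_shared_birthday(n), 4))
```

Under those assumptions the probability first passes 50% at 23 people, so the claim for 25 holds with room to spare.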

R2 of 0.26 is very low, only a dimwit would move forward with such a bad model.

The example points out that while doing analysis professionally, people are pressed to accept low standards for outcomes that any good grad student would reject.

R2 standards depend on the field. A field that has only weak relationships (thanks to too many confounding variables, complexity, or some other reason) tends to accept very low R2 as compared to harder sciences that have strong relationships. This is but one of the reasons why Social Science and the Liberal Arts are looked down on by the Physical Sciences.

1/20 random datasets will show a correlation at the 95% level. Which means that if you put 100 monkeys in a room collecting random data, you'd end up with 5 monkeys who would've found publishable results.
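The arithmetic behind the monkeys can be sketched as follows, assuming the tests are independent and each has a 5% false-positive rate; the function names are invented for illustration.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' results when nothing real is there."""
    return n_tests * alpha


def p_at_least_one(n_tests, alpha=0.05):
    """Chance that at least one of n_tests independent null tests comes up significant."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    print(expected_false_positives(100))   # the 100 monkeys
    print(round(p_at_least_one(50), 3))    # 50 candidate predictors, as in the post
```

Under the independence assumption, 100 monkeys yield 5 publishable results on average, and someone who tries 50 unrelated predictors has better than a 90% chance of at least one "significant" hit.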

R2 standards depend on the field. A field that has only weak relationships (thanks to too many confounding variables, complexity, or some other reason) tends to accept very low R2 as compared to harder sciences that have strong relationships.

Where would medical research fall into that spectrum, that being Megan's example? Would an R2 of 0.26 be considered meaningful in that field?

R^2 is really irrelevant. It is the t statistic on each variable that really matters.

The T tells you how likely it is that the variable has some effect.

The R^2 tells you how much of the total variation your model captures. A variable can be meaningful in both the statistical and economic sense and still only explain 10% of the variation in the Y variable.


This article is an example of why data mining is so dangerous. You can always find spurious correlations. That is why you should always start with a rational theory BEFORE looking at the data set and then test that theory against the data. The probability that you came up with a theory that just happened to match the data is much lower than the probability that you tried a bunch of stuff and found that it matched the data.

Benjamin Disraeli said it best: "There are three kinds of lies: Lies, damned lies, and statistics."

There is a reason why the real sciences (physics, chemistry, biology) do not have "science" in their name - they are actually "sciences" that don't have to masquerade as such. In the real sciences, you (1) make measurable and reproducible observations that lead to (2) a hypothesis that leads to (3) falsifiable measurable and reproducible predictions of future experiments, which are (4) fed back into the hypotheses. Social "sciences" don't have any way to make these sort of predictions, and most medicine is in a similar bind, because we don't think it is ethical to perform experiments on people (which is why we experiment on animals, but I really don't want to go there).

The important part of all of this is the "reproducible and measurable measurements" that everyone else can do, not just the person producing the hypotheses. If the results and the experiments cannot be reproduced, and measured, then the work is not science. You can use some non-scientific methods to try to tease out a concept, but it is not real science until you can make a falsifiable prediction about an experiment.

As I think someone else has already alluded to, setting p

As Josh points out, you would expect 5 of 100 random datasets to be "significant" at the 0.05 level -- that's what it means. The example actually underperforms chance, with only 4.

ScentOfViolets is right, except we physicists know less statistics than do sociologists. I did my grad work on data with a 600:1 signal-to-noise ratio. I didn't need to learn much statistics.

"R^2 is really irrelevant. It is the t statistic on each variable that really matters.

The T tells you how likely it is that the variable has some effect.

The R^2 tells you how much of the total variation your model captures"

So a t-stat or other measure tests your hypothesis that the variables are related. Still, if your R2 is too low, you don't have a good model. (You probably need more variables.)
Which is the case for this example.

A number of points (from someone who teaches stats at a graduate school):

-Replication of studies goes on. It's usually not done as direct replications, but comes under the name "meta-analysis" or "research synthesis." For instance, the Department of Education has a web page: http://ies.ed.gov/ncee/wwc/ with meta-analysis info. There are issues that are important, but meta-analysis is a major step forward.

-Interpreting the size of R-squared is contentious. Suffice it to say that the measure (and its many relatives) is best used in a comparative fashion when comparing a set of theoretically motivated models. Unfortunately, many people are taught badly about what R-squared means.

-As to the t-statistics of individual coefficients, this is a major mistake. It's quite possible to reparametrize a model to alter the t-statistics seemingly arbitrarily, all of which make exactly the same predictions about the data. This isn't a deficiency in statistics, it's a fundamental mathematical fact. This means that looking at the individual t-statistics is, generally, not very useful, though many people do it.

-Finally, the relationship between p-values (or the t-statistics) and effect size is quite weak, as is discussed in Jim Berger's work (statistics, Duke University). The reason is quite technical, but simply put, p-values depend on two quantities, effect size and the size of sampling variability measured by the standard error, in ratio (effect size/standard error). Basically what happens is that the effect size can be quite modest but the standard error can go towards 0, making the p-value get very small. There have been numerous efforts to "calibrate" p-values, most of which are crazy. Jim Berger has done the best job, to my knowledge, giving an approximate relationship between Bayesian quantities of interest and p-values.

-The "search and seize" or "fishing expedition" represented by stepwise regression is old hat. Most people I've encountered know that, though the sequential specification search that many people do in social science sadly mimics this. No methodologist would advocate it, but sadly, far too many practicing scientists don't remember their lessons, or, more likely, get frisky when the need for publication gets closer.
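The point above about p-values and effect size can be made concrete. This is a minimal sketch, assuming a simple z-test of a mean with known standard deviation; the function name and the numbers are illustrative only and are not taken from Berger's work.

```python
import math


def two_sided_p(effect, sd, n):
    """Two-sided p-value for a z-test: effect size divided by its standard error."""
    se = sd / math.sqrt(n)   # standard error shrinks as the sample grows
    z = effect / se
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Same tiny effect (0.01 standard deviations), very different sample sizes.
    print(two_sided_p(0.01, 1.0, 100))
    print(two_sided_p(0.01, 1.0, 1_000_000))
```

The effect is identical in both calls. Only the standard error changes, and that alone moves the p-value from nowhere near significance to vanishingly small.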

rxc said:

"There is a reason why the real sciences (physics, chemistry, biology) do not have "science" in their name - they are actually "sciences" that don't have to masquerade as such. In the real sciences, you (1) make measurable and reproducible observations that lead to (2) a hypothesis that leads to (3) falsifiable measurable and reproducible predictions of future experiments, which are (4) fed back into the hypotheses. Social "sciences" don't have any way to make these sort of predictions, and most medicine is in a similar bind, because we don't think it is ethical to perform experiments on people (which is why we experiment on animals, but I really don't want to go there)."

Which of the so-called "soft sciences" have "science" in their names? Psychology? Sociology? Economics? hmm...

All three of these disciplines, I might add, use the scientific method you describe and are therefore sciences.
