Spurious correlations
Scatter plots with regression lines can make your analysis look "brilliant and mysterious", but they also make it easier to be dishonest
This chart has been doing the rounds on Twitter for a week now, and I ought to comment on it.
This was initially tweeted by Amihai Glazer, an economist at UC Irvine. It was brought to my attention when Nassim Nicholas Taleb started ranting about this.
So what do we have here? It’s a simple scatter plot. The X axis has the average physician salary. The Y axis has the mortality rate from Covid-19. Unfortunately the units of the latter are not mentioned. Physician salary, I’m assuming, is annual.
We have a nice scatter plot (though I don’t personally like empty circles), with states that are not in the big cluster being explicitly named. So far so good. And then we also have a regression line. Still very good (?).
The question is whether this graph actually conveys the information claimed by the headline, and whether the headline itself is valid.
The first thing in this graph that should set your alarm bells ringing is the choice of the X axis. Unlike bar graphs, there is no rule in scatter plots that the axes need to start from zero. In fact, when it comes to scatter plots, the honourable thing to do is to choose axes that tightly bound the set of points being shown.
Instead, here we have an X axis going down to zero, with pretty much no data points in the left half of the graph. What we have instead is this part of the graph serving for the regression line going up and up and up, way beyond the last point it encountered.
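To make "tightly bound" concrete, here is a minimal Python sketch of how axis limits could be picked from the data rather than anchored at zero. The function name `tight_limits`, the 5% padding, and the salary figures are all made up for illustration; none of them come from the original chart.

```python
def tight_limits(values, pad_fraction=0.05):
    """Return (low, high) axis limits that hug the data with a small margin.

    pad_fraction is an arbitrary illustrative choice, not a standard.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values identical: fall back to a margin based on magnitude.
        span = abs(lo) or 1.0
    pad = span * pad_fraction
    return lo - pad, hi + pad


# Hypothetical annual salaries in dollars, NOT the data behind the chart.
salaries = [180_000, 205_000, 221_000, 243_000, 260_000]
x_min, x_max = tight_limits(salaries)
print(x_min, x_max)
```

With limits chosen this way the left half of the plot is not left empty, and there is no blank space into which a regression line can be stretched.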
Which brings us to the next red flag: regression is fundamentally an interpolation tool and, under normal circumstances, should NOT be used for extrapolation. Linear regression assumes a linear relationship and finds the best fit for the data points within the range that the independent variable covers. It tells us nothing about how points might lie outside the range covered by the data used to build the model.
So extending regression lines beyond the range of the given data is bad practice (in fact, in R's ggplot2, for example, the fitted line drawn by geom_smooth terminates, by default, at the ends of the data given).
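As a rough sketch of what "terminating at the ends of the data" means, the Python below fits an ordinary least squares line and returns only the segment between the smallest and largest observed X. The helper names and the numbers are hypothetical, chosen purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def line_segment_within_data(xs, ys):
    """Endpoints of the fitted line, clipped to the observed range of x."""
    slope, intercept = fit_line(xs, ys)
    x_lo, x_hi = min(xs), max(xs)
    return (x_lo, intercept + slope * x_lo), (x_hi, intercept + slope * x_hi)


# Made-up data: salary in thousands of dollars vs. some mortality figure.
xs = [180, 205, 221, 243, 260]
ys = [3.1, 2.4, 4.0, 3.3, 4.6]
start, end = line_segment_within_data(xs, ys)
print(start, end)
```

Drawing only the segment from `start` to `end` keeps the line inside the region where the model has actually seen data.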
Then, there is nothing in the graph to show the margin of error. Regression, after all, finds the “line of best fit”. And unless the points are collinear, there is always an error band around this line of best fit in which points can lie. Here, for example, we see several data points that lie very far from the regression line drawn (look at the top part of the graph here), and not mentioning some measure of margin of error is plain dishonest.
For example, the regression equation itself (along with R Square) would provide a good illustration of margin of error. The 95% confidence interval of the slope of the regression could be another method of building confidence in the regression. And when there is nothing mentioned that builds confidence in the model, it is best to assume no confidence in the model.
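Here is a minimal Python sketch of the numbers one might report alongside such a line: the regression equation, R Square, and an approximate 95% confidence interval for the slope. It rests on assumptions worth flagging. The data are made up, the function name `regression_summary` is hypothetical, and the interval uses a normal quantile as a large-sample approximation; for small samples the usual choice is a Student t quantile with n - 2 degrees of freedom, which gives a somewhat wider interval.

```python
import math
from statistics import NormalDist


def regression_summary(xs, ys, confidence=0.95):
    """Slope, intercept, R squared and an approximate CI for the slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares around the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_res / syy

    # Standard error of the slope (needs n > 2).
    se_slope = math.sqrt(ss_res / (n - 2) / sxx)

    # ASSUMPTION: normal quantile instead of Student t, for simplicity.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "slope_ci": (slope - z * se_slope, slope + z * se_slope),
    }


# Made-up data, not the salary and mortality figures from the chart.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
summary = regression_summary(xs, ys)
print(summary)
```

If the interval for the slope comfortably includes zero, or R Square is small, the line deserves little trust, however steep it looks on the page.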
And then, there is the problem of randomness and spurious correlations. Check out this simulation by Taleb. He just takes bunches of random normal data, and then plots them along with regression lines. The points, though randomly chosen, seldom come out looking uncorrelated.
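The same idea can be sketched in a few lines of Python. This is not Taleb's code, only a rough imitation of the experiment as described: draw two independent sets of standard normal numbers, compute their sample correlation, and repeat. The sample size of 50 per trial, the trial count, and the 0.2 threshold are arbitrary choices of mine.

```python
import math
import random


def sample_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def simulate_correlations(n_points=50, n_trials=1000, seed=0):
    """Correlations between pairs of INDEPENDENT standard normal samples."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        xs = [rng.gauss(0, 1) for _ in range(n_points)]
        ys = [rng.gauss(0, 1) for _ in range(n_points)]
        results.append(sample_correlation(xs, ys))
    return results


correlations = simulate_correlations()
share_large = sum(abs(r) > 0.2 for r in correlations) / len(correlations)
print(f"share of trials with |r| > 0.2: {share_large:.2%}")
```

Since X and Y are independent by construction, every non-zero correlation that comes out of this is pure noise, and each one would still get a regression line with a slope if plotted.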
Leaving aside statistical theory, the problem with this picture is that the regression line is not intuitive at all (the purpose of statistics sometimes is to go beyond intuition, I know). If you were to give the set of data points to a bunch of people and ask them to draw a line of best fit, it is extremely unlikely that their lines would match the regression line shown in the graph.
In the past, I’ve had graphics editors rejecting my scatter plots, telling me that “common readers won’t understand”. A corollary to this is that using a scatter plot can somehow make you look scientific. And adding a regression line to the picture can make you look even more scientific.
The problem is that “common readers” usually are unable to tell good science from bad. And so anything that looks scientific can carry a lot more weight than something that doesn’t.
Which brings me to the response to this picture from Morgan Housel, an investor who writes an excellent blog.
I would add that the fiction written in Word is easily known to be fiction. The problem with fiction written in Excel is that people can think it’s fact.
I also tried to find the data to see if I could replicate the graph and see how (dis)honest Amihai Glazer was. As you might have guessed, someone else had tried before me.
The problem is that Dunford again uses Excel to plot the graph, and that starts the axis at 0. Even though he finds an R Square of only 0.13, the slope looks steep. So I HAD to remake the graph (Dunford gives the data in a Google Doc in a subsequent tweet), and this is how I made it (didn’t bother much with beautification etc.)
Does the implication still hold?