Obscured by graphics
We begin our exploration into data visualisation with a map from the New York Times
Hello and welcome.
My name is Karthik Shashidhar. I’m 37 years old and I live in Bangalore, India.
For 8-9 years now, I’ve been helping companies make sense of their data - in terms of using it to solve their own strategic business problems, enhance their product offerings, discover ways to monetise their data and launch new lines of business.
Along the way, I’ve written a book on market design, written columns for Mint and Hindustan Times, taught at IIM Bangalore and done some policy research for Takshashila. To know more about me, you can follow me on Twitter, or subscribe to my blog (over 2500 posts in over 15 years), or just check me out on LinkedIn.
Through the years, as I’ve helped businesses make sense of their data, or made sense of someone else’s data for my newspaper writing, I’ve been very particular about the way information is presented, and visualised. I’ve always believed that presenting data in the right manner can deliver insights in novel ways, and a lot of my consulting experience has borne this out.
In 2018, after I had put out one tweet too many making fun of bad visualisations in the mainstream media, Krish Ashok suggested I start “collecting” them. And I did, setting up this Tumblr where I would put up bad visualisations, with a critique of what made them bad. I soon ran out of steam on that one, but people continue to send me badly done visualisations, in the hope that I can at least make fun of them on Twitter.
You could think of this as a more serious restart. I plan to send out this newsletter once a week (every Thursday noon IST, if I can maintain the discipline). I will still analyse one piece of visualisation from the mainstream media each week, but I intend to be more objective and rigorous, rather than simply making fun of it.
Once again, thanks for reading this. If you like it, please subscribe, comment and share with whoever else you think might like it. This newsletter will be free for the foreseeable future.
We will start with this infographic piece (possibly paywalled) by the New York Times documenting where New Yorkers moved to escape the high incidence of Covid-19 in the city.
The journalism here is first-rate. Getting data on mail forwarding from the US Postal Service and using that as a proxy for where New Yorkers have moved to is an amazing insight.
The visualisation is not particularly standard - it’s not often that you need to show geographical flows on a map. I would possibly describe this as a “Sankey diagram on a map” (some more “research” on Wikipedia tells me these are called “flow maps”).
Sankey diagrams (they have NOTHING to do with Bangalore’s Sankey Tank) are diagrams that show flows, with the width of an arrow indicating the volume of flow from the source to destination. They have their origins in thermodynamics.
So the width of the arrow in the above map is an indicator of how many people asked for their mail to be forwarded from New York City to each of these locations (which is a good proxy for how many people moved from New York City to these locations).
The map gives a good initial picture of where people went. For example, it is clear that many more people went to the Miami/Fort Lauderdale area than to the Houston area. Many more people went to Los Angeles than to San Francisco. And so on.
However, this kind of a representation has several shortcomings.
To start with, comparing widths of arrows is not straightforward. For example, if I were to ask you how many more people went to Miami than to Orlando, it is impossible to give that answer using this map. At best you can say “a large number”.
Then, maps like this might do well when indicating flows across long distances, but not so well for short distances. Look at the area around New York in this map - it’s a muddled mix of arrows and arrowheads, and it’s very unclear which are arrows and which are arrowheads. It is highly likely that when you came across this map, you paid no attention at all to this part, from Boston to Washington DC.
To be fair, this is not the only graphic in the story. There are more graphics, including a blow-up of the North-East, a map of all the locations where mail is being forwarded to, and a map of New York neighbourhoods most emptied in the lockdown (based on this data). So the makers of the above graphic could be excused for saying that this graphic is not exhaustive.
Then again, hidden in the middle of the piece (without much attention, and without any of the graphics highlighting it) is this line:
Many New Yorkers who fled their homes in the city moved to nearby areas in Long Island, New Jersey and upstate New York.
It is very likely that you would have missed this line when you read the story. I did. If the number of people who moved nearby is large, it sort of contradicts the implicit message in the story, which is that rich New Yorkers have moved all over the US to escape the pandemic.
So how many people fled their homes and moved to nearby areas?
The text doesn’t provide the answer, but you can find it in a table at the end of the piece (I really appreciate the Times for adding this table here). This shows the number of mail forwarding requests by the city of destination. I decided to use that to make a simple bar graph.
Taking only the top 20 destinations into account, over half the people who asked for their mail to be forwarded to secondary addresses moved somewhere close by, to “nearby areas”. Miami is a very distant second.
And this is the main problem I have with the “Sankey diagram on a map” representation - the biggest data point has been completely obscured. How many more people went to Miami than to LA is insignificant next to the number of people who just relocated elsewhere in or around New York City.
I really don’t know if this is an honest mistake.
I hope you liked this inaugural edition. Again, please let me know what you think (replying to this email is the easiest way to comment). Feel free to forward it to whoever else you think might like it, and ask them to subscribe. I’ll be back next week with another piece of data visualisation.
If there is a particular graphic that you want me to analyse sometime in this newsletter, your inputs are welcome. Again, you can simply reply to this email to send them in.
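To pre-empt those of you who are curious about how I made the bar chart, I used R, and the ggplot2 package. Here is the code I used:
library(dplyr)    # provides the %>% pipe
library(ggplot2)

newYorkData %>%
  ggplot(aes(x = reorder(Area, Requests), y = Requests)) +
  geom_col() +
  # Label each bar with the request count and its share of the total.
  # Labels sit inside long bars (white text) and outside short ones (black text).
  # Note: label_number_si() is deprecated in newer releases of scales.
  geom_text(aes(label = paste0(scales::label_number_si(0.1)(Requests), ' (',
                               scales::percent(Requests / sum(Requests), .1), ')'),
                hjust = ifelse(Requests > max(Requests) / 2, 1, 0),
                colour = ifelse(Requests > max(Requests) / 2, 'white', 'black')),
            fontface = 'bold', size = 4) +
  coord_flip() +               # horizontal bars
  scale_colour_identity() +    # treat 'white' and 'black' as literal colours
  theme_minimal() +
  theme(legend.position = 'none',
        title = element_text(face = 'bold', size = 16),
        text = element_text(face = 'bold', size = 14)) +
  xlab('') +
  scale_y_continuous('', breaks = c()) +    # no axis title or tick marks for the values
  labs(title = "Where New Yorkers asked their mails to be forwarded to",
       caption = 'Made by Karthik Shashidhar, using data from nytimes.com')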
Pretty cool!!
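For anyone who wants to try the chart code, here is a minimal sketch of the input it seems to expect: a data frame called newYorkData with one row per destination, a text column Area and a numeric column Requests. The rows below are hypothetical placeholders and not the figures from the Times table, so swap in the real numbers before drawing any conclusions. The sketch also works out each destination's share of the listed requests, which is the percentage shown in the bar labels.

```r
# Hypothetical placeholder rows: these are NOT the figures from the Times table.
newYorkData <- data.frame(
  Area = c("Area A", "Area B", "Area C"),
  Requests = c(600, 250, 150)
)

# Share of all listed requests going to each destination,
# the same Requests / sum(Requests) ratio used in the bar labels.
newYorkData$Share <- newYorkData$Requests / sum(newYorkData$Requests)

# Sort from largest to smallest to see which destination dominates.
newYorkData <- newYorkData[order(-newYorkData$Requests), ]
print(newYorkData)
```

The Share column here is only a convenience for inspection. The chart code computes the same ratio on the fly, so it needs just the Area and Requests columns.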