LOS ANGELES—All news is local, even when it's posted on a website like BuzzFeed.com that caters to a national audience. Last month the popular site posted a story offering data-driven answers to the question, “Who Watches More Porn: Republicans Or Democrats?” The story used state-by-state comparisons based on data from Pornhub Insights, which offers “research and analysis directly from the Pornhub team.” No one was surprised when national and social media sat up and took notice, or when local media followed with stories on porn habits in their own states.
But statistics and policy wonks also took notice, posting a smattering of reactions over the past month. One of them, Jacob Harris, a senior software architect at The New York Times, even used the BuzzFeed/Pornhub findings to illustrate a lengthy cautionary tale for data journalists. Harris, who published “Distrust Your Data” May 22 on OpenNews.org, writes that his intent is not to “scold those who reported it,” but to provide “an explicit illustration of how reporting on data can go wrong and what we can learn from it.”
What went down, according to Harris, was the following: “Pornhub (which is apparently the third most-popular pornography site on the Internet) was approached by Buzzfeed (which is probably the most-popular animated GIF distributor on the Internet) to analyze its traffic and determine whether ‘blue’ states that voted for Obama in the last election consumed more pornography than ‘red’ states that voted for Romney. And so, that’s what the statisticians at Pornhub did, pulling IP addresses from their website’s traffic logs, geocoding their likely locations and deriving a figure of total traffic for each state. They then divided the total hits from each state by that state’s population to derive a hits-per-capita number for each state. As a result, they were able to report that per-capita averages for each state and that blue states averaged slightly more hits per capita than red states.”
“Unfortunately,” he adds, “the study and the subsequent reporting derived from the Pornhub data serves as a vivid example of six ways to make mistakes with statistics:
* Sloppy proxies
* Dichotomizing
* Correlation does not equal causation
* Ecological inference
* Geocoding
* Data naiveté”
Regarding proxies, he observes, “In this case, they used page requests to the third most-popular online porn site as a proxy for all pornography consumption and the percentage of the people who voted for Obama or Romney as proxies for registered Democrats and Republicans. These proxies are not the same thing, so distortion is inevitable.”
Explaining the “dichotomizing” effect, he states, “For their analysis, Pornhub sorted states into red and blue ones. This seems like it makes sense, but they’ve flattened a continuous variable (the percentage of the state population that voted for Obama) into a binary condition (Romney wins/Obama wins). It’s likely this dichotomizing had a palpable effect, since it makes a battleground state like Virginia seem closer to a Democratic stalwart like Vermont than its ideological ‘red state’ neighbors in the South.”
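The flattening Harris describes can be sketched in a few lines of Python. This is an illustration only, not the method Pornhub used: the vote shares below are approximate, rounded 2012 figures supplied here as assumptions, North Carolina stands in for a “red state” neighbor, and the `dichotomize` helper and its 50 percent threshold are hypothetical simplifications.

```python
# Approximate, rounded 2012 Obama vote shares (percent).
# Assumed values for illustration; they are not taken from the study.
obama_share = {"Vermont": 66.6, "Virginia": 51.2, "North Carolina": 48.4}


def dichotomize(share, threshold=50.0):
    """Collapse a continuous vote share into a binary red/blue label.

    Simplification: a share above the threshold is treated as an Obama win.
    """
    return "blue" if share > threshold else "red"


labels = {state: dichotomize(share) for state, share in obama_share.items()}

# After dichotomizing, Virginia and Vermont carry the same label...
same_label = labels["Virginia"] == labels["Vermont"]

# ...even though Virginia's share sits much closer to North Carolina's.
gap_to_vermont = abs(obama_share["Virginia"] - obama_share["Vermont"])
gap_to_neighbor = abs(obama_share["Virginia"] - obama_share["North Carolina"])

print(labels)
print(same_label, round(gap_to_vermont, 1), round(gap_to_neighbor, 1))
```

Keeping the vote share as a continuous variable, for instance on the horizontal axis of a scatterplot, preserves the distinction that the binary label throws away.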
The next two items on the list are “two of the most classic mistakes people make with statistics,” says Harris, who writes of the correlation-versus-causation warning, “You’ve probably heard that a hundred times before, but this here is an actual illustration of why that matters. It’s entirely possible that the suggested relationship between the two variables is a total coincidence. Far more likely though is that the variables are related but only through a confounding variable that connects the two variables observed.”
Of “ecological inference,” he notes, “For the sake of argument, let’s assume that we’ve avoided all these other problems above. Let’s decide Internet porn is a valid proxy for all pornography, that votes for a specific candidate in the last presidential election is a valid measure of party affiliation, that the correlation is not due to any hidden variables, then we can definitively say that Democrats consume more porn than Republicans, right? Wrong. Meet the ecological inference fallacy. In short, just because you’ve derived some average measure about a group that contains more of a subpopulation, that doesn’t necessarily mean it’s true for individuals in that group, especially when the difference is so slight.”
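A toy example shows how the fallacy can arise. Every number below is synthetic and chosen purely to make the point; the state names, party shares, and per-person rates are assumptions, not figures from the Pornhub data.

```python
# Entirely synthetic numbers, invented to illustrate the fallacy.
# Each state: share of residents who are Democrats, and hits per person
# for members of each party.
states = {
    "Blue-leaning": {"dem_share": 0.6, "dem_hits": 2.0, "rep_hits": 8.0},
    "Red-leaning": {"dem_share": 0.4, "dem_hits": 1.0, "rep_hits": 5.0},
}


def per_capita(state):
    """State-wide average: a weighted mix of the two party-level rates."""
    dem = state["dem_share"] * state["dem_hits"]
    rep = (1 - state["dem_share"]) * state["rep_hits"]
    return dem + rep


averages = {name: per_capita(s) for name, s in states.items()}

# Aggregate view: the state with more Democrats has the higher average.
blue_state_higher = averages["Blue-leaning"] > averages["Red-leaning"]

# Individual view: within every state, Republicans have the higher rate.
reps_higher_everywhere = all(s["rep_hits"] > s["dem_hits"] for s in states.values())

print({name: round(avg, 2) for name, avg in averages.items()})
print(blue_state_higher, reps_higher_everywhere)
```

In this invented population the state-level comparison points one way and the individual-level comparison points the other, which is why a state average cannot settle a question about individual voters.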
For Harris, however, “the worst error was yet to come”: the Kansas conundrum, which comes in for extra scrutiny because it represented a “bizarre anomaly in the data: Kansas, a very red state, consumed an extremely high amount of porn per capita compared to the average for all other states. This is readily apparent when the numbers are graphed in a simple bar chart, but it really jumps out when the states are plotted on a scatterplot of Obama vote share vs. page hits.”
He continues, “If you assumed, as Pornhub did, that average porn consumption was normally distributed across all states, Kansas’ average was highly unlikely. At more than 2.95 standard deviations above the average, there would be a 0.16% chance of that occurring if it were truly random. An extreme outlier like this should make you sit up and take notice as a data journalist, because it can only mean one of two things. Either you’ve really found an extreme case that reveals something bizarre and newsworthy. Or—as one reader of Andrew Sullivan’s website figured out while all the journalists shrugged their shoulders—the data is flawed.”
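The 0.16 percent figure can be checked with Python's standard library. The sketch below assumes, as the quoted passage does, that state averages follow a normal distribution, and it computes the one-sided upper tail; the function name is illustrative.

```python
import math


def upper_tail_probability(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z_kansas = 2.95  # standard deviations above the mean, as quoted by Harris
p = upper_tail_probability(z_kansas)

print(f"{p:.2%}")  # prints 0.16%
```

A two-sided test, which would also count equally extreme values below the mean, would roughly double that probability.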
Pornhub, he adds, “omitted any explicit description of their methodology—this is never a good sign—but it seems to have involved mapping the IP addresses from which users visited the site to physical addresses and reverse geocoding those to get states. The statisticians at Pornhub (and the journalists who confidently reported their findings) assumed this was a clean process, but any programmer with experience can tell you the bitter truth: geocoding is often rubbish.
“What happened here was that a large percentage of IP addresses could not be resolved to an address any more specific than ‘USA,’” he explains. “When that address was geocoded, it returned a point in the centroid of the continental United States, which placed it in the state of—you guessed it—Kansas!”
As a point of reference, he adds, “Right now, my corporate VPN makes me look like I’m surfing the web from New Jersey even though I live in Maryland.”
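One way to catch a geocoder fallback of this kind is to look for a single coordinate that accounts for an implausible share of all records. The sketch below is a rough check under stated assumptions: the `suspicious_points` helper, the 5 percent threshold, the sample coordinates, and the approximate Kansas placeholder point are all hypothetical, and nothing here reflects how Pornhub actually processed its logs.

```python
from collections import Counter


def suspicious_points(coords, max_share=0.05):
    """Return coordinates whose share of all records exceeds max_share.

    Many records resolving to one identical point often signals a geocoder
    fallback (for example, a country centroid) and not real users.
    """
    counts = Counter(coords)
    total = len(coords)
    return {pt: n / total for pt, n in counts.items() if n / total > max_share}


# Hypothetical geocoder output as (lat, lon) pairs; all values are made up.
FALLBACK = (39.83, -98.58)  # assumed placeholder point in Kansas
coords = [FALLBACK] * 30 + [(40.0 + i * 0.01, -75.0 - i * 0.01) for i in range(70)]

flagged = suspicious_points(coords)
print(flagged)
```

Records flagged this way are better set aside as “location unknown” than credited to whichever state happens to contain the default point.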
Is there a lesson here for data journalists? Yes, insists Harris, explaining, “If you want to call yourself a data journalist, there is one shortcut you can never take: you must validate your data. Even the cleanest-looking data might contain flaws and omissions stemming from its methodology. It’s not enough to run checks on the data itself. You must also lift your nose out of the database, ask the serious questions about how the data was collected and even use the well-honed tools of a traditional reporter to call experts when—never an if—you find questions about the data.”
Other advice proffered by Harris to journalists includes distrusting the motives of the entities pushing data-based claims. “What angered me the most about this study,” he argues, “is that it was clearly framed from the start to go viral. You’d have to be willfully naive about the motivations of Pornhub and Buzzfeed to assume they wanted anything else here.”
That sounds as though Harris is going back on his promise not to “scold” BuzzFeed, but he explains, “You might argue why should I care so much about a bit of viral silliness from Buzzfeed? First, I would argue it’s never just 'all in fun' when you’re declaring half of the electorate more perverted than the other half. But more importantly, I don’t think the errors illustrated here are an aberration. Here’s another example of blindly trusting data to reach wrong conclusions. And another.”
He adds, somewhat pessimistically, “I fear it will only get worse as publishing cycles become faster and the data analysis is done by single reporters harried by deadline pressure and nobody to cross-check their work before publication. I don’t think we can slow this trend down, but what can data journalists do to avoid slamming into these sorts of problems at full speed?”
It is, of course, a rhetorical question, and one he endeavors to answer in the piece.