Data Visualization #10–Visual Data and Causality

Researchers and analysts use data visualizations mostly to describe phenomena of interest; that is, to answer “who”, “what”, “where”, and “when” questions. Sometimes, however, data visualizations are meant to explain a phenomenon of interest. In social science, when we “explain” we are answering “how” and/or “why” questions. In essence, we are discussing causality. While social scientists are taught that a simple data visualization is never enough to settle claims of causality, in the real world we often see simple charts passed off as evidence of a causal relationship between our phenomena of interest. Here’s an example I’ve seen on social media, used to argue that government policies on the wearing of face masks and on limiting the operations of businesses have no impact on the spread of the COVID-19 virus. Here’s the chart:

What are we meant to infer from the data contained in this chart? In two (of the 50 states plus DC) US states, the trajectory of infections seems to be very similar over the past 10 months or so, despite the fact that in one of the states–South Dakota–there have been no restrictions on businesses and no mask mandates, while both have been part of the policy repertoire in neighbouring North Dakota. While this chart may seem compelling, it cannot be used to argue that mask mandates and business restrictions have no effect on the spread of COVID-19.

The main problem with these types of charts is that they depict simple bivariate (two-variable) relationships. In this case, we presumably see “data” (I’ll address the quality of these data below) on mask and business policies, and on infection rates. We are then encouraged to causally link these two variables. Unfortunately, that’s not at all how social science (or any science) is done. The social world is complex, and rarely is one thing caused by only one other thing and nothing else. This is what we call the ceteris paribus (all other things being equal) criterion. In other words, there may be a host of factors that contribute to COVID-19 infection rates other than mask and business policies. How do we know that one, or more, of these other factors is not having an impact on the infection rates? Based on this chart, we don’t. That being said, by comparing two very similar states, the creators of this chart are seemingly aware of the ceteris paribus condition. Choosing states with similar demographic, economic, geographic, etc., profiles (as is often done in comparative analysis) does indeed mitigate, to some extent, the need to “control for” the many other factors (besides mask and business policy) that are known to affect COVID-19 infection rates. But we still can’t rule out that something else is actually causing the variation in infection rates that we see in the chart.
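To see why this matters, here is a toy simulation (the numbers are entirely invented for illustration) in which a lurking third variable drives infections in both states, so that two states with opposite policies show nearly identical trajectories:

```r
# Toy simulation of a confounder: both states' infection counts track a
# shared lurking variable (say, seasonal indoor contact), not mask policy.
set.seed(42)
weeks   <- 1:40
lurking <- cumsum(rnorm(40, mean = 50, sd = 10))   # shared driver of infections

state_a <- lurking + rnorm(40, sd = 20)   # state WITH a mask mandate
state_b <- lurking + rnorm(40, sd = 20)   # state WITHOUT a mask mandate

# Despite opposite policies, the two trajectories are almost indistinguishable.
cor(state_a, state_b)
```

Because the shared driver dwarfs the state-specific noise, the correlation is close to 1. A bivariate chart of these two simulated states would look like compelling evidence that the policy does nothing, even though policy was never part of the data-generating process here.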

There are many other issues with the chart, but I will briefly address one more before closing with what I view as the most problematic issue.

First, we address the “operationalization” of the main explanatory (or independent) variable–the mask and business policies. In the chart, these are operationalized dichotomously–that is, each state is deemed either to have them (green checks) or not to have them (red crosses). But it should be blindingly obvious that this is a far from adequate measure. Here are just a few of the questions that come up: 1) How many regulations have been put in place? 2) How have they been enforced? 3) When were they enacted (a key issue)? 4) Are residents obeying the regulations? (There is ample evidence, for example, that even where mask mandates exist, they are not being enforced.)

Now we deal with what, in this case, I believe to be the major issue: the measurement of the dependent variable–the rate of infection. Unless we know that we have measured this variable correctly, any further analysis is useless. And there is strong evidence to suggest that the measurement of this variable is biased, thereby undermining the analysis.

The incidence rate used here is the number of positive tests divided by the population of each state. It should be obvious that the number of positive tests is affected to a large extent by the number of overall tests. Unless the testing rate across the two states is similar, we can’t use the number of positive tests as an indicator of the infection rate in the two states. And, lo and behold, the testing rate is far from similar: South Dakota is testing at a far lower rate than North Dakota.

Here we see that the rate of COVID-19 positives in the population seems to be very similar–about 12,000 per 100,000 population. However, North Dakota has conducted four times as many tests as South Dakota. Assuming the incidence of COVID-19 positivity is similar across the whole tested population, the data severely undercount the incidence rate of COVID-19 in South Dakota. Indeed, had South Dakota tested as many residents as North Dakota, the measured COVID-19 infection rate in South Dakota would be considerably higher. If the positivity rate for the whole of the state is similar to that of the first 44,903 tested, there would be a total of more than 46,000 positive tests, which would equate to an infection rate of 46930/(173987/100000), or about 27,000 per 100,000–more than double the rate in North Dakota. Not only can we not prove (based on the data in the chart above) whether mask and business policies are having an effect on the dependent variable–the positive rate of COVID-19–we can see that the measurement of the dependent variable is flawed. We have to first account (or “control”) for the number of COVID-19 tests given in each state before calculating the positivity rate per 100,000 residents. Once we do that, we see that the implied premise of the first chart (that the Dakotas have relatively similar infection rates) does not stand. The infection rate in South Dakota is at least twice the infection rate in North Dakota.
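The back-of-the-envelope adjustment above can be checked with a few lines of R (the figures are those quoted in the text; the constant-positivity assumption is as stated):

```r
# Figures quoted in the text above.
nd_tests          <- 173987   # tests conducted in North Dakota (about 4x SD's)
implied_positives <- 46930    # SD positives implied if its sample positivity held state-wide

# Infection rate per 100,000, under the constant-positivity assumption:
adjusted_rate <- implied_positives / (nd_tests / 100000)
round(adjusted_rate)   # about 27,000 per 100,000
```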

Data Visualization #9–Non-ideal use of Stacked Bar Plots

Stacked bar plots (charts) are a very useful data visualization type…when used correctly. In an otherwise excellent report on the “Escalating Terrorism Problem in the United States” from the Center for Strategic and International Studies, there is a problematic stacked bar chart (actually, a stacked percentage chart) that should have been replaced by a grouped bar chart (or something else). Here is the chart that, in my opinion, is problematic:

I believe this chart is problematic because it can obscure the nature (and trend) of the underlying data. The chart above is consistent with any number of underlying data patterns. Just as an example, let’s look at 2019 and 2020. We have the following percentage breakdown over the two years:

[Table: percentage breakdown of terrorist incidents by type of violence, 2019 vs. 2020]

While it is obvious that ethnonationalist and left-wing violence have decreased (they are at 0% in 2020), it is not clear whether right-wing and religious violence have increased or decreased in absolute terms. Does right-wing violence in 2020 comprise 93% of 14 acts of terrorist violence? Or is it 93% of 200 acts? We don’t know. To be fair to the authors of the report, they do provide a breakdown in absolute numbers later in the report. Still, I believe a stacked bar/percentage chart is more appropriate when the absolute number of instances is (relatively) static over the time/area of comparison.
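For illustration, a grouped bar chart of this kind of data could be sketched in ggplot2 as follows. The counts below are invented for the sketch–they are not the CSIS report’s figures–but the point is the form: `position = "dodge"` keeps the absolute counts visible instead of collapsing them into within-year percentages.

```r
library(ggplot2)

# Hypothetical incident counts, invented purely to illustrate the chart type.
terror_df <- data.frame(
  year      = rep(c(2019, 2020), each = 3),
  type      = rep(c("Right-wing", "Left-wing", "Religious"), times = 2),
  incidents = c(44, 10, 8, 52, 0, 4)
)

# position = "dodge" places the bars side by side, preserving absolute counts.
ggplot(terror_df, aes(x = factor(year), y = incidents, fill = type)) +
  geom_col(position = "dodge") +
  labs(x = "Year", y = "Number of incidents", fill = "Type of violence")
```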

Here’s an example from college football. The Pacific-12 conference has two divisions–North and South. Every year, each of the 6 teams in each division plays against 4 of the teams in the other division, for a total of 24 inter-divisional games every year. In addition, there is a PAC12 Championship Game, which pits the winners of the two divisions against each other at the end of the year. Therefore, there are 25 total inter-divisional PAC12 football games every year. A stacked percentage chart can be used to gauge the relative winning percentages of the two divisions against each other since the establishment of the PAC12 conference in 2011 (when Utah and Colorado were added).

Created by Josip Dasović

Here, each of the years refers to a total of 25 inter-divisional games. We can easily gauge the relative quality of the two divisions by comparing the percentage of games each won over the other between 2011 and 2019. We see that the North (which, by the way, produced 8 of the 9 PAC12 champions during this period) has generally been stronger. In 6 of the 9 years, the North won a greater percentage of the inter-divisional games than did the South. And even in the years when the South won a greater percentage of the inter-divisional games, its margin wasn’t much greater.
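A stacked percentage chart of this kind can be sketched with `position = "fill"` in ggplot2. The win counts below are hypothetical stand-ins (the real figures are in the chart above); what matters is that each year sums to the same 25 games, which is exactly the condition that makes a percentage chart appropriate.

```r
library(ggplot2)

# Hypothetical inter-divisional win counts; every year sums to 25 games.
pac12_df <- data.frame(
  year     = rep(2011:2019, each = 2),
  division = rep(c("North", "South"), times = 9),
  wins     = c(15, 10, 14, 11, 13, 12, 16, 9, 12, 13,
               17, 8, 14, 11, 11, 14, 15, 10)
)

# position = "fill" converts counts to within-year shares, which is safe here
# because the yearly totals are constant.
ggplot(pac12_df, aes(x = factor(year), y = wins, fill = division)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Year", y = "Share of inter-divisional wins", fill = "Division")
```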

So, use stacked percentage charts only when it is appropriate.

Data Visualization #8–a Treemap Addendum

A quick addendum to my last post, using treemaps, to begin the new year. As a reminder, I drew a couple of treemaps that showed the distribution of votes across US counties during the 2016 Presidential Election(s). There are more than 3,200 counties in the USA, and the vast majority of them have low populations. In fact, fewer than 200 counties (less than 7%) contain more than half of the population; the other ~3,000 counties comprise the remaining roughly 50%. In short, the distribution of people (and, therefore, of voters) is highly skewed. In fact, here’s a bonus chart–a histogram of US counties by population.

As we can see, the vast majority of counties have small populations, while a few counties have very large populations, including Los Angeles County, in which almost 3.5 million persons voted. The counties with large populations are so few in number that we can’t even see them on the chart. A count of 1 on the chart (y-axis) is a vertical distance that isn’t even 1 pixel in size, so it doesn’t show up on the graph.
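A histogram like the bonus chart above can be sketched as follows. The county vote totals themselves aren’t reproduced here, so this sketch simulates a similarly skewed (log-normal) distribution purely for illustration:

```r
library(ggplot2)

# Simulated stand-in for the ~3,200 county vote totals: a heavily
# right-skewed log-normal distribution, invented for illustration only.
set.seed(1)
county_df <- data.frame(totalvote = rlnorm(3200, meanlog = 9, sdlog = 1.3))

ggplot(county_df, aes(x = totalvote)) +
  geom_histogram(bins = 100) +
  labs(x = "Total votes cast in county", y = "Number of counties")
```

With real data, the handful of giant counties sit so far out in the right tail, with counts of 1, that their bars are effectively invisible at screen resolution–the problem described above.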

I’ve updated one of the treemaps from my previous post slightly to help reinforce the disparity in population size between the largest counties and the rest. In the treemap below, I’ve divided the counties into two groups–the largest counties versus the rest–so that each group comprises 50% of the total votes cast. We see, again, that a small number of counties (154, to be exact) combined to produce as many votes as the remaining ~3,000 counties. Once again, we see that the counties won by Trump were, on average, so small that there is not even a hint of red on the map. Here’s the treemap, with the R code below:

library(ggplot2)
library(treemapify)

# Faceted treemap: counties sized by total votes, coloured by the Trump-Clinton
# vote margin, and split into two panels by county-size group.
gg.tree.tot.facet <- ggplot(us_df_final_facet[us_df_final_facet$State.Name != "Hawaii", ],
        aes(area = totalvote, fill = vote_win_diff, label = NAME, subgroup = State.Name)) +
        geom_treemap() +
        geom_treemap_subgroup_border(colour = "black", size = 2) +
        geom_treemap_subgroup_text(place = "centre", grow = FALSE, alpha = 0.5, colour =
                                           "black", fontface = "italic", min.size = 0) +
        geom_treemap_text(colour = "black", place = "centre", reflow = TRUE) +
        scale_fill_distiller(type = "div", palette = 5, direction = 1, guide = "coloursteps",
        limits = c(-2000000, 2000000), breaks = seq(-2000000, 2000000, by = 500000),
        labels = c("2000000", "1500000", "1000000", "500000", "0", "500000", "1000000", "1500000", "2000000")) +
        labs(title = "US 2016 Presidential Election by County (Areas Proportional to Total Votes in County)",
             fill = "Difference\u2013County Vote Totals between Trump (red) & Clinton (blue)") +
        theme(legend.key.height = unit(0.6, "cm"),
              legend.key.width = unit(2, "cm"),
              legend.text = element_text(size = 7),
              plot.title = element_text(hjust = 0.5, size = 14, vjust = 1),
              legend.position = "bottom") +
        guides(fill = guide_coloursteps(title.position = "top", title.hjust = 0.5),
               size = guide_legend(title.position = "top", title.hjust = 0.5)) +
        facet_wrap(~ countysize, scales = "free")