Data Visualization #20—”Lying” with statistics

In teaching research methods courses in the past, a tool that I’ve used to help students understand the nuances of policy analysis is to ask them to assess a claim such as:

In the last 12 months, statistics show that the government of Upper Slovobia’s policy measures have contributed to limiting the infollcillation of ramakidine to 34%.

The point of this exercise is two-fold: 1) to teach them that the concepts we use in social science are almost always socially constructed, and we should first understand the concept—how it is defined, measured, and used—before moving on to the next step of policy analysis. When the concepts used—in this case, infollcillation and ramakidine—are ones nobody has every heard of (because I invented them), step 1 becomes obvious. How are we to assess whether a policy was responsible for something when we have zero idea what that something even means? Often, though, because the concept is a familar one—homelessness, polarization, violence—we often skip right past this step and focus on the next step (assessing the data).

2) The second point of the exercise is to help students understand that assessing the data (in this case, the 34% number) can not be done adequately without context. Is 34% an outcome that was expected? How does that number compare to previous years and the situation under previous governments, or the situation with similar governments in neighbouring countries? (The final step in the policy analysis would be to set up an adequate research design that would determine the extent to which the outcome was attributable to policies implemented by the South Slovobian government.)

If there is a “takeaway” message from the above, it is that whenever one hears a numerical claim being made, first ask yourself questions about the claim that fill in the context, and only then proceed to evaluate the claim.

Let’s have a look at how this works, using a real-life example. During a recent episode of Real Time, host Bill Maher used his New Rules segment to admonish the public (especially its more left-wing members) for overestimating the danger to US society of the COVID-19 virus. He punctuated his point by using the following statistical claim:

Maher not only claims that the statistical fact that 78% of COVID-19-caused fatalities in the USA have been from those who were assessed to have been “overweight” means that the virus is not nearly as dangerous to the general USA public as has been portrayed, but he also believes that political correctness run amok is the reason that raising this issue (which Americans are dying, and why) in public is verboten. We’ll leave aside the latter claim and focus on the statistic—78% of those who died from COVID-19 were overweight.

Does the fact that more than 3-in-4 COVID-19 deaths in the USA were individuals assessed to have been overweight mean that the danger to the general public from the virus has been overhyped? Maher wants you to believe that the answer to this question is an emphatic ‘yes!’ But is it?

Whenever you are presented with such a claim follow the steps above. In this case, that means 1) understand what is meant by “overweight” and 2) compare the statistical claim to some sort of baseline.

The first is relatively easy—the US CDC has a standard definition for “overweight”, which can be found here: https://www.cdc.gov/obesity/adult/defining.html. Assuming that the definition is applied consistently across the whole of the USA, we can move on to step 2. The first question you should ask yourself is “is 78% low, or high, or in-between?” Maher wants us to believe that the number is “high”, but is it really? Let’s look for some baseline data with which to compare the 78% statistic. The obvious comparison is the incidence of “overweight” in the general US population. Only when we find this data point will we be able to assess whether 78% is a high (or low) number. What do we find? Let’s go back to the US CDC website and we find this: “Percent of adults aged 20 and over with overweight, including obesity: 73.6% (2017-2018).”

So, what can we conclude? The proportion of USA adults dying from COVID-19 who are “overweight” (78%) is almost the same proportion of the USA adult population that is “overweight (73.6%).” Put another way, the likelihood of randomly selecting a USA adult who is overweight versus randomly selecting one who is not overweight is 73.6/26.4≈3.29. If one were to randomly select an adult who died from COVID-19, one would be 78/22≈3.55 times more likely to select an overweight person than a non-overweight person. Ultimately, in the USA at least, as of the end of April overweight adults are dying from COVID-19 at a rate that is about equal to their proportion in the general adult US population.

We can show this graphically via a pie chart. For many reasons, the use of pie charts is generally frowned upon. But, in this case, where there are only two categories—overweight, and non-overweight—pie charts are a useful visualization tool, which allows for easy visual comparison. Here are the pie charts, and the R code that produced them below:

Created by: Josip Dasović

We can clearly see that the proportion of COVID-19 deaths from each cohort—overweight, non-overweight—is almost the same as the proportion of each cohort in the general USA adult population. So, a bit of critical analysis of Maher’s claim shows that he is not making the strong case that he believes he is.

# Here is the required data frame
covid.df <- data.frame("ADULT"=rep(c("Overweight", "Non-overweight"),2), 
                       "Percentage"=c(0.736,0.264,0.78,0.22),
                       "Type"=rep(c("Total Adult Population","COVID-19 Deaths"),each=2))

library(ggplot2)

# Now the code for side-by-side pie charts:

ggpie.covid <- ggplot(covid.df, aes(x="", y=Percentage, group=ADULT, fill=ADULT, )) +
  geom_bar(width = 1, stat = "identity") +
  scale_fill_manual(values=c("#33B2CC","#D71920"),name ="ADULT CATEGORY") + 
  labs(x="", y="", title="Percentage of USA Adults who are Overweight",
       subtitle="(versus percentage of USA COVID-19 deaths who were overweight)") + 
  coord_polar("y", start=0) + facet_wrap(~ Type) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid  = element_blank(),
        plot.title = element_text(hjust = 0.5, size=16, face="bold"),
        plot.subtitle = element_text(hjust=0.5, face="bold"))

ggsave(filename="covid19overweight.png", plot=ggpie.covid, height=5, width=8)


Data Visualization #19—Using panelView in R to produce TSCS (time-series cross-section) plots

As part of a project to assess the influence, or impact, of Canadian provincial government ruling ideologies on provincial economic performance I have created a time-series cross-section summary of my party ideology variable across provinces over time. A time-series cross-section research design is one in which there is variation across space (cross-section) and also over time. The time units can literally be anything although in comparative politics they are often years. The cross-section part can be countries, cities, individuals, states, or (in my case) provinces. Here is the snippet of the data structure in my dataset (data frame in R):

canparty.df[1:20,c(2:4,23)]

Year                    Region                      Pol.Party party.ideology
1981                   Alberta Progressive Conservative Party              1
1981          British Columbia            Social Credit Party              1
1981                  Manitoba Progressive Conservative Party              1
1981             New Brunswick Progressive Conservative Party              1
1981 Newfoundland and Labrador Progressive Conservative Party              1
1981               Nova Scotia Progressive Conservative Party              1
1981                   Ontario Progressive Conservative Party              1
1981      Prince Edward Island Progressive Conservative Party              1
1981                    Quebec                Parti Quebecois              0
1981              Saskatchewan           New Democratic Party             -1
1982                   Alberta Progressive Conservative Party              1
1982          British Columbia            Social Credit Party              1
1982                  Manitoba           New Democratic Party             -1
1982             New Brunswick Progressive Conservative Party              1
1982 Newfoundland and Labrador Progressive Conservative Party              1
1982               Nova Scotia Progressive Conservative Party              1
1982                   Ontario Progressive Conservative Party              1
1982      Prince Edward Island Progressive Conservative Party              1
1982                    Quebec                Parti Quebecois              0
1982              Saskatchewan Progressive Conservative Party              1

To get the plot picture below, we use the R code at the bottom of this post. But a couple of notes: first, the year data are not in true date format. Rather, they are in periods, which I have conveniently labelled years. In other words, what is important for the analysis that I will do (generalized synthetic control method) is to periodize the data. Second, because elections occur at any point during the year, I have had to make a decision as to which party is coded as having been in government that year.

Since my main goal is to assess economic performance, and because economic policies take time to be passed, and to implement, I made the decision to use June 30th as a cutoff point. If a party was elected prior to that date, it is coded as having governed the province in that whole year. If the election was held on July 1st (or after), then the incumbent party is coded as having governed the province the year of the election and the new government is coded as having started its mandate the following year.

Here’s the plot, and the R code below:

library(gsynth)
library(panelView)
library(ggplot2)

ggpanel1 <- panelView(Prop.seats.gov ~ party.ideology + Prov.GDP.Cap, data = canparty.df, 
          index = c("Region", "Year"), main = "Provincial Ruling Party Ideology", 
          legend.labs = c("Left", "Centre", "Right"), col=c("orange", "red", "blue"), 
          axis.lab.gap = c(2,0), xlab="", ylab="")
## I've used Prop.seats.gov and Prov.GDP.Cap b/c they are two 
## of my IVs, but any other IVs could have been used to 
## create the plot. The important part is the party.ideology 
## variable and the two index variables--Region (province)  ## and Year.

## Save the plot as a .png file

ggsave(filename="ProvRulingParty.png", plot=ggpanel1, height=8,width=7)

Data Visualization #18—Maps with Inset Maps

There are many different ways to make use of inset maps. The general motivation behind their use is to focus in on an area of a larger map in order to expose more detail about a particular area. Here, I am using the patchwork package in R to place a series of inset maps of major Canadian cities on the map that I created in my previous post. Here is the map and a snippet of the R code below:

Created by: Josip Dasović

You can see that a small land area contains just under 50% of Canada’s electoral districts. Once again, in a democracy, citizens vote. Land doesn’t.

In order to use the patchwork package, it is helpful to first create each of the city plots individually and store those as ggplot objects. Here are examples for Vancouver and Calgary.

## My main map (sf) object is named can_sf. Here I'm created a plot using only 
## districts in the Vancouver, then Calgary areas, respectively. I also limit the
## districts to those that comprise the "red" (see map) 50% population group.

library(ggplot2)
library(sf)

yvr.plot <- ggplot(can_sf[can_sf$prov.region=="Vancouver and Northern Lower Mainland" &
                             can_sf$Land.50.Pop.2016==1 | can_sf$prov.region=="Fraser Valley and Southern Lower Mainland" &
                             can_sf$Land.50.Pop.2016==1,]) + 
  geom_sf(aes(fill = Land.50.Pop.2016), col="black", lwd=0.05) + 
  scale_fill_manual(values=c("red")) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "none",
        plot.margin=unit(c(0,0,0,0), "mm"))

yyc.plot <- ggplot(can_sf[can_sf$prov.region=="Calgary" &
                            can_sf$Land.50.Pop.2016==1,]) + 
  geom_sf(aes(fill = Land.50.Pop.2016), col="black", lwd=0.05) + 
  scale_fill_manual(values=c("red")) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "none",
        plot.margin=unit(c(0,0,0,0), "mm"))

I have created a separate ggplot2 object for each of the cities that are inset onto the main Canada map above. The final R code looks like this:

## Add patchwork library
library(patchwork)

## Note: gg50 is the original map used in the previous post.

gg.50.insets <- gg.50 + inset_element(yvr.plot, left = -0.175, bottom = 0.25, 
                            right = 0, top = 0.45, on_top=TRUE, align_to = "plot",
                            clip=TRUE) +
  labs(title="VANCOUVER") +
  theme(plot.title = element_text(hjust = 0.5, size=9, vjust=-1, face="bold"),
        panel.border = element_rect(colour = "black", fill=NA, size=1)) +
  inset_element(yeg.plot, left = -0.175, bottom = 0, 
                right = -0.025, top = 0.2, on_top=TRUE, align_to = "plot",
                clip=TRUE) +
  labs(title="EDMONTON") +
  theme(plot.title = element_text(hjust = 0.5, size=9, vjust=-1, face="bold"),
        panel.border = element_rect(colour = "black", fill=NA, size=1)) +
  inset_element(yyc.plot, left = -0.1, bottom = 0, 
                right = 0.20, top = 0.25, on_top=TRUE, align_to = "plot",
                clip=TRUE) +
  labs(title="CALGARY") +
  theme(plot.title = element_text(hjust = 0.5, size=9, vjust=-1, face="bold"),
        panel.border = element_rect(colour = "black", fill=NA, size=1)) +
  inset_element(yxe.plot, left = 0.125, bottom = 0, 
                right = 0.275, top = 0.15, on_top=TRUE, align_to = "plot",
                clip=TRUE) +
  labs(title="SASKATOON") +
  theme(plot.title = element_text(hjust = 0.5, size=9, vjust=-1, face="bold"),
        panel.border = element_rect(colour = "black", fill=NA, size=1)) +
  inset_element(yqr.plot, left = 0.225, bottom = 0, 
                right = 0.45, top = 0.175, on_top=TRUE, align_to = "plot",
                clip=TRUE) +
  labs(title="REGINA") +
  theme(plot.title = element_text(hjust = 0.5, size=9, vjust=-1, face="bold"),
        panel.border = element_rect(colour = "black", fill=NA, size=1)) +
  inset_element(ywg.plot, left = 0.375, bottom = 0, 
                right = 0.55, top = 0.175, on_top=TRUE, align_to = "plot",
                clip=TRUE) +
  labs(title="WINNIPEG") +
  theme(plot.title = element_text(hjust = 0.5, size=9, vjust=-1, face="bold"),
        panel.border = element_rect(colour = "black", fill=NA, size=1)) +
  inset_element(yyz.yhm.plot, left = 0.725, bottom = 0, 
                right = 1.15, top = 0.25, on_top=TRUE, align_to = "plot",
                clip=TRUE) +
  labs(title="TORONTO/\nHAMILTON") +
  theme(plot.title = element_text(hjust = 0.5, size=9, vjust=-1, face="bold"),
        panel.border = element_rect(colour = "black", fill=NA, size=1)) +
  inset_element(yul.plot, left = 1, bottom = 0, 
                right = 1.175, top = 0.175, on_top=TRUE, align_to = "plot",
                clip=TRUE) +
  labs(title="MONTREAL") +
  theme(plot.title = element_text(hjust = 0.5, size=9, vjust=-1, face="bold"),
        panel.border = element_rect(colour = "black", fill=NA, size=1)) +
  inset_element(yqb.plot, left = 0.95, bottom = 0.175, 
                right = 1.2, top = 0.375, on_top=TRUE, align_to = "plot",
                clip=TRUE) +
  labs(title="QUEBEC\nCITY") +
  theme(plot.title = element_text(hjust = 0.5, size=9, vjust=-1, face="bold"),
  panel.border = element_rect(colour = "black", fill=NA, size=1))

ggsave(filename="can_2019_50_pct_population_insets.png", plot=gg.50.insets, height=10, width=13)