Data Visualization #20—“Lying” with statistics

In teaching research methods courses in the past, a tool that I’ve used to help students understand the nuances of policy analysis is to ask them to assess a claim such as:

In the last 12 months, statistics show that the government of Upper Slovobia’s policy measures have contributed to limiting the infollcillation of ramakidine to 34%.

The point of this exercise is two-fold: 1) to teach them that the concepts we use in social science are almost always socially constructed, and that we should first understand the concept—how it is defined, measured, and used—before moving on to the next step of policy analysis. When the concepts used—in this case, infollcillation and ramakidine—are ones nobody has ever heard of (because I invented them), step 1 becomes obvious. How are we to assess whether a policy was responsible for something when we have zero idea what that something even means? Often, though, because the concept is a familiar one—homelessness, polarization, violence—we skip right past this step and focus on the next step (assessing the data).

2) The second point of the exercise is to help students understand that assessing the data (in this case, the 34% number) cannot be done adequately without context. Is 34% an outcome that was expected? How does that number compare to previous years and the situation under previous governments, or to the situation with similar governments in neighbouring countries? (The final step in the policy analysis would be to set up an adequate research design to determine the extent to which the outcome was attributable to policies implemented by the Upper Slovobian government.)

If there is a “takeaway” message from the above, it is that whenever one hears a numerical claim being made, first ask yourself questions about the claim that fill in the context, and only then proceed to evaluate the claim.

Let’s have a look at how this works, using a real-life example. During a recent episode of Real Time, host Bill Maher used his New Rules segment to admonish the public (especially its more left-wing members) for overestimating the danger to US society of the COVID-19 virus. He punctuated his point by using the following statistical claim:

Maher claims not only that the statistical fact that 78% of COVID-19 fatalities in the USA were among those assessed to be “overweight” means that the virus is not nearly as dangerous to the general US public as has been portrayed, but also that political correctness run amok is the reason that raising this issue (namely, which Americans are dying, and why) in public is verboten. We’ll leave aside the latter claim and focus on the statistic: 78% of those who died from COVID-19 were overweight.

Does the fact that more than 3-in-4 COVID-19 deaths in the USA were individuals assessed to have been overweight mean that the danger to the general public from the virus has been overhyped? Maher wants you to believe that the answer to this question is an emphatic ‘yes!’ But is it?

Whenever you are presented with such a claim, follow the steps above. In this case, that means 1) understanding what is meant by “overweight” and 2) comparing the statistical claim to some sort of baseline.

The first is relatively easy—the US CDC has a standard definition of “overweight”, which can be found here: https://www.cdc.gov/obesity/adult/defining.html. Assuming that the definition is applied consistently across the whole of the USA, we can move on to step 2. The first question to ask yourself is: is 78% low, high, or somewhere in between? Maher wants us to believe that the number is “high”, but is it really? Let’s look for some baseline data against which to compare the 78% statistic. The obvious comparison is the incidence of “overweight” in the general US adult population. Only with this data point in hand can we assess whether 78% is a high (or low) number. What do we find? Going back to the US CDC website, we find this: “Percent of adults aged 20 and over with overweight, including obesity: 73.6% (2017-2018).”
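For reference, the CDC’s definition is based on body-mass index (BMI), i.e., weight in kilograms divided by the square of height in metres, with “overweight, including obesity” corresponding to a BMI of 25 or above. A minimal R sketch of that classification (the function is my own invention, purely for illustration):

# CDC cut-offs: BMI >= 25 is "overweight (including obesity)";
# BMI >= 30 is "obese".
bmi.category <- function(weight.kg, height.m) {
  bmi <- weight.kg / height.m^2
  cut(bmi, breaks = c(-Inf, 25, 30, Inf),
      labels = c("Not overweight", "Overweight", "Obese"),
      right = FALSE)
}

bmi.category(c(70, 85, 100), 1.75)  # Not overweight, Overweight, Obese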

So, what can we conclude? The proportion of US adults dying from COVID-19 who are “overweight” (78%) is almost the same as the proportion of the US adult population that is “overweight” (73.6%). Put another way, the odds of randomly selecting a US adult who is overweight rather than one who is not overweight are 73.6/26.4≈2.79. If one were to randomly select an adult who died from COVID-19, one would be 78/22≈3.55 times more likely to select an overweight person than a non-overweight person. Ultimately, in the USA at least, as of the end of April overweight adults were dying from COVID-19 at a rate roughly equal to their share of the general adult US population.
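The arithmetic is easy to check in R (the percentages come from the CDC figure and Maher’s claim above):

pop.overweight   <- 0.736   # CDC: US adults 20+ who are overweight (2017-2018)
death.overweight <- 0.78    # Maher's claim about COVID-19 deaths

pop.overweight / (1 - pop.overweight)       # ~2.79: odds in the adult population
death.overweight / (1 - death.overweight)   # ~3.55: odds among COVID-19 deaths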

We can show this graphically via pie charts. For many reasons, the use of pie charts is generally frowned upon. But in this case, where there are only two categories—overweight and non-overweight—pie charts are a useful visualization tool that allows for easy visual comparison. Here are the pie charts, with the R code that produced them below:

Created by: Josip Dasović

We can clearly see that the proportion of COVID-19 deaths from each cohort—overweight, non-overweight—is almost the same as the proportion of each cohort in the general USA adult population. So, a bit of critical analysis of Maher’s claim shows that he is not making the strong case that he believes he is.

# Here is the required data frame
covid.df <- data.frame("ADULT"=rep(c("Overweight", "Non-overweight"),2), 
                       "Percentage"=c(0.736,0.264,0.78,0.22),
                       "Type"=rep(c("Total Adult Population","COVID-19 Deaths"),each=2))

library(ggplot2)

# Now the code for side-by-side pie charts:

ggpie.covid <- ggplot(covid.df, aes(x="", y=Percentage, group=ADULT, fill=ADULT)) +
  geom_bar(width = 1, stat = "identity") +
  scale_fill_manual(values=c("#33B2CC","#D71920"),name ="ADULT CATEGORY") + 
  labs(x="", y="", title="Percentage of USA Adults who are Overweight",
       subtitle="(versus percentage of USA COVID-19 deaths who were overweight)") + 
  coord_polar("y", start=0) + facet_wrap(~ Type) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid  = element_blank(),
        plot.title = element_text(hjust = 0.5, size=16, face="bold"),
        plot.subtitle = element_text(hjust=0.5, face="bold"))

ggsave(filename="covid19overweight.png", plot=ggpie.covid, height=5, width=8)


Data Visualization #19—Using panelView in R to produce TSCS (time-series cross-section) plots

As part of a project to assess the influence, or impact, of Canadian provincial governments’ ruling ideologies on provincial economic performance, I have created a time-series cross-section summary of my party ideology variable across provinces over time. A time-series cross-section research design is one in which there is variation across space (the cross-section) and also over time. The time units can literally be anything, although in comparative politics they are often years. The cross-section units can be countries, cities, individuals, states, or (in my case) provinces. Here is a snippet of the data structure in my dataset (a data frame in R):

canparty.df[1:20,c(2:4,23)]

Year                    Region                      Pol.Party party.ideology
1981                   Alberta Progressive Conservative Party              1
1981          British Columbia            Social Credit Party              1
1981                  Manitoba Progressive Conservative Party              1
1981             New Brunswick Progressive Conservative Party              1
1981 Newfoundland and Labrador Progressive Conservative Party              1
1981               Nova Scotia Progressive Conservative Party              1
1981                   Ontario Progressive Conservative Party              1
1981      Prince Edward Island Progressive Conservative Party              1
1981                    Quebec                Parti Quebecois              0
1981              Saskatchewan           New Democratic Party             -1
1982                   Alberta Progressive Conservative Party              1
1982          British Columbia            Social Credit Party              1
1982                  Manitoba           New Democratic Party             -1
1982             New Brunswick Progressive Conservative Party              1
1982 Newfoundland and Labrador Progressive Conservative Party              1
1982               Nova Scotia Progressive Conservative Party              1
1982                   Ontario Progressive Conservative Party              1
1982      Prince Edward Island Progressive Conservative Party              1
1982                    Quebec                Parti Quebecois              0
1982              Saskatchewan Progressive Conservative Party              1

To get the plot below, we use the R code at the bottom of this post. But a couple of notes: first, the year data are not in true date format. Rather, they are periods, which I have conveniently labelled as years. In other words, what is important for the analysis that I will do (the generalized synthetic control method) is to periodize the data. Second, because elections can occur at any point during the year, I had to make a decision as to which party is coded as having been in government that year.

Since my main goal is to assess economic performance, and because economic policies take time to be passed and implemented, I decided to use June 30th as the cutoff point. If a party was elected on or before that date, it is coded as having governed the province for that whole year. If the election was held on July 1st (or after), then the incumbent party is coded as having governed the province in the year of the election, and the new government is coded as having started its mandate the following year.
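Here is a minimal sketch of that coding rule in R; the function and its column names are hypothetical rather than taken from my actual dataset:

# Assign a governing party to an election year using a June 30th cutoff:
# elections held on or before June 30 count for that year; later elections
# leave the incumbent coded as governing for that year.
assign.gov.party <- function(election.date, winning.party, incumbent.party) {
  cutoff <- as.Date(format(election.date, "%Y-06-30"))
  ifelse(election.date <= cutoff, winning.party, incumbent.party)
}

assign.gov.party(as.Date("1982-04-26"), "PC", "NDP")  # "PC" (on/before cutoff)
assign.gov.party(as.Date("1982-11-02"), "PC", "NDP")  # "NDP" (after cutoff)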

Here’s the plot, and the R code below:

library(gsynth)
library(panelView)
library(ggplot2)

ggpanel1 <- panelView(Prop.seats.gov ~ party.ideology + Prov.GDP.Cap, data = canparty.df, 
          index = c("Region", "Year"), main = "Provincial Ruling Party Ideology", 
          legend.labs = c("Left", "Centre", "Right"), col=c("orange", "red", "blue"), 
          axis.lab.gap = c(2,0), xlab="", ylab="")
## I've used Prop.seats.gov and Prov.GDP.Cap b/c they are two
## of my IVs, but any other IVs could have been used to
## create the plot. The important part is the party.ideology
## variable and the two index variables--Region (province)
## and Year.

## Save the plot as a .png file

ggsave(filename="ProvRulingParty.png", plot=ggpanel1, height=8,width=7)

Data Visualization #16—Canadian Federal Equalization Payments per capita

My most recent post in this series analyzed data related to the federal equalization program in Canada using a lollipop plot made with ggplot2 in R. The data that I chose to visualize—annual nominal dollar receipts by province—give the reader the impression that over the last five-plus decades the province of Quebec (QC) has been by far the main recipient of these federal transfer funds. While this may be true, the plot also misrepresents the nature of these financial flows from the federal government to the provinces, because the data do not take into account the wide variation in population amongst the 10 provinces. For example, Prince Edward Island (PEI) as of 2019 has a population of about 156,000 residents, while Quebec has approximately 8.5 million, or about 55 times as many. That is to say, a better way of representing the provincial receipt of equalization funds is to calculate an annual per capita (i.e., for every resident) value, rather than a provincial total.

For the lollipop chart below, I’ve not only calculated an annual per-capita measure of the amount of money received by each province, I’ve also controlled for inflation, since a dollar in 1960 was worth a lot more (and could buy many more resources) than a dollar today. Using Canadian GDP deflator data compiled by the St. Louis Federal Reserve, I’ve created the plotting variable—annual real per-capita federal equalization receipts by province, with a base year of 2014. Here, the message of the plot is no longer Quebec’s dominance but a story in which Canadians (regardless of where they live) are treated relatively equally. Of course, in any given year, Canadians in some provinces receive no equalization payments at all.
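For reference, here is a sketch of the per-capita and inflation adjustments. It assumes hypothetical columns for nominal receipts, provincial population, and a GDP deflator indexed so that 2014 = 100 (only real.value.per.cap is an actual column in my data frame):

# Real (2014 $) per-capita receipts:
# real = (nominal / population) * (100 / deflator), where deflator = 100 in 2014
eq.pop.df$value.per.cap      <- eq.pop.df$nominal.value / eq.pop.df$population
eq.pop.df$real.value.per.cap <- eq.pop.df$value.per.cap * 100 / eq.pop.df$deflator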

Here’s the plot, and the R code below it:

Source: https://open.canada.ca/data/en/dataset/4eee1558-45b7-4484-9336-e692897d393f
require(ggplot2)
require(gganimate) 

gg.anim.lol3 <- ggplot(eq.pop.df[eq.pop.df$Year!="2019-20",], aes(x=province, y=real.value.per.cap, label=real.value.per.cap.amt)) + 
  geom_point(stat='identity', size=14, aes(col=as.factor(zero.dummy))) +
  scale_color_manual(name="zero.dummy",
                     values = c("0"="#000000", "1"="red")) +
  labs(title="Per Capita Federal Equalization Entitlements (by Province): {closest_state}",
       subtitle="(Real $ CAD—2014 Base Year)",
       x=" ", y="$ CAD (Real—2014 Base Year)") +
  geom_segment(aes(y = 0, 
                   x = province, 
                   yend = real.value.per.cap, 
                   xend = province), 
               color = "red",
               size=1.5) +
  scale_y_continuous(breaks=seq(0,3000,500)) +
  theme(legend.position="none",
        plot.title =element_text(hjust = 0.5, size=23),
        plot.subtitle =element_text(hjust = 0.5, size=19),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        axis.text.y =element_text(size = 14),
        axis.text.x=element_text(vjust=0.5,size=16, colour="black")) +
  geom_text(color="white", size=4) +
  transition_states(
    Year,
    transition_length = 1,
    state_length = 9
  ) +
  enter_fade() +
  exit_fade()

animate(gg.anim.lol3, nframes = 610, fps = 10, width=800, height=680, renderer=gifski_renderer("equal_real_per_cap_lollipop.gif"))  

Data Visualization #15—Canadian Federal Equalization Payments over time using an animated lollipop graph

There is likely no federal-provincial political issue that stokes more anger amongst Albertans (and is as misunderstood) than equalization payments (entitlements) from the Canadian federal government to the country’s 10 provinces. Although some form of equalization has always been a part of the federal government’s policy arsenal, the current equalization program was initiated in the late 1950s, with the goal of providing, or at least helping to achieve, an equal playing field across the country in terms of basic levels of public services. As Professor Trevor Tombe notes:

Regardless of where you live, we are committed (indeed, constitutionally committed) to ensure everyone has access to “reasonably comparable levels of public services at reasonably comparable levels of taxation.”

Finances of the Nation, Trevor Tombe

For more information about the equalization program, read Tombe’s article and the links he provides to other sources. The most basic misunderstanding of the program is the belief that while some provinces receive payments from the federal government (the ‘have-nots’), the other, more prosperous, provinces (the ‘haves’) are the source of these payments. You often hear the phrase “Alberta sends X $billion to Quebec every year!” That’s not the case. The funds are generated from federal revenues (mostly income tax) and disbursed from this same pool of resources. The ‘have’ provinces don’t “send money” to other provinces. The federal government collects tax revenue from all individuals, and if a province has a higher proportion of high-earning workers than another, it will generally receive less back from the federal government than its workers send to Ottawa. (To reiterate, read Tombe for more about the particulars.)

Using data provided by the Government of Canada, I have decided to show the federal equalization outlays over time using what is called a lollipop chart. I could have used a bar chart, but I like the way the lollipop chart looks. Here’s the chart and the R code below:

Created by Josip Dasović

The data source is here: https://open.canada.ca/data/en/dataset/4eee1558-45b7-4484-9336-e692897d393f, and I am using the table called Equalization Entitlements.

N.B.: The original data, for some reason, had Alberta abbreviated as AL, so I had to edit my final data frame and the gif.

## You'll need these two libraries
require(ggplot2)
require(gganimate)

gg.anim.lol1 <- ggplot(melt.eq.df, aes(x=variable, y=value, label=amount)) + 
  geom_point(stat='identity', size=14, aes(col=as.factor(zero.dummy))) +
  scale_color_manual(name="zero.dummy",
                     values = c("0"="#000000", "1"="red")) +
  labs(title="Canada—Federal Equalization Entitlements (by Province): {closest_state}",
       x=" ", y="Millions of nominal $ (CAD)") +
  geom_segment(aes(y = 0, 
                   x = variable, 
                   yend = value, 
                   xend = variable), 
               color = "red",
               size=1.5) +
  scale_y_continuous(breaks=seq(0,15000,2500)) +
  theme(legend.position="none",
        plot.title =element_text(hjust = 0.5, size=23),
        axis.title.x = element_text(size = 16),
        axis.title.y = element_text(size = 16),
        axis.text.y =element_text(size = 14),
        axis.text.x=element_text(vjust=0.5,size=16, colour="black")) +
  geom_text(color="white", size=4) +
  transition_states(
    Year,
    transition_length = 2,
    state_length = 8
  ) +
  enter_fade() +
  exit_fade()

animate(gg.anim.lol1, nframes = 630, fps = 10, width=800, height=680, renderer=gifski_renderer("equal.lollipop.gif"))  

Data Visualization #12—Using Roulette to Deconstruct the ‘Climate is not the Weather’ response to climate “deniers”

If you are at all familiar with the politics and communication surrounding the global warming issue you’ll almost certainly have come across one of the most popular talking points among those who dismiss (“deny”) contemporary anthropogenic (human-caused) climate change (I’ll call them “climate deniers” henceforth). The claim goes something like this:

“If scientists can’t predict the weather a week from now, how in the world can climate scientists predict what the ‘weather’ [sic!] is going to be like 10, 20, or 50 years from now?”

Notably, the statement does possess a prima facie (i.e., “commonsensical”) claim to plausibility–most people would agree that it is easier (other things being equal) to make predictions about things that are closer in time to the present than about things that happen well into the future. We have a fairly good idea of the chances that the Vancouver Canucks will win at least half of their games for the remainder of the month of March 2021. We have much less knowledge of how likely the Canucks will be to win at least half their games in February 2022, February 2025, or February 2040.

Notwithstanding the preceding, the problem with this denialist argument is that it relies on a fundamental misunderstanding of the difference between climate and weather. Here is an extended excerpt from the US NOAA:

We hear about weather and climate all of the time. Most of us check the local weather forecast to plan our days. And climate change is certainly a “hot” topic in the news. There is, however, still a lot of confusion over the difference between the two.

Think about it this way: Climate is what you expect, weather is what you get.

Weather is what you see outside on any particular day. So, for example, it may be 75° degrees and sunny or it could be 20° degrees with heavy snow. That’s the weather.

Climate is the average of that weather. For example, you can expect snow in the Northeast [USA] in January or for it to be hot and humid in the Southeast [USA] in July. This is climate. The climate record also includes extreme values such as record high temperatures or record amounts of rainfall. If you’ve ever heard your local weather person say “today we hit a record high for this day,” she is talking about climate records.

So when we are talking about climate change, we are talking about changes in long-term averages of daily weather. In most places, weather can change from minute-to-minute, hour-to-hour, day-to-day, and season-to-season. Climate, however, is the average of weather over time and space.

The important message to take from this is that while the weather can be very unpredictable, even at time-horizons of only hours, or minutes, the climate (long-term averages of weather) is remarkably stable over time (assuming the absence of important exogenous events like major volcanic eruptions, for example).

Although weather forecasting has become more accurate over time with the advance of meteorological science, there is still a massive amount of randomness that affects weather models. The difference between a major snowstorm, or clear blue skies with sun, could literally be a slight difference in air pressure, or wind direction/speed, etc. But, once these daily, or hourly, deviations from the expected are averaged out over the course of a year, the global mean annual temperature is remarkably stable from year-to-year. And it is an unprecedentedly rapid increase in mean annual global temperatures over the last 250 years or so that is the source of climate scientists’ claims that the earth’s temperature is rising and, indeed, is currently higher than at any point since the beginning of human civilization some 10,000 years ago.

Although the temperature at any point and place on earth in a typical year can vary from as high as the mid-50s degrees Celsius to as low as the -80s degrees Celsius (a range of some 130 degrees Celsius) the difference in the global mean annual temperature between 2018 and 2019 was only 0.14 degrees Celsius. That incorporates all of the polar vortexes, droughts, etc., over the course of a year. That is remarkably stable. And it’s not a surprise that global mean annual temperatures tend to be stable, given the nature of the earth’s energy system, and the concept of earth’s energy budget.

In the same way that earth’s mean annual temperatures tend to be very stable (accompanied by dramatic inter-temporal and inter-spatial variation), we can see that the collective result of many repeated spins of a roulette wheel is analogously stable (with similarly dramatic between-spin variation).

A roulette wheel has 38 numbered slots–36 of which are split evenly between red slots and black slots–numbered from 1 through 36–and (in North America) two green slots which are numbered 0, and 00. It is impossible to determine with any level of accuracy the precise number that will turn up on any given spin of the roulette wheel. But, we know that for a standard North American roulette wheel, over time the number of black slots that turn up will be equal to the number of red slots that turn up, with the green slots turning up about 1/9 as often as either red or black. Thus, while we have no way of knowing exactly what the next spin of the roulette wheel will be (which is a good thing for the casino’s owners), we can accurately predict the “mean outcome” of thousands of spins, and get quite close to the actual results (which is also a good thing for the casino owners and the reason that they continue to offer the game to their clients).

Below are two plots–the upper plot is an animated plot of each of 1000 simulated random spins of a roulette wheel. We can see that the value of each of the individual spins varies considerably–from a low of 0 to a high of 36. It is impossible to predict what the value of the next spin will be.

The lower plot, on the other hand, is an animated plot, the line of which represents the cumulative (i.e., “running”) mean of 1000 random spins of a roulette wheel. We see that for the first few random rolls of the roulette wheel the cumulative mean is relatively unstable, but as the number of rolls increases the cumulative mean eventually settles down to a value that is very close to the ‘expected value’ (on a North American roulette wheel) of 17.526. The expected value* is simply the sum of all of the individual slot values (0, 0, and 1 through 36, which total 666) divided by the total number of slots, which is 38. As we spin and spin the roulette wheel, the values from spin to spin may be dramatically different. Over time, though, the mean value of these spins converges on the expected value of 17.526. From the chart below, we see that this is the case.

Created by Josip Dasović
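For those who want to replicate the experiment, the spins and the running mean can be generated in a few lines of R; this is a minimal sketch, with the animation layer omitted:

# Simulate 1000 spins of a North American roulette wheel.
# The two green slots (0 and 00) both count as a value of 0 here.
set.seed(2021)                     # for reproducibility
slots <- c(0, 0, 1:36)
spins <- sample(slots, 1000, replace = TRUE)

mean(slots)                        # expected value: 666/38 = 17.526
cum.mean <- cumsum(spins) / seq_along(spins)
tail(cum.mean, 1)                  # running mean after 1000 spins: ~17.5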

Completing the analogy to weather (and climate) prediction, on any given spin our ability to predict what the next spin of the roulette wheel will be is very low. [The analogy isn’t perfect because we are a bit more confident in our weather predictions given that the process is not completely random–it will be more likely to be cold and to snow in the winter, for example.] But, over time, we can predict with a high degree of accuracy that the mean of all spins will be very close to 17.526. So, our inability to predict short-term events accurately does not mean that we are not able to predict long-term events accurately. We can, and we do. In roulette, and for the climate as well.

TLDR: Just because a science can’t predict something short-term does not mean that it isn’t a science. Google quantum physics and randomness and you’ll understand what Einstein was referring to when he quipped that “God does not play dice.” Maybe she’s a roulette player instead?

* Note: This is not the same as the expected dollar value of a bet, given that casinos set pay-off matrices that are advantageous to themselves.

Data Visualization #9—Non-ideal use of Stacked Bar Plots

Stacked bar plots (charts) are a very useful data visualization type…when used correctly. In an otherwise excellent report on the “Escalating Terrorism Problem in the United States” from the Center for Strategic and International Studies, there is a problematic stacked bar chart (actually, a stacked percentage chart) that should have been replaced by a grouped bar chart (or something else). Here is the, in my opinion, problematic chart:

The reason I believe this chart is problematic is that it can obscure the nature (and trend) of the underlying data. The chart above is consistent with any number of underlying data patterns. As an example, let’s look at 2019 and 2020. We have the following percentage breakdown over the two years:

Type of Violence     2019    2020
Ethnonationalist       3%      0%
Left-wing              4%      0%
Other                  0%      0%
Religious             30%      7%
Right-wing            63%     93%

While it is obvious that ethnonationalist and left-wing violence decreased (each is at 0% in 2020), it is not clear whether right-wing and religious violence increased or decreased in absolute terms. Does right-wing violence in 2020 comprise 93% of 14 acts of terrorist violence? Or is it 93% of 200 acts? We don’t know. To be fair to the authors of the report, they do provide a breakdown in absolute numbers later in the report. Still, I believe that a stacked bar/percentage chart is more appropriate when the absolute number of instances is (relatively) static over the time/area of comparison.

Here’s an example from college football. The Pacific-12 conference has two divisions–North and South. Each year, every one of the 6 teams in each division plays against 4 of the teams in the other division, for a total of 24 inter-divisional games. In addition, there is a PAC12 Championship Game, which pits the winners of the two divisions against each other at the end of the year. Therefore, there are 25 total inter-divisional PAC12 football games every year. A stacked percentage chart can be used to gauge the relative winning percentages of the two divisions against each other since the establishment of the PAC12 conference in 2011 (when Utah and Colorado were added).

Created by Josip Dasović

Here, each of the years refers to a total of 25 inter-divisional games. We can easily gauge the relative quality of the two divisions by comparing the percentage of inter-divisional games won by each between 2011 and 2019. We see that the North (which, by the way, produced 8 of the 9 PAC12 champions during this period) has generally been stronger. In 6 of the 9 years, the North won a greater percentage of the inter-divisional games than did the South. And even in those years when the South won a greater percentage, its margin was slim.

So, use stacked percentage charts only when it is appropriate.
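If you’d like to reproduce this kind of chart, here is a minimal ggplot2 sketch. The win totals below are invented for illustration; they are not the actual PAC12 results shown above:

library(ggplot2)

# Illustrative data only: inter-divisional games won by each division
# (out of the 25 played each year).
pac12.df <- data.frame(Year     = rep(2011:2013, each = 2),
                       Division = rep(c("North", "South"), 3),
                       Wins     = c(14, 11, 15, 10, 12, 13))

# position = "fill" stacks each year's bars so they sum to 100%.
ggplot(pac12.df, aes(x = factor(Year), y = Wins, fill = Division)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "", y = "Share of inter-divisional games won")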

Data Visualization #7—Treemaps using US Counties and 2016 Presidential Vote

While we’re still waiting on the availability of official county-level results for the 2020 US Presidential Election*, I thought I’d create a treemap of the county-level results from the 2016 election. You may be thinking to yourself, “What is a treemap?”

Treemaps are ideal for displaying large amounts of hierarchically structured (tree-structured) data. The space in the visualization is split up into rectangles that are sized and ordered by a quantitative variable.

Link to Source

Treemaps, therefore, can help us visualize the relationships within our quantitative data in a unique, visually-pleasing, and effective manner. Let’s see how, using the example of the 2016 US Presidential Election.

Here’s a picture of then newly-elected President Donald Trump looking at a map given to him by his advisers depicting the results of the 2016 election. This particular depiction overstates the extent of support for Trump across the USA in the 2016 election. As those in the know often say, “land mass does not vote.” Indeed, if one were ignorant of US politics and US political demography, looking at that map one would most likely be perplexed to learn that the “blue” candidate actually won 3 million more votes than did the “red” candidate.

Here is my reproduction of these data–using publicly-available data from MIT Election Data and Science Lab, 2018, “County Presidential Election Returns 2000-2016”, https://doi.org/10.7910/DVN/VOQCHQ, Harvard Dataverse, V6, UNF:6:ZZe1xuZ5H2l4NUiSRcRf8Q== [fileUNF]. I’ve added the R code at the end of this post.

We can see that the vast majority of counties are small, and that voters in these counties were more likely to have voted for Trump than for Clinton. Indeed, Clinton won fewer than 16% of all counties.

The problem with this map is that it essentially dichotomizes quantitative data into qualitative data. To be precise, the decision whether to colour a county blue or red is made simply on the basis of whether, among those who voted, more voted for Trump or for Clinton. If a county voted 51-50 for Trump, it gets coloured red. If a county voted 1,000,000-100,000 for Clinton, it gets coloured blue. And, to make things even more confusing, the amount of red or blue each county contributes to the map is related ONLY to county land area, and takes no account of the number of voters.
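In code, the dichotomization amounts to a one-liner. A sketch (the vote-count column names here are assumptions; winner is the fill variable used in the mapping code at the end of this post):

# All margin information is discarded: a 51-50 squeaker and a
# 1,000,000-100,000 blowout get exactly the same colour.
us_df_final_2163$winner <- ifelse(us_df_final_2163$dem.votes > us_df_final_2163$rep.votes,
                                  "Democrat", "Republican")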

As is the case in many parts of the world today, the US is increasingly split demographically–with those living in rural areas (and suburbs/exurbs) voting for the conservative party (Republican) and those in urban areas voting for the liberal party (Democratic). We see this clearly in the map above. The problem with US counties is that they are uniform neither in land area nor in population. There are apartment buildings in New York City and Los Angeles that have more residents than some counties.

We can use treemaps to more “accurately” depict electoral outcomes. By accurately, I mean that the visual representation of the data more closely reflects how many voted for each candidate (party).

The first example below represents the vote at the county level and encodes two quantitative variables. The size of each rectangle represents the total number of voters in each county–the larger the rectangle, the greater the number of voters in that county. The second variable, which is mapped to the colour scale, represents the difference in raw vote totals between the two candidates. Reddish shades denote counties won by Trump, while bluish shades denote counties won by Clinton.

There are a couple of things to notice. First, there is a wide disparity in the total number of voters across the counties. Second, most of the counties have shades that are only very lightly blue (or red) and look mostly white. This is because the range of the colour scale must be expansive enough to include outliers like Los Angeles and Cook counties. Thus, in the vast majority of US counties, the raw vote-total differences between Trump and Clinton are in the thousands. This is why Trump was able to win more than 84% of US counties and still lose the popular vote by more than 3 million.

Our next (and final) treemap is similar to the one above, except that the colour scale is based not on the raw vote difference between Trump and Clinton in each county, but on the percentage-point differential in vote between the two candidates.

We see much more red and blue in this map because the scale is confined to the range from a 100% Trump win to a 100% Clinton win. Notice the striking disparity in where the blue and red colours, respectively, are found. The reddish shades dominate in small-population counties (in the top-right corner of each state subgroup), while the bluish shades dominate in large-population counties (in the bottom-left corner of each state subgroup). Finally, the larger (greater-population) counties tend to be much smaller geographically than the less-populous counties, which is why the map on Trump’s desk looks like it does.

# Libraries needed below; us_df_final_2163 is the county-level spatial
# (sf) data frame of results created earlier.
library(ggplot2)
library(sf)

gg.geom.uscounty <- ggplot(us_df_final_2163) +
        geom_sf(aes(fill = winner), col="black", lwd=0.1) + 
        scale_fill_manual(values=c("blue","red"), labels=c("Clinton","Trump"), breaks=c("Democrat","Republican")) + # breaks...to get rid of NA
        labs(title = "US 2016 Presidential Election Results by County ('Lower 48')") +
        theme_void() + 
        coord_sf(xlim = c(-1900000,2400000), ylim = c(-2050000, 625000)) +
        theme(legend.title=element_blank(),
              legend.text = element_text(size = 12),
              plot.title = element_text(hjust = 0.5, size=16, vjust=2),
              legend.position = "bottom",
              plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"),
              legend.box.margin = margin(0,0,30,0),
              legend.key.size = unit(0.75, "cm"))

gg.geom.uscounty

R Code for treemaps (this uses the “total vote” variable; replace that variable with a “percentage-vote” variable–with appropriate limits and breaks (-100, 100)–because you are then working with percentages):

# The treemap geoms (geom_treemap() etc.) come from the treemapify package.
library(ggplot2)
library(treemapify)

gg.tree.tot <- ggplot(us_df_final, aes(area = totalvote, fill = vote_win_diff, label=NAME, subgroup=State.Name)) +
        geom_treemap() +
        geom_treemap_subgroup_border(colour="black", size=2) +
        geom_treemap_subgroup_text(place = "centre", grow=F, alpha = 0.5, colour =
                                           "black", fontface = "italic", min.size = 0) +
        geom_treemap_text(colour = "black", place = "center", reflow = T) +
        scale_fill_distiller(type = "div", palette=5, direction=1, guide="coloursteps", limits=c(-2000000,2000000), breaks=seq(-2000000,2000000, by=500000),
        labels=c("2000000","1500000","1000000","500000","0","500000","1000000","1500000","2000000")) +
        labs(title = "US 2016 Presidential Election by County (Areas Proportional to Total Votes in County)",
             fill="Difference\u2013County Vote Totals between Trump (red) & Clinton (blue)") + 
        theme(legend.key.height = unit(0.75, 'cm'),
              legend.key.width = unit(2.35,"cm"),
              legend.text = element_text(size=8),
              plot.title = element_text(hjust = 0.5, size=14, vjust=1),
              legend.position = "bottom") +
        guides(fill = guide_coloursteps(title.position="top", title.hjust = 0.5),
               size = guide_legend(title.position="top", title.hjust = 0.5))    

* The electoral process that determines who becomes president of the United States is complicated. In effect, it is a series of elections run by individual states, rather than a single federally-run election, as is the case in most presidential systems.

Data Visualization #4—Bar plots with widely-dispersed data

A common issue when trying to plot numerical data is the problem of outliers. When working with data, the term outlier is often used in the statistical sense, referring to data values that are “far away” from the rest of the data (in statistics, this usually means values that lie a number of standard deviations away from the mean). This can be especially problematic with common bar plots, especially when the minimum and maximum values are so far apart that it becomes difficult to represent all of the values visually.

For an example of this in real life, let’s go back to our British Columbia provincial electoral map data. As I demonstrated in my first data visualization, area-based (rather than population-, or voter-based) maps are often misleading. The primary reason for this is that the electoral districts are not nearly the same size and don’t have the same numbers of residents. In British Columbia, a large province (almost one million square kilometres in area), this is not a surprise, especially because of the manner in which the relatively small population (just over five million) is haphazardly dispersed across the province.

We can easily calculate the population density of each of BC’s 87 provincial electoral districts, using data on district population size and calculating the area of each district from the geographic data we used to create the maps in the first data visualization post.

Here is a summary of the data (the variable is Pop.Den.km2):

(s1<-summary(bc_final_final$Pop.Den.km2))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    0.101     9.402   355.269  1587.483  2375.926 12616.797 

The “Min.” and “Max.” are the minimum, and maximum value, respectively, of the population density (persons per square kilometre) of BC’s 87 provincial electoral districts. We see a dramatic difference between the maximum and minimum values. In fact,

paste("The most densely-populated district is ", round(s1[6]/s1[1],0), "times as dense as the least densely-populated district.")

[1] "The most densely-populated district is 124551 times as dense as the least densely-populated district."

That is astounding, and if one were to simply plot these values on a bar chart, one would immediately recognize the difficulty with representing these data accurately. Let’s use a horizontal bar chart to demonstrate:

Here, we see that the larger numbers are so large, and the smaller numbers so comparatively small, that the lowest two dozen or so districts do not even seem to register. (When I first plotted this, I thought that I had made some sort of mistake and that the values at the bottom were missing. It turns out that the value represented by a single pixel was larger than the values of the districts at the bottom of the bar plot.)
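For reference, a chart like the one above can be produced along the following lines. This is a sketch only: Pop.Den.km2 is the actual variable in my bc_final_final data frame, but the district-name column (ED_NAME here) is an assumption, and the party colouring is omitted:

library(ggplot2)

# Horizontal bar chart of population density by electoral district.
# With data this widely dispersed, the shortest bars span less than a pixel.
ggplot(bc_final_final,
       aes(x = reorder(ED_NAME, Pop.Den.km2), y = Pop.Den.km2)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "", y = "Population density (persons per square km)")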


This is obviously an issue–we don’t want to lose valuable information. There are alternative plots we could use, but we want to keep the information (political party) embodied in the various colours of the bar plot, so we’d like to find a bar plot solution. We’ll describe and assess two potential solutions in the next post in the series.

Using R to help simulate the NHL Draft Lottery

Upon discussing the NHL game results file, I mentioned to a few of you that I have used R to build an NHL draft lottery simulator. It’s quite simple, although you do have to install the XML package, which allows us to use R to ‘scrape’ websites. We use this functionality to create the lottery simulator dynamically, based on the previous evening’s (or afternoon’s) game results.

Here’s the code (remember to un-comment the install.packages(“XML”) command the first time you run the simulator). Copy and paste this code into your R console, or save it as an R script file and run it as source.

# R code to simulate the NHL Draft Lottery
# The current draft order of teams obviously changes on a
# game-to-game basis. We have to create a vector of teams in order
# from 31st to 17th place that can be updated on a game-by-game
# (or dynamic) basis.

# To do this, we can use R's ability to interrogate, scrape,
# and parse web pages.

#install.packages("XML") # NOTE: Uncomment and install this
#                                package before running this
#                                script the first time.

require(XML) # We need this for parsing of the html code

url <- ("http://nhllotterysimulator.com/") #retrieve the web page we are using as the data source
doc <- htmlParse(url) #parse the page to extract info we'll need.

# From investigation of the web page's source code, we see that the
# team names can be found in the element [td class="text-left"]
# and the odds of each team winning the lottery are in the
# element [td class="text-right"]. Without this
# information, we wouldn't know where to tell R to find the elements
# of data that we'd like to extract from the web page.
# Now we can use xml to extract the data values we need.

result.teams <- unlist(xpathApply(doc, "//td[contains(@class,'text-left')]",xmlValue)) #unlist used to create vector
result.odds <- unlist(xpathApply(doc, "//td[contains(@class,'text-right')]",xmlValue))

# The teams elements are returned as strings (character), which is
# appropriate. Also only non-playoff teams are included, which makes
# it easier for us. The odds elements are returned as strings as
# well (and percentages), which is problematic.
# First, we have 31 elements (the values of 16 of which--the playoff
# teams --are returned as missing). We only want 15 (the non-playoff
# teams).
# Second, in these remaining 15 elements we have to remove the
# "%" character from each.
# Third, we have to convert the character format to numeric.
# The code below does the clean-up. 

result.odds <- result.odds[1:15]
result.odds <- as.numeric(gsub("%"," ",result.odds)) #remove % symbol
teamodds.df <- data.frame("teams"=result.teams[1:15],"odds"=result.odds, stringsAsFactors=FALSE) #Create data frame for easier display 

# Let's print a nice table of the teams, with up-to-date
# corresponding odds. 

print(teamodds.df) # odds are out of 100 

#Now, let's finally 'run' the lottery, and print the winner's name.

cat("The winner of the 2018 NHL Draft Lottery is the:", sample(teamodds.df$team,1,prob=teamodds.df$odds),sep="") 


Domestic Emissions Targets for Greenhouse Gases and China

This week, we begin to address the politics of climate change. In the chapter from the Stevenson text, the author addresses the rise of two international norms related to mitigating the impact of global warming: 1) common but differentiated responsibilities (CBDR), and 2) mitigation in the form of domestic emissions targets.

Stevenson argues that international negotiations regarding mitigation have slowly transitioned from a focus on domestic emissions targets to global ones. Correspondingly, the institutional framework for implementing these goals has moved from regulatory (domestic governments) to market-oriented. China and the United States have been the main promoters (and would also be the main beneficiaries) of the market-oriented approach to GHG mitigation. We’ll discuss why during this week’s seminar, but in short, high-emitting countries can use carbon trading schemes to offload their emissions to low-emitting countries, resulting in no drop in global GHG emissions.

In an interesting story on China’s setting up of a domestic carbon market, which was set to begin trading in 2016, we find something notable. First, here’s a description of the proposed Chinese carbon market:

China plans to roll out its national market for carbon permit trading in 2016, an official said Sunday, adding that the government is close to finalising rules for what will be the world’s biggest emissions trading scheme.

The world’s biggest-emitting nation, accounting for nearly 30 percent of global greenhouse gas emissions, plans to use the market to slow its rapid growth in climate-changing emissions.

What caught my eye, however, was the next line:

China has pledged to reduce the amount of carbon it emits per unit of GDP to 40-45 percent below 2005 levels by 2020.

In an informal (convenience-sample) survey of some friends and acquaintances, the impression (almost unanimously shared) was that China would be cutting its GHG emissions dramatically by 2020. Unfortunately, that is not the case.

The key words in the excerpt quoted above are “per unit of GDP.” Because China’s GDP is expected to at least double by 2020 (relative to the 2005 base year), China could conceivably meet its target of a 40-45-percent cut in emissions per unit of GDP even with as much as a doubling of actual (absolute) GHG emissions!
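To see why, note that absolute emissions equal emissions intensity times GDP. A quick check in R, indexing both to 2005 = 1 (the GDP figures are assumptions for illustration):

# absolute emissions = emissions intensity x GDP (both indexed to 2005 = 1)
intensity.2020 <- c(0.60, 0.55)   # 40% and 45% below 2005 intensity
gdp.2020 <- 2                     # assume GDP exactly doubles by 2020
intensity.2020 * gdp.2020         # 1.20 1.10: absolute emissions RISE 10-20%

2 / 0.55                          # ~3.64: the GDP growth at which a 45%
                                  # intensity cut coexists with a doubling
                                  # of absolute emissions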