Data Visualization #4–Bar plots with widely-dispersed data

A common issue when plotting numerical data is the problem of outliers. When working with data, the term outlier is often used in the statistical sense, referring to data values that are “far away” from the rest of the data (in statistics, this usually means values that lie several standard deviations from the mean). Outliers are especially problematic in ordinary bar plots when the minimum and maximum values are so far apart that it becomes difficult to represent all of the values visually.
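To make the “standard deviations from the mean” idea concrete, here is a minimal sketch in base R. The ten density values are made up for illustration (they are not the real district data), and the cut-off of two standard deviations is an arbitrary choice.

```r
# Made-up population densities (persons per km^2); NOT the real district data
dens <- c(0.1, 2, 9, 40, 120, 355, 800, 1500, 2400, 12600)

# z-score: how many standard deviations each value lies from the mean
z <- (dens - mean(dens)) / sd(dens)

# Flag values more than 2 standard deviations from the mean
# (the threshold of 2 is an assumption made for this sketch)
outlier_idx <- which(abs(z) > 2)

dens[outlier_idx]
```

With these values only the largest one is flagged. Note that a single extreme value also inflates the mean and the standard deviation, so z-scores are only a rough screen for heavily skewed data like these.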

For a real-life example, let’s go back to our British Columbia provincial electoral map data. As I demonstrated in my first data visualization, area-based (rather than population- or voter-based) maps are often misleading. The primary reason is that the electoral districts are nowhere near the same size and don’t have the same numbers of residents. In British Columbia, a large province (almost one million square kilometres in area), this is not a surprise, especially given how unevenly the relatively small population (just over five million) is dispersed across the province.

We can easily calculate the population density of each of BC’s 87 provincial electoral districts by dividing each district’s population by its area, which we can compute from the geographic data we used to create the maps in the first data visualization post.
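As a rough sketch of that calculation, the snippet below builds a single made-up rectangular “district” with the sf package and computes its density. The column names ED_NAME and Population are hypothetical, and the projected coordinate system (BC Albers, EPSG:3005, with units in metres) is an assumption; only Pop.Den.km2 is a name used in this post.

```r
library(sf)

# A made-up district: a 2 km by 1 km rectangle in BC Albers (EPSG:3005).
# The CRS is an assumption; its units are metres.
rect <- st_polygon(list(rbind(
  c(1200000, 400000), c(1202000, 400000),
  c(1202000, 401000), c(1200000, 401000),
  c(1200000, 400000)
)))

toy_sf <- st_sf(
  ED_NAME    = "Toy District",   # hypothetical column name
  Population = 5000,             # hypothetical column name and value
  geometry   = st_sfc(rect, crs = 3005)
)

# st_area() returns square metres for this CRS; convert to square kilometres
toy_sf$Area.km2    <- as.numeric(st_area(toy_sf)) / 1e6
toy_sf$Pop.Den.km2 <- toy_sf$Population / toy_sf$Area.km2

toy_sf$Pop.Den.km2
```

The same two lines at the end would apply to a full sf object with one row per district, provided its coordinate system is projected in metres.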

Here is a summary of the data (the variable is Pop.Den.km2):

(s1<-summary(bc_final_final$Pop.Den.km2))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    0.101     9.402   355.269  1587.483  2375.926 12616.797 

“Min.” and “Max.” are the minimum and maximum values, respectively, of the population density (persons per square kilometre) of BC’s 87 provincial electoral districts. We see a dramatic difference between the maximum and minimum values. In fact,

paste("The most densely-populated district is", round(s1[6]/s1[1], 0), "times as dense as the least densely-populated district.")

[1] "The most densely-populated district is 124551 times as dense as the least densely-populated district."

That is astounding, and if one were to simply plot these values on a bar chart, one would immediately recognize the difficulty with representing these data accurately. Let’s use a horizontal bar chart to demonstrate:

Here, we see that the larger numbers are so large, and the smaller numbers so comparatively small, that the lowest two dozen or so districts do not even seem to register. (When I first plotted this, I thought that I had made some sort of mistake and that the values at the bottom were missing. It turns out that the value represented by a single pixel was larger than the values of the districts at the bottom of the bar plot.)
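The pixel problem is easy to reproduce. The sketch below uses five made-up districts (not the real data) and assumes, purely for illustration, that the bars have 800 pixels of horizontal space. Any district whose density is smaller than the value of one pixel cannot be drawn as a visible bar.

```r
library(ggplot2)

# Made-up districts and densities; NOT the real BC data
toy <- data.frame(
  district    = c("Urban A", "Urban B", "Suburban C", "Rural D", "Rural E"),
  party       = c("NDP", "NDP", "BCLP", "BCLP", "NDP"),
  Pop.Den.km2 = c(12600, 2400, 355, 9.4, 0.1)
)

# Assumed width of the plotting area in pixels (an illustrative guess)
plot_width_px <- 800

# Density represented by a single pixel on a linear axis
px_value <- max(toy$Pop.Den.km2) / plot_width_px

# Districts whose bars are at least one pixel long
toy$visible <- toy$Pop.Den.km2 >= px_value

# Horizontal bar chart, coloured by winning party
gg.den <- ggplot(toy, aes(x = reorder(district, Pop.Den.km2),
                          y = Pop.Den.km2, fill = party)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c(BCLP = "#295AB1", NDP = "#ED8200")) +
  labs(x = NULL, y = "Persons per square kilometre") +
  theme_minimal()

gg.den
```

Here one pixel is worth 15.75 persons per square kilometre, so the two rural districts fall below the threshold and effectively vanish from the chart.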


This is obviously an issue–we don’t want to lose valuable information. There are alternative plots we could use, but we want to keep the information (political party) embodied in the various colours of the bar plot, so we’d like to find a bar plot solution. We’ll describe and assess two potential solutions in the next post in the series.

Data Visualization #2–Animations aid in Conveying Change

The first entry in my 30-day (it will actually be 30 posts over about 2 months) data visualization challenge argued that geographically-based electoral maps have many drawbacks as data visualization techniques. I demonstrated by using the results from the 2017 and 2020 British Columbia (BC) provincial elections as supporting evidence.

Although there were some significant political changes over the course of the two elections, these were poorly-represented by these maps. Only when we zoomed into the population centres of southwestern BC were we able to partially convey the changes that had occurred. We could have made our effort to convey the underlying movement in political party support between 2017 and 2020 a bit more obvious by using animated maps, rather than the static ones that were used.

When it comes to representing change over time, animated graphs can be very useful (as long as they aren’t too complicated and busy) and have clear advantages over static maps.
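As a minimal sketch of the idea, the snippet below animates a made-up three-district “map” between two elections using the gganimate package. It is not the code behind the maps in this post: the tiles stand in for district polygons, the data are invented, and the choice of transition_manual() (which simply switches frames without interpolating) is an assumption about one reasonable way to do this.

```r
library(ggplot2)
library(gganimate)

# Made-up results in long format: one row per district per election
toy <- data.frame(
  district = rep(c("A", "B", "C"), times = 2),
  x        = rep(1:3, times = 2),
  year     = rep(c("2017", "2020"), each = 3),
  winner   = c("BCLP", "BCLP", "NDP",
               "BCLP", "NDP",  "NDP")
)

# Static plot; tiles stand in for the district polygons
p <- ggplot(toy, aes(x = x, y = 1, fill = winner)) +
  geom_tile(colour = "black") +
  scale_fill_manual(values = c(BCLP = "#295AB1", NDP = "#ED8200")) +
  labs(title = "Election: {current_frame}") +
  theme_void()

# One frame per election year, with no interpolation between frames
anim <- p + transition_manual(year)

# Rendering needs a renderer such as the gifski package, so it is left
# commented out here:
# animate(anim, nframes = 2, fps = 1)
```

The key step is reshaping the data so that each election is a separate set of rows with a year column. With a real sf object, one would presumably swap geom_tile() for geom_sf().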

Below are the maps from the original post, now animated to show the changes over time more clearly. Here’s the map of the whole province:

The change between 2017 and 2020 is made clear by a jarring change in the map, where a bit more NDP-orange shows up, replacing the BCLP-blue (see the previous post for descriptions of the two parties). Otherwise, there doesn’t seem to be much change in the province overall.

We know, however, that the most drastic changes took place in a tiny area in the southwestern corner of the BC mainland. Let’s zoom in there to have a look.

We can now see more clearly the change in results (in terms of electoral districts won) between 2017 and 2020 in this populous region. The NDP (orange) won many seats in the eastern Vancouver suburbs that had not only been won by the BCLP in 2017 but had also been a bastion of support for the right-wing vote over many decades. In addition, the NDP candidate in the Victoria-area district of Oak Bay-Gordon Head won a seat that had previously been held by the former leader of the BC Green Party, Andrew Weaver (it’s the small piece of green that changes to orange in the eastern part of the lower orange horizontal band on the lower-left of the map). Are these changes the harbinger of a sea-change in BC provincial politics, or are they just an anomalous blip?

Going back to my original point about these types of maps being poor representations of the underlying change in voters’ preferences, we don’t know much about the level of support for the respective parties in any of these electoral districts. All that we do know, based on the “first-past-the-post” electoral system used by BC at the provincial level, is the party whose candidate finished with the most votes in each of these electoral districts. We don’t know whether a district newly won by the NDP candidate was won by one vote or by 10,000 votes. In future posts, I’ll present graphs that will allow us to answer this question visually.

Our next posts will focus on alternatives to the basic electoral geographic maps that we’ve used in these first two posts.

Data Visualization #1–Electoral Results Map

The data visualization with which I begin my 30-day challenge is a standard electoral map of the recently-completed British Columbia provincial election, the result of which is a solid (57 of 87 seats) majority government for the New Democratic Party, led by Premier John Horgan.

It’s a bit ironic that I begin with this type of map since, for a few reasons, I consider them to be poor representations of data. First, because electoral districts are mapped on the basis of territory (geography), they misrepresent and distort what they are purportedly meant to gauge–electoral support (by actual voters, not acreage) for political parties.

Though there are other pitfalls with basic electoral maps, I’ll highlight what I believe to be the second major issue with them. They take what is a multinomial concept–voter support for each of a number of political parties in a specific electoral district–and summarize it in a single data point–which of the many parties in that electoral district has “won” that district. Most of these maps provide no information about either a) the relative size of the winning party’s victory in that district, or b) how many other parties competed in that district and how well each of these parties did in that district.
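A small base R sketch shows how much information is discarded. The vote counts below are made up; the point is only that two very different contests collapse into the same “winner” label, and therefore the same colour on the map.

```r
# Made-up vote counts for two hypothetical districts
votes <- rbind(
  "District A" = c(BCLP = 9000, Green = 3000, NDP = 9001),
  "District B" = c(BCLP = 4000, Green = 2000, NDP = 14000)
)

# What the map shows: only the party with the most votes in each district
winner <- colnames(votes)[apply(votes, 1, which.max)]

# What the map hides: each party's share of the vote in each district ...
shares <- round(prop.table(votes, margin = 1), 3)

# ... and the margin of victory (winner's votes minus runner-up's votes)
margin <- apply(votes, 1, function(v) {
  s <- sort(unname(v), decreasing = TRUE)
  s[1] - s[2]
})

winner
shares
margin
```

Both districts would be coloured NDP-orange, even though one was won by a single vote and the other by 10,000.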

Although standard electoral maps provide some basic information about the electoral outcome (and it is undeniable that, in terms of determining who wins and runs government, it is the single most important piece of information), they are “information-poor”, and in future posts I’ll show how researchers have tried to make their electoral maps more information-rich.

But, first, here are some standard electoral maps for the last two provincial elections in British Columbia (BC)–May 2017 and October 2020. Like many jurisdictions in North America, BC consists of relatively densely-populated urban areas–the Lower Mainland and southern Vancouver Island–combined with sparsely-populated hinterlands–forests, mountains, and deserts. Moreover, there is a strong partisan split between these areas–with the conservative BC Liberal Party (BCLP–the story of why the provincial Liberal Party in BC is actually the home of BC’s conservatives is too long for this post) dominating in the hinterlands while the left-centre New Democratic Party (NDP) generally runs more strongly in the urban southwest of the province. In Canada, electoral districts are often referred to as “ridings” or “constituencies.”

If one were completely ignorant of BC’s provincial politics, one would assume, simply from a quick perusal of the maps above, that the “blue” party–the BC Liberal Party–was the dominant party in BC. In addition, it would seem that there was very little change in partisan support and electoral outcomes across the electoral districts over the course of the two elections. In fact, the BCLP lost 15 districts, all of which were won by the NDP. (The Green Party also lost one of the districts it had won to the NDP, for a total NDP gain of 16 districts, or seats in the provincial legislature, between 2017 and 2020.) This factual story of a substantial increase in NDP seats in the legislature is poorly conveyed by the maps above because the maps match partisanship to area and not to voters.
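The seat arithmetic can be checked with a simple cross-tabulation. The sketch below reconstructs one winner per district from the counts given in this post (15 BCLP-to-NDP flips, one Green-to-NDP flip, 57 NDP seats out of 87 in 2020). The 2017 seat totals of 43 BCLP, 41 NDP and 3 Green come from the published 2017 results rather than from this post, and the sketch assumes that no district changed hands in any other direction.

```r
# One entry per district, ordered so that the two vectors line up.
# 2017 totals (43 BCLP, 3 Green, 41 NDP) are from the published results,
# not from this post.
winner_2017 <- c(rep("BCLP", 43), rep("Green", 3), rep("NDP", 41))

# 2020: 15 BCLP seats and 1 Green seat flip to the NDP; assumes no other flips
winner_2020 <- c(rep("BCLP", 28), rep("NDP", 15),   # the 43 former BCLP seats
                 rep("Green", 2), rep("NDP", 1),    # the 3 former Green seats
                 rep("NDP", 41))                    # the 41 former NDP seats

# Rows: winner in 2017; columns: winner in 2020
flips <- table(winner_2017, winner_2020, dnn = c("won_2017", "won_2020"))
flips

# Net NDP gain between the two elections
ndp_gain <- sum(winner_2020 == "NDP") - sum(winner_2017 == "NDP")
ndp_gain
```

The off-diagonal cells of the table are the districts that changed hands, which is exactly the information the province-wide map struggles to convey.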

To repeat, in future posts I will demonstrate some methods researchers have used to mitigate the problem of area-based electoral maps, but for now I’ll show that once we zoom into the southwest corner of the province (where most of the population resides) a simple electoral map does do a better job of conveying the change in electoral fortunes of the BCLP and NDP over the last two elections. This is because there is a stronger link between area and population (voters) in these districts than in BC as a whole.

You can more easily see the orange NDP wave overtaking the population centres of the Lower Mainland (greater Vancouver area–upper left part of each map) and, to a lesser extent, southern Vancouver Island. Data visualization #2 will demonstrate how to create animated maps of the above, which more appropriately convey the nature of the change in each of the electoral districts over the two elections.

Here’s the R code that I used to create the two images in my post, using the ggplot2 package.

## Once you have created an sf object in R (which I have named bc_final_sf), the following commands will create the image above.
 
library(ggplot2)
library(patchwork)

## First plot--2017
gg.ed.1 <- ggplot(bc_final_sf) +
  geom_sf(aes(fill = Winner_2017), col="black", lwd=0.025) + 
  scale_fill_manual(values=c("#295AB1","#26B44F","#ED8200")) +
  labs(title = "May 2017") +
  theme_void() + 
  theme(legend.title=element_blank(),
        plot.title = element_text(hjust = 0.5, size=12, face="bold"),
        legend.position = "none")

## Second plot--2020
gg.ed.2 <- ggplot(bc_final_sf) +
  geom_sf(aes(fill = Winner_2020), col="black", lwd=0.025) + 
  scale_fill_manual(values=c("#295AB1","#26B44F","#ED8200")) +
  labs(title = "October 2020") +
  theme_void() +
  theme(legend.title=element_blank(),
        plot.title = element_text(hjust = 0.5, size=12, face="bold"),
        legend.position = "bottom")

## Combine the plots and do some annotation
gg.bc.comb.map <- gg.ed.1 + gg.ed.2 & theme(legend.position = "bottom") 
gg.bc.comb.map.final <- gg.bc.comb.map + plot_layout(guides = "collect") + 
  plot_annotation(
  title = "British Columbia Election Results \u2013 by Riding",
  theme = theme(plot.title = element_text(size = 16, hjust=0.5, face="bold"))
  )

gg.bc.comb.map.final    # to view the first image above

## For the maps of the Lower Mainland and southern Vancouver Island, the only difference is that we add the following line to each of the individual maps:

coord_sf(xlim = c(1140000,1300000), ylim = c(350000, 500000))  

## so, we get 

gg.ed.lmsvi.1 <- ggplot(bc_final_sf) +
  geom_sf(aes(fill = Winner_2017), col="black", lwd=0.075) + 
  coord_sf(xlim = c(1140000,1300000), ylim = c(350000, 500000)) + 
  scale_fill_manual(values=c("#295AB1","#26B44F","#ED8200")) +
  labs(title = "May 2017") +
  theme_void() + 
  theme(legend.title=element_blank(),
        plot.title = element_text(hjust = 0.5, size=10, vjust=3),  
        legend.position = "none")

‘Controlling’ for confounding variables graphically

As we’ve learned (ad nauseam), basing causal claims on a simple bivariate relationship is fraught with potential roadblocks. Even though there may be a strong, and statistically significant, relationship between an independent and dependent variable, if we haven’t controlled for potentially confounding variables, we cannot state with any measure of confidence that the putative relationship between the IV and DV is causal. We should always statistically control for any (and all) potentially confounding variables.

Additionally, it is often desirable to dig deeper into the data and find out whether the units of analysis are fundamentally different on the basis of some other variable. Below are two plots, each of which shows the relationship between margin of victory and electoral turnout (by electoral district) for the 2017 British Columbia provincial election. The first plot shows the simple bivariate relationship, while the second breaks that initial relationship down by political party (which party won the electoral district). It could conceivably be the case that the relationship between turnout and margin of victory varies across the values of political party. That is, the relationship may hold in those electoral districts where party A won, but not hold in those in which party B won.

We can see here that there is little evidence to suggest a difference in the relationship based on which party won the electoral district. Can you think of another ‘third’ variable that may cause the relationship between turnout and margin of victory to be systematically different across different values of that variable? What about rural-versus-urban electoral districts?

Here are the plots:
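For those who want to build plots like these themselves, here is a minimal ggplot2 sketch. The data are simulated (not the 2017 results), the variable names are hypothetical, and putting margin of victory on the horizontal axis is an assumption about which variable to treat as the IV.

```r
library(ggplot2)

# Simulated data for 60 hypothetical districts; NOT the 2017 results
set.seed(1)
toy <- data.frame(
  margin = runif(60, min = 1, max = 40),               # margin of victory (%)
  party  = rep(c("BCLP", "Green", "NDP"), each = 20)   # winning party
)
toy$turnout <- 65 - 0.2 * toy$margin + rnorm(60, sd = 4)  # turnout (%)

# Plot 1: the simple bivariate relationship
gg.biv <- ggplot(toy, aes(x = margin, y = turnout)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Margin of victory (%)", y = "Turnout (%)")

# Plot 2: the same relationship, with a separate fitted line per winning party
gg.party <- ggplot(toy, aes(x = margin, y = turnout, colour = party)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_colour_manual(values = c(BCLP = "#295AB1", Green = "#26B44F",
                                 NDP = "#ED8200")) +
  labs(x = "Margin of victory (%)", y = "Turnout (%)")

# A numerical counterpart: let the slope differ by party via an interaction
fit <- lm(turnout ~ margin * party, data = toy)

gg.biv
gg.party
coef(fit)
```

If the fitted lines in the second plot are roughly parallel (and the interaction coefficients are close to zero), there is little evidence that the relationship differs by winning party. Replacing party with a rural-versus-urban indicator would test the other suggestion above in the same way.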