Data Visualization #3–Cartograms as an alternative to standard area-based electoral maps

In my first post of this series I explained at length why basic geographically-based electoral maps are not very good at conveying the phenomena of interest (see that post for more detail), and alluded to the increased use of political geographers, and political scientists, of alternative methods of “mapping” the required information that were more clear about the message(s) contained in the data.

Let’s examine this further using the map above. This map shows the results of the Canadian federal (national) election of October 2019.The respective proportions of total area “won” by each political party as depicted in the map above are not easily translated into either the relative vote share of the parties, or the relative number of seats won. Someone ignorant about Canadian federal politics would see a relatively similar total amount of red, blue, and orange, and assume that these parties had relatively equal support across the country. The sizes (land mass), and populations of, federal electoral districts in Canada vary drastically and, as a result, these maps are not a good gauge of voter support for political parties.

Since this problem is widespread political scientists, and political geographers, have attempted to find solutions to this problem. One increasingly-common approach has been to use what are called cartograms. Cartograms are maps in which the elements (in this case, electoral districts) are usually transformed in such as way as to maintain their connections to neighbours (contiguous cartograms), but to either increase or decrease the area of the specific electoral district in order to match it to a common variable. A variable often used in the transformation of electoral maps is population size. Thus, in a completed cartogram, the size of the electoral districts is not the actual land mass of the electoral district, but is proportional to the population of the electoral district (sometimes the number of voters, or the size of the electorate is used instead of population). It’s no surprise, then, that cartograms are also called “value-by-area” maps.

Cartograms are used by geographers and social scientists to depict a wide variety of phenomena. Here are some examples. The first one is a global cartogram for which the size of the area in each country is equivalent to total public health spending by that country. We can easily see that most of the world’s spending on public health occurs in the rich countries of the global north.

Here’s one more, depicting the global share of organic agriculture, by country.

Below, I have created a cartogram that has transformed the standard electoral map of the 2019 Canadian federal election into one in which the size of the electoral districts is mostly proportional to their populations. By “mostly” I mean that they’re not perfectly proportional, since the difference in sizes between the largest and smallest districts is so large the algorithm eventually stabilizes without creating completely equal-sized electoral districts.

This map more accurately conveys the nature of political partisan support (at least as it relates to the winning of electoral districts) across the country during the 2019 election, and provides visual evidence for the reality of an election in which the Liberal Party (red) won a plurality of the seats in the federal parliament (House of Commons). Because urban districts are much smaller than rural districts, the strength of Liberal Party support in Canada’s two largest cities–Toronto and Montreal–is obfuscated by the traditional area-based electoral map, but becomes evident in this cartogram.

The next map in this series will analyze another approach to geographically-based electoral maps–the hexagon map.

Here’s the R code for the cartogram above. Here, the original R-spatial data object–can_sf–is the base for the calculation of the cartogram data.

## Here is the code to generate the cartogram object:

can_carto_sf = cartogram_cont(can_sf, "Population_2016", itermax=50)

## Now, the map, using ggplot2

gg.can.can.carto <- ggplot(data = can_carto_sf) +
  geom_sf(aes(fill = partywinner_2019), col="black", lwd=0.075) + 
  scale_fill_manual(values=c("#33B2CC","#1A4782","#3D9B35","#D71920","#F37021","#2B2D2F"),name ="Party (2019)") +
  labs(title = "Cartogram of Canadian Federal Election Results \u2013 October 2019",
       subtitle = "(by Political Party and Electoral District)") +
  theme_void() + 
        legend.text = element_text(size = 16),
        plot.title = element_text(hjust = 0.5, size=20, vjust=2, face="bold"),
        plot.subtitle = element_text(hjust=0.5, size=18, vjust=2, face="bold"),
        legend.position = "bottom",
        plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"), = margin(0,0,30,0),
        legend.key.size = unit(0.75, "cm"),
        panel.border = element_rect(colour = "black", fill=NA, size=1.5))

Data Visualization #2–Animations aid in Conveying Change

The first entry in my 30-day (it will actually be 30 posts over about 2 months) data visualization challenge argued that geographically-based electoral maps have many drawbacks as data visualization techniques. I demonstrated by using the results from the 2017 and 2020 British Columbia (BC) provincial elections as supporting evidence.

Although there were some significant political changes over the course of the two elections, these were poorly-represented by these maps. Only when we zoomed into the population centres of southwestern BC were we able to partially convey the changes that had occurred. We could have made our effort to convey the underlying movement in political party support between 2017 and 2020 a bit more obvious by using animated maps, rather than the static ones that were used.

When it comes to representing change over time, animated graphs can be very useful (as long as they aren’t too complicated and busy) and are advantageous to static maps.

Below we can find the maps in the original animated to more clearly show the changes over time. Here’s the map of the whole province:

The change between 2017 and 2020 is made clear by a jarring change in the map, where a bit more NDP-orange shows up, replacing the BCLP-blue (see the previous post for descriptions of the two parties). Otherwise, there doesn’t seem to be much change in the province overall.

We know, however, that the drastic changes that took place did so in the very tiniest southwestern corner of the BC mainland. Let’s zoom in there to have a look.

We can now more clearly see the change in results (in terms of electoral districts won) between 2017 and 2020 in this populous region. Not only did the NDP (orange) win many seats in the eastern Vancouver suburbs that had not only been won by the BCLP in 2017 but had been a bastion of support for the right-wing vote over many decades, but the NDP candidate in the Victoria-area district of Oak Bay-Gordon Head won a seat that had previously been held by the former leader of BC Green Party, Andrew Weaver (it’s the small piece of green, that changes to orange, in the eastern part of the lower orange horizontal band on the lower-left of the map) . Are these changes the harbinger of a sea-change in BC provincial politics, or are they just an anomalous blip?

Going back to my original point about these types of maps being poor representations of the underlying change in voters’ preferences, we don’t know much about the level of support for the respective parties in any of these electoral districts. All that we do know, based on the “first-past-the-post” electoral system used by BC at the provincial level, is the party whose candidate finished with the most votes in each of these electoral districts. We don’t know if a district newly-won by the NDP candidate was by one vote, or by 10,000 votes. In future posts, I’ll present graphs that will allow us to answer this question visually.

Our next posts will focus on alternatives to the basic electoral geographic maps that we’ve used in these first two posts.

Data Visualization #1–Electoral Results Map

The data visualization with which I begin my 30-day challenge is a standard electoral map of the recently-completed British Columbia provincial election, the result of which is a solid (57 of 87 seats) majority government for the New Democratic Party, led by Premier John Horgan.

It’s a bit ironic that I begin with this type of map since, for a few reasons, I consider them to be poor representations of data. First, because electoral districts are mapped on the basis of territory (geography) they misrepresent and distort what they are purportedly meant to gauge–electoral support (by actual voters, not acreage) for political parties.

Though there are other pitfalls with basic electoral maps I’ll highlight what I believe to be the second major issue with them. They take what is a multinomial concept–voter support for each of a number of political parties in a specific electoral district–and summarize them into a single data point–which of the many parties in that electoral district has “won” that district. Most of these maps provide no information about either a) the relative size of the winning party’s victory in that district, or b) how many other parties competed in that district and how well each of these parties did in that district.

Although the standard electoral map provides some basic electoral information about the electoral outcome (and it is undeniable that in terms of determining who wins and runs government, it is the single most important piece of information), they are “information-poor” and in future posts I’ll show how researchers have tried to make their electoral maps more information-rich.

But, first, here are some standard electoral maps for the last two provincial elections in British Columbia (BC)–May 2017 and October 2020. Like many jurisdictions in North America, BC is comprised of relatively densely-populated urban areas–the Lower Mainland and southern Vancouver Island–combined with sparsely-populated hinterlands–forests, mountains, and deserts. Moreover, there is a strong partisan split between these areas–with the conservative BC Liberal Party (BCLP–the story of why the provincial Liberal Party in BC is actually the home of BC’s conservatives is too long for this post) dominating in the hinterlands while the left-centre New Democratic Party (NDP) generally runs more strongly in the urban southeast of the province. In Canada, electoral districts are often referred to as “ridings”, or “constituencies.”

If one were completely ignorant about BC’s provincial politics one would assume, simply from a quick perusal of the map above, that the “blue” party–the BC Liberal Party–was the dominant party in BC. In addition, it would seem that there was very little change in partisan support and electoral outcomes across the electoral districts over the course of the two elections. In fact, the BCLP lost 15 districts, all of which were won by the NDP. (The Green Party lost one of the districts it had won to the NDP as well, for a total NDP gain of 16 districts (seats on the provincial legislature) between 2017 and 2020. This factual story of a substantial increase in NDP seats in the legislature is poorly conveyed by the maps above because the maps match partisanship to area and not to voters.

To repeat, in future posts I will demonstrate some methods researchers have used to mitigate the problem of area-based electoral maps, but for now I’ll show that once we zoom into the southwest corner of the province (where most of the population resides) a simple electoral map does do a better job of conveying the change in electoral fortunes of the BCLP and NDP over the last two elections This is because there is a stronger link between area and population (voters) in these districts than in BC as a whole.

You can more easily see the orange NDP wave overtaking the population centres of the Lower Mainland (greater Vancouver area–upper left part of each map) and, to a lesser extent, southern Vancouver Island. Data visualization #2 will demonstrate how to create animated maps of the above, which more appropriately convey the nature of the change in each of the electoral districts over the two elections.

Here’s the R code that I used to create the two images in my post, using the ggplot2 package.

## Once you have created a sf_object in R (which I have named bc_final_sf, the following commands will create the image above.

## First plot--2017
gg.ed.1 <- ggplot(bc_final_sf) +
  geom_sf(aes(fill = Winner_2017), col="black", lwd=0.025) + 
  scale_fill_manual(values=c("#295AB1","#26B44F","#ED8200")) +
  labs(title = "May 2017") +
  theme_void() + 
        plot.title = element_text(hjust = 0.5, size=12, face="bold"),
        legend.position = "none")

## Second plot--2020
gg.ed.2 <- ggplot(bc_final_final) +
  geom_sf(aes(fill = Winner_2020), col="black", lwd=0.025) + 
  scale_fill_manual(values=c("#295AB1","#26B44F","#ED8200")) +
  labs(title = "October 2020") +
  theme_void() +
        plot.title = element_text(hjust = 0.5, size=12, face="bold"),
        legend.position = "bottom")

## Combine the plots and do some annotation <- gg.ed.1 + gg.ed.2 & theme(legend.position = "bottom") <- + plot_layout(guides = "collect") + 
  title = "British Columbia Election Results \u2013 by Riding",
  theme = theme(plot.title = element_text(size = 16, hjust=0.5, face="bold"))
  )    # to view the first image above

## For the maps of the Lower Mainland and southern Vancouver Island, the only difference is that we add the following line to each of the individual maps:

coord_sf(xlim = c(1140000,1300000), ylim = c(350000, 500000))  

## so, we get 

gg.ed.lmsvi.1 <- ggplot(bc_final_final) +
  geom_sf(aes(fill = Winner_2017), col="black", lwd=0.075) + 
  coord_sf(xlim = c(1140000,1300000), ylim = c(350000, 500000)) + 
  scale_fill_manual(values=c("#295AB1","#26B44F","#ED8200")) +
  labs(title = "May 2017") +
  theme_void() + 
        plot.title = element_text(hjust = 0.5, size=10, vjust=3),  
        legend.position = "none")

‘Controlling’ for confounding variables graphically

As we’ve learned (ad nauseum) basing causal claims on a simple bivariate relationship is fraught with potential roadblocks. Even though there may be a strong, and statistically significant, relationship between an independent and dependent variable, if we haven’t controlled for potentially confounding variables, we can not state with any measure of confidence that the putative relationship between the IV and DV is causal. We should always statistically control for any (and all) potentially confounding variables.

Additionally, it is often desirable to dig deeper into the data and find out if the units-of-analysis are fundamentally different on the basis of some other variable. Below you may find two plots–each of which shows the relationship between margin of victory and electoral turnout (by electoral district) for the 2017 British Columbia provincial election. The first graph plots a simple bivariate relationship, while the second plot breaks that initial relationship down by political party (which party won the electoral district). It could conceivably be the case that the relationship between turnout and margin of victory varies across the values of political party. That is, the relationship may hold in those electoral districts where party A won, but not hold in those in which party B won.

We can see here that there is little evidence to suggest a difference in the relationship based on which party won the electoral district. Can you think of another `third’ variable that may cause the relationship between turnout and margin of victory to be systematically different across different values of that variable? What about rural-versus-urban electoral districts?

Here are the plots:

Using R to help simulate the NHL Draft Lottery

Upon discussing the NHL game results file, I mentioned to a few of you that I have used R to generate an NHL draft lottery simulator. It’s quite simple, although you do have to install the XML package, which allows us to use R to ‘scrape’ websites. We use this functionality in order to create the lottery simulator dynamically, depending on the previous evening’s (afternoon’s) game results.

Here’s the code: (remember to un-comment the install.packages(“XML”) command the first time you run the simulator). Copy and paste this code into your R console, or save it as an R script file and run it as source.

# R code to simulate the NHL Draft Lottery
# The current draft order of teams obviously changes on a
# game-to-game basis. We have to create a vector of teams in order
# from 31st to 17th place that can be updated on a game-by-game
# (or dynamic) basis.

# To do this, we can use R's ability to interrogate, scrape,
# and parse web pages.

#install.packages("XML") # NOTE: Uncomment and install this
#                                package before running this
#                                script the first time.

require(XML) # We need this for parsing of the html code

url <- ("") #retrieve the web page we are using as the data source
doc <- htmlParse(url) #parse the page to extract info we'll need.

# From investigation of the web page's source code, we see that the
# team names can be found in the element [td class="text-left"]
# and the odds of each team winning the lottery are in the
# element [td class="text-right"]. Without this
# information, we wouldn't know where to tell R to find the elements
# of data that we'd like to extract from the web page.
# Now we can use xml to extract the data values we need.

result.teams <- unlist(xpathApply(doc, "//td[contains(@class,'text-left')]",xmlValue)) #unlist used to create vector
result.odds <- unlist(xpathApply(doc, "//td[contains(@class,'text-right')]",xmlValue))

# The teams elements are returned as strings (character), which is
# appropriate. Also only non-playoff teams are included, which makes
# it easier for us. The odds elements are returned as strings as
# well (and percentages), which is problematic.
# First, we have 31 elements (the values of 16 of which--the playoff
# teams --are returned as missing). We only want 15 (the non-playoff
# teams).
# Second, in these remaining # 15 elements we have to remove the
# "%" character from each.
# Third, we have to convert the character format to numeric.
# The code below does the clean-up. 

result.odds <- result.odds[1:15]
result.odds <- as.numeric(gsub("%"," ",result.odds)) #remove % symbol
teamodds.df <- data.frame("teams"=result.teams[1:15],"odds"=result.odds, stringsAsFactors=FALSE) #Create data frame for easier display 

# Let's print a nice table of the teams, with up-to-date
# corresponding odds. 

print(teamodds.df) # odds are out of 100 

#Now, let's finally 'run' the lottery, and print the winner's name.

cat("The winner of the 2018 NHL Draft Lottery is the:", sample(teamodds.df$team,1,prob=teamodds.df$odds),sep="") 


Research Results, R coding, and mistakes you can blame on your research assistant

I have just graded and returned the second lab assignment for my introductory research methods class in International Studies (IS240). The lab required the students to answer questions using the help of the R statistical program (which, you may not know, is every pirate’s favourite statistical program).

The final homework problem asked students to find a question in the World Values Survey (WVS) that tapped into homophobic sentiment and determine which of four countries under study–Canada, Egypt, Italy, Thailand–could be considered to be the most homophobic, based only on that single question.

More than a handful of you used the code below to try and determine how the respondents in each country answered question v38. First, here is a screenshot from the WVS codebook:

wvs_v38Students (rightfully, I think) argued that those who mentioned “Homosexuals” amongst the groups of people they would not want as neighbours can be considered to be more homophobic than those who didn’t mention homosexuals in their responses. (Of course, this may not be the case if there are different levels of social desirability bias across countries.) Moreover, students hypothesized that the higher the proportion of mentions of homosexuals, the more homophobic is that country.

But, when it came time to find these proportions some students made a mistake. Let’s assume that the student wanted to know the proportion of Canadian respondents who mentioned (and didn’t mention) homosexuals as persons they wouldn’t want to have as neighbours.

Here is the code they used (four.df is the data frame name, v38 is the variable in question, and country is the country variable):

prop.table(table(four.df$v38=="mentioned" | four.df$country=="canada"))

0.372808 0.627192

Thus, these students concluded that almost 63% of Canadian respondents mentioned homosexuals as persons they did not want to have as neighbours. That’s downright un-neighbourly of us allegedly tolerant Canadians, don’tcha think?. Indeed, when compared with the other two countries (Egyptians weren’t asked this question), Canadians come off as more homophobic than either the Italians or the Thais.

prop.table(table(four.df$v38=="mentioned" | four.df$country=="italy"))

0.6106025 0.3893975

prop.table(table(four.df$v38=="mentioned" | four.df$country=="thailand"))

0.5556995 0.4443005

So, is it true that Canadians are really more homophobic than either Italians or Thais? This may be a simple homework assignment but these kinds of mistakes do happen in the real academic world, and fame (and sometimes even fortune–yes, even in academia a precious few can make a relative fortune) is often the result as these seemingly unconventional findings often cause others to notice. There is an inherent publishing bias towards results that seem to run contrary to conventional wisdom (or bias). The finding that Canadians (widely seen as amongst the most tolerant of God’s children) are really quite homophobic (I mean, close to 2/3 of us allegedly don’t want homosexuals, or any LGBT persons, as neighbours) is radical and a researcher touting these findings would be able to locate a willing publisher in no time!

But, what is really going on here? Well, the problem is a single incorrect symbol that changes the findings dramatically. Let’s go back to the code:

prop.table(table(four.df$v38=="mentioned" | four.df$country=="canada"))

The culprit is the | (“or”) character. What these students are asking R to do is to search their data and find the proportion of all responses for which the respondent either mentioned that they wouldn’t want homosexuals as neighbours OR the respondent is from Canada. Oh, oh! They should have used the & symbol instead of the | symbol to get the proportion of Canadian who mentioned homosexuals in v38.

To understand visually what’s happening let’s take a look at the following venn diagram (see the attached video above for Ali G’s clever use of what he calls “zenn” diagrams to find the perfect target market for his “ice cream glove” idea; the code for how to create this diagram in R is at the end of this post). What we want is the intersection of the blue and red areas (the purple area). What the students’ coding has given us is the sum of (all of!) the blue and (all of!) the red areas.

To get the raw number of Canadians who answered “mentioned” to v38 we need the following code:

table(four.df$v38=="mentioned" & four.df$v2=="canada")

7457   304


But what if you then created a proportional table out of this? You still wouldn’t get the correct answer, which should be the proportion that the purple area on the venn diagram comprises of the total red area.

prop.table(table(four.df$v38=="mentioned" & four.df$v2=="canada"))

0.96082979 0.03917021

Just eyeballing the venn diagram we can be sure that the proportion of homophobic Canadians is larger than 3.9%. What we need is the proportion of Canadian respondents only(!) who mentioned homosexuals in v38. The code for that is:


mentioned not mentioned
0.1404806     0.8595194

So, only about 14% of Canadians can be considered to have given a homophobic response, not the 62% our students had calculated. What are the comparative results for Italy and Thailand, respectively?


mentioned not mentioned
0.235546      0.764454


mentioned not mentioned
0.3372781     0.6627219

The moral of the story: if you mistakenly find something in your data that runs against conventional wisdom and it gets published, but someone comes along after publication and demonstrates that you’ve made a mistake, just blame it on a poorly-paid research assistant’s coding mistake.

Here’s a way to do the above using what is called a for loop:

for (i in 1:length(four)) {
+ print(prop.table(table(four.df$v38[four.df$v2==four[i]])))
+ print(four[i])
+ }

mentioned not mentioned
0.1404806     0.8595194
[1] "canada"

mentioned not mentioned

[1] "egypt"

mentioned not mentioned
0.235546      0.764454
[1] "italy"

mentioned not mentioned
0.3372781     0.6627219
[1] "thailand"

Here’s the R code to draw the venn diagram above:



v1<-venneuler(c("Mentioned"=sum(four.df$v38=="mentioned",na.rm=T),"Canada"=sum(four.df$v2=="canada",na.rm=T),"Mentioned&Canada"=sum(four.df$v2=="canada" & four.df$v38=="mentioned",na.rm=T)))

plot(v1,main="Venn Diagram of Canada and v38 from WVS", sub="v38='I wouldn't want to live next to a homosexual'", col=c("blue","red"))