December 2020 – Clouds, Clocks, and Sitting at Tables

Data Visualization #7–Treemaps using US Counties and 2016 Presidential Vote

While we’re still waiting on the availability of official county-level results for the 2020 US Presidential Election*, I thought I’d create a treemap of the county-level results from the 2016 election. You may be thinking to yourself, “What is a treemap?”

Treemaps are ideal for displaying large amounts of hierarchically structured (tree-structured) data. The space in the visualization is split up into rectangles that are sized and ordered by a quantitative variable.
Link to Source

Treemaps, therefore, can help us visualize the relationships within our quantitative data in a unique, visually-pleasing, and meaningfully effective manner. Let’s see how with the example of the US 2016 Presidential Election.

Here’s a picture of then newly-elected President Donald Trump looking at a map given to him by his advisers depicting the results of the 2016 election. This specific depiction of the results overstates the extent of the support across the USA for Trump in the 2016 election. As those in the know often say, “land mass does not vote.” Indeed, if one were ignorant about US politics, and US political demography, looking at that map one would most likely be perplexed were one told that the “blue” candidate actually won 3 million more votes than did the “red” candidate.

Here is my reproduction of these data–using publicly-available data from MIT Election Data and Science Lab, 2018, “County Presidential Election Returns 2000-2016”, https://doi.org/10.7910/DVN/VOQCHQ, Harvard Dataverse, V6, UNF:6:ZZe1xuZ5H2l4NUiSRcRf8Q== [fileUNF]. I’ve added the R-code at the end of this post.

We can see that the vast majority of counties are small, and that voters in these counties were more likely to have voted for Trump than for Clinton. Indeed, Clinton won fewer than 16% of all counties.

The problem with this map is that it essentially dichotomizes quantitative data into qualitative data. To be precise, the decision whether to colour a county blue or red is made simply on the basis of whether, of those who voted, more voted for Trump, or for Clinton. If a county voted 51-49 for Trump, it gets a red colour. If a county voted 1,000,000-100,000 for Clinton it gets coloured blue. And, to make things even more confusing, the amount of red (or blue) that each county receives is related ONLY to county land area, and doesn’t take account of the number of voters.

As is the case in many parts of the world today, the US is increasingly split demographically–with those living in rural areas (and suburbs/exurbs) voting for the conservative parties (Republican) and those in the urban areas voting for liberal parties (Democratic). We see this clearly in the map above. The problem with US counties is that they are not uniform either in terms of their land area, or their population. There are apartment buildings in New York City and Los Angeles that have more residents than some counties.

We can use treemaps to more “accurately” depict electoral outcomes. By accurately, I mean that the visual representation of the data more closely reflects how many voted for each candidate (party).

The first example below represents the vote at the county level and describes two quantitative variables. The size of each rectangle represents the total number of voters in each county–the larger the rectangle, the greater the number of voters in that county. The second variable, which is mapped using the colour scale, represents the difference in raw vote totals between the two candidates. Reddish shades denote a county that was won by Trump, while bluish shades represent counties won by Clinton.

There are a couple of things to notice. First, the wide disparity in the total number of voters across the counties. Second, we see that most of the counties have shades that are only very lightly blue (or red) and look mostly white. This is because the range on the variable must be so expansive in order to include outliers like Los Angeles and Cook Counties. Thus, in the vast majority of US counties the raw vote total differences between Trump’s totals and Clinton’s totals are in the 1000s range. This is why Trump was able to win more than 84% of US counties and still lose the popular vote by more than 3 million.

Our next (and final) treemap is similar to the one above except that the scale for the colouring is not the raw vote difference between Trump and Clinton in each county, but the percentage-point differential in vote between the two candidates.
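To make the distinction between the two colour variables concrete, here is a minimal sketch in base R. The data frame, its column names, and the vote counts are all made up for illustration (they are not from the MIT dataset), and the sign convention (Clinton minus Trump) is an assumption on my part.

```r
# Hypothetical two-county example; names and counts are illustrative only
counties <- data.frame(
  NAME    = c("Small County", "Big County"),
  trump   = c(510, 100000),
  clinton = c(490, 1000000)
)

# Area variable for the treemap: total votes for the two candidates
counties$totalvote <- counties$trump + counties$clinton

# Raw margin (assumed convention: positive = Clinton lead, negative = Trump lead)
counties$raw_margin <- counties$clinton - counties$trump

# Percentage-point margin, always bounded between -100 and 100
counties$pct_margin <- 100 * counties$raw_margin / counties$totalvote

print(counties)
```

On the raw scale the small county's 20-vote margin is invisible next to the big county's 900,000-vote margin, whereas on the percentage-point scale the two counties sit at roughly -2 and +82, both within the same fixed range.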

We see much more red and blue in this map because the scale is confined to 100% Trump win to 100% Clinton win. Notice the striking disparity in where the blue and red colours, respectively, are found. The reddish shades dominate in small-population counties (in the top-right corner of each state subgroup), while the bluish shades dominate in large-population counties (in the bottom-left corners of each state subgroup). Finally, the larger (greater population) counties tend to be much smaller geographically than the less-populous counties, which is why the map on Trump’s desk looks like it does.

library(ggplot2)
library(sf)  # geom_sf() needs sf to draw the county geometries

gg.geom.uscounty <- ggplot(us_df_final_2163) +
        geom_sf(aes(fill = winner), col="black", lwd=0.1) + 
        scale_fill_manual(values=c("blue","red"), labels=c("Clinton","Trump"), breaks=c("Democrat","Republican")) + # breaks...to get rid of NA
        labs(title = "US 2016 Presidential Election Results by County ('Lower 48')") +
        theme_void() + 
        coord_sf(xlim = c(-1900000,2400000), ylim = c(-2050000, 625000)) +
        theme(legend.title=element_blank(),
              legend.text = element_text(size = 12),
              plot.title = element_text(hjust = 0.5, size=16, vjust=2),
              legend.position = "bottom",
              plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"),
              legend.box.margin = margin(0,0,30,0),
              legend.key.size = unit(0.75, "cm"))

gg.geom.uscounty

R Code for treemaps: (this is for the “total vote” variable. Replace that variable with a “percentage-vote” variable–with appropriate limits and breaks (-100,100) because you are now working with percentages).

library(ggplot2)
library(treemapify)  # provides geom_treemap() and the related treemap geoms

gg.tree.tot <- ggplot(us_df_final, aes(area = totalvote, fill = vote_win_diff, label=NAME, subgroup=State.Name)) +
        geom_treemap() +
        geom_treemap_subgroup_border(colour="black", size=2) +
        geom_treemap_subgroup_text(place = "centre", grow=FALSE, alpha = 0.5, colour =
                                           "black", fontface = "italic", min.size = 0) +
        geom_treemap_text(colour = "black", place = "center", reflow = TRUE) +
        scale_fill_distiller(type = "div", palette=5, direction=1, guide="coloursteps", limits=c(-2000000,2000000), breaks=seq(-2000000,2000000, by=500000),
        labels=c("2000000","1500000","1000000","500000","0","500000","1000000","1500000","2000000")) +
        labs(title = "US 2016 Presidential Election by County (Areas Proportional to Total Votes in County)",
             fill="Difference\u2013County Vote Totals between Trump (red) & Clinton (blue)") + 
        theme(legend.key.height = unit(0.75, 'cm'),
              legend.key.width = unit(2.35,"cm"),
              legend.text = element_text(size=8),
              plot.title = element_text(hjust = 0.5, size=14, vjust=1),
              legend.position = "bottom") +
        guides(fill = guide_coloursteps(title.position="top", title.hjust = 0.5),
               size = guide_legend(title.position="top", title.hjust = 0.5))
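For the percentage-point version of the treemap, the limits, breaks and labels could be generated rather than typed out. This is only a sketch in base R: the 25-point step and the variable names are my assumptions, and the resulting vectors would be passed to the limits, breaks and labels arguments of scale_fill_distiller() in place of the raw-vote values.

```r
# Limits and breaks for a percentage-point scale running from -100 to 100
pct_limits <- c(-100, 100)
pct_breaks <- seq(pct_limits[1], pct_limits[2], by = 25)  # step size is an assumption

# Show magnitudes only, since the colour already says which candidate led
pct_labels <- paste0(abs(pct_breaks), "%")

print(pct_labels)
```

Taking the absolute value mirrors the hand-written labels in the code above, where both ends of the scale are labelled with positive numbers.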

* The electoral process that determines who becomes president of the United States is complicated. In effect, it is a series of elections that are run by individual states, and not a single federally-run election like it is in most presidential systems.

Data Visualization #6–US Counties are [essentially] Meaningless in Presidential Elections

The inspiration (so to speak) for this latest instalment of my Data Visualization series is a meme that I have been seeing spread across social media in the wake of the recent US Presidential Election. The meme, in essence, notes the number of US counties (there are over 3000) that were “won” by the incumbent, Donald J. Trump. Indeed, it seems as though the challenger, Joe Biden, ironically won the most votes of any US Presidential candidate in US history while simultaneously having “won” the lowest percentage of counties (about 17%) of any winner of the Presidency ever.

Why did I place “won” in quotation marks? Two reasons: first, I am assuming that the authors of this meme suggest that Trump “won” these counties by having won (at least) a plurality of the vote in each. Which, I suppose, is true. The more important reason that I put “won” in quotation marks above is because US counties are effectively meaningless when it comes to determining who wins the US Presidency. They are only important insofar as receiving more votes than one’s opponent in any individual county helps increase the odds of winning what is important–a plurality of the vote in any individual state (or in Congressional Districts in the cases of Nebraska and Maine). Counties have no official weight when determining electoral college votes, and it doesn’t matter how many counties a candidate wins, as long as they reach at least 270 electoral votes. Counties in the USA vary in population from fewer than 100 (Kalawao County in Hawaii) to over 10,000,000 (Los Angeles County in California). So, discussing who “won” more counties is essentially meaningless.

Here’s an example of how absurd referring to counties won becomes. The aforementioned Los Angeles County is a county that Joe Biden handily “won” in November, by a margin of 72.5% to 27.5% for Donald Trump. In short, Trump was walloped by Biden in LA County. Yet, when you compare Trump’s vote in LA County (about 1.15 million) to his total vote in all of the states (and DC) it might shock you to learn that Trump won more votes in LA County than he won in 25 individual states (and in DC). For example, Trump won more total votes in LA County (which, remember, he lost 72.5%-27.5%) than he won in the state of Oklahoma, where he won all 7 Electoral College votes. Moreover, Biden won more votes in LA County alone than Donald Trump won in each of all but three states–Florida, Texas, and Ohio. To be clear, for example, Biden won more total votes in LA County (which, alone, didn’t win him a single Electoral College vote) than Trump won in North Carolina (for which Trump won 15 Electoral College votes).
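As a quick sanity check on those figures, the approximate Trump total and the two percentages quoted above imply a Biden total in LA County of about 3 million. The short calculation below assumes the 72.5% and 27.5% figures are shares of the two-candidate vote (they sum to 100), and the variable names are mine.

```r
trump_la_votes <- 1.15e6  # approximate figure quoted above
trump_share    <- 0.275   # assumed to be a share of the two-candidate vote
biden_share    <- 0.725

# Two-candidate total implied by Trump's votes and his share
implied_total <- trump_la_votes / trump_share

# Biden's implied total, in millions
implied_biden <- implied_total * biden_share
round(implied_biden / 1e6, 2)
```

The result is about 3.03 million, and Biden's implied margin over Trump in this one county is close to 1.9 million votes.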

Here is a bar plot that I’ve created to visualize these data (click on the image to open a larger version). The yellow bar at the far-right represents the number of votes won by Biden in LA County (just over 3 million). The other yellow bar represents the votes won by Trump in LA County (just over 1 million). Every other bar is the number of votes won by Trump in each of the states (and DC) listed below (Texas, Ohio, and Florida are missing because Trump won more votes in each of those states than Biden won in LA County). The red bars are states won by Trump, while the blue bars represent states won by Biden. Remember, each of the bars (except for the one on the far-right) represents the number of votes Donald Trump won in that state (or in LA County).

Data Visualization #5–Canadian Residential Schools–plotting change in number and federal government

At the end of Data Visualization #4 I promised to look at a couple of alternative solutions to the problem of outliers in our data. I’ll have to do so in my next data visualization (#6) because I’d like to take some time to chart some data that I have been interested in for a while and that were made more topical by some comments, unearthed a few days ago, that were made by the leader of Canada’s federal Conservative Party, Erin O’Toole, on the issue of the history of residential schools in Canada. These schools were created for the various peoples of Canada’s First Nations and have a long and sordid history. If you are interested in learning more, here is the final report of Canada’s Truth and Reconciliation Commission.

I wanted to use a chart that is in the PDF version of that report as the basis for plotting the chart described above. Here is the original.

I was unable to find the raw data, so I had to do some work in R to extract the data from the line in the image. There are some great R packages (magick, and tidyverse) that can be used to help you with this task should the need arise. See here for an example.

Using the following code, I was able to reproduce fairly accurately the line in the graph above.

library(tidyverse)
library(magick)

im <- image_read("residential_schools_new.jpg")

## This saturates the pic to highlight the darkest lines
im_proc <- im %>% image_channel("saturation")


## This gets rid of things that are far enough away from black--play around with the %

im_proc2 <- im_proc %>% image_threshold("white", "80%")

## Finally, invert (negate) the image so that what we want to keep is white.

im_proc3 <- im_proc2 %>% image_negate()

## Now to extract the data.

dat <- image_data(im_proc3)[1,,] %>%
  as.data.frame() %>%
  mutate(Row = 1:nrow(.)) %>%
  select(Row, everything()) %>%
  mutate_all(as.character) %>%
  gather(key = Column, value = value, 2:ncol(.)) %>%
  mutate(Column = as.numeric(gsub("V", "", Column)),
         Row = as.numeric(Row),
         value = ifelse(value == "00", NA, 1)) %>%
  filter(!is.na(value))

# Eliminate duplicate rows: keep only the first detected pixel for each value of Row.

dat <- subset(dat, !duplicated(Row))

Here’s the initial result, using the ggplot2 package.

It’s a fairly accurate re-creation of the chart above, don’t you think? After some cleaning up of the data and adding data on Prime Ministerial terms during Canada’s history since 1867, we get the completed result (with R code below).
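The cleaning-up step is not shown, but part of it has to convert pixel positions into years and school counts (the Row.Rescale and Column.Rescale variables used in the final chart code). Here is one way that conversion might look: a simple linear rescaling in base R. The pixel positions of the axis ends, the toy data frame, and the helper name rescale_linear are all hypothetical, and the year and count ranges are my reading of the chart rather than values taken from the original code.

```r
# Linear map from a pixel range onto a data range
rescale_linear <- function(x, from, to) {
  to[1] + (x - from[1]) * (to[2] - to[1]) / (from[2] - from[1])
}

# Hypothetical pixel positions of the axis ends in the image
x_pixels <- c(60, 940)   # left and right ends of the x-axis
y_pixels <- c(520, 40)   # bottom and top of the y-axis (pixel rows count downward)

# Toy stand-in for the extracted data
dat_demo <- data.frame(Row = c(60, 500, 940), Column = c(520, 280, 40))

# Assumed data ranges: years 1867 to 1998, and 0 to 100 schools
dat_demo$Row.Rescale    <- rescale_linear(dat_demo$Row, x_pixels, c(1867, 1998))
dat_demo$Column.Rescale <- rescale_linear(dat_demo$Column, y_pixels, c(0, 100))

print(dat_demo)
```

Listing the y-axis pixels as bottom-then-top takes care of the fact that image rows are counted from the top of the picture, while the chart's values increase upward.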

We can see that there was an initial period of Canada’s history during which the number of schools operating increased. This period stopped with the First World War. Then there was a period of relative stabilization thereafter (some increase, then decrease through the 1940s and early 1950s), followed by an increase of about 10 years that began with Liberal Prime Minister Louis St. Laurent and continued under Conservative Prime Minister John Diefenbaker and Liberal Prime Minister Lester B. Pearson, during whose time in power the number of residential schools topped out. Upon the ascension to power of Liberal Prime Minister Pierre Elliott Trudeau, the number of residential schools began a drastic decline, which continued under subsequent Prime Ministers.

EDIT: After reading the initial report more closely, it looks like the end point of the original chart is meant to be 1998, not 1999, so I’ve recreated the chart with that updated piece of information. Nothing much changed, although it seems like the peak in the number of schools operating at any point in time was in about 1964, not a couple of years later as it had seemed. Here’s an excerpt from the report, from a section entitled Expansion and Decline:

From the 1880s onwards, residential school enrolment climbed annually. According to federal government annual reports, the peak enrolment of 11,539 was reached in the 1956–57 school year.¹⁴⁴ (For trends, see Graph 1.) Most of the residential schools were located in the northern and western regions of the country. With the exception of Mount Elgin and the Mohawk Institute, the Ontario schools were all in northern or northwestern Ontario. The only school in the Maritimes did not open until 1930.¹⁴⁵ Roman Catholic and Anglican missionaries opened the first two schools in Québec in the early 1930s.¹⁴⁶ It was not until later in that decade that the federal government began funding these schools.¹⁴⁷
The number of schools began to decline in the 1940s. Between 1940 and 1950, for example, ten school buildings were destroyed by fire.¹⁴⁸ As Graph 2 illustrates, this decrease was reversed in the mid-1950s, when the federal department of Northern Affairs and National Resources dramatically expanded the school system in the Northwest Territories and northern Québec. Prior to that time, residential schooling in the North was largely restricted to the Yukon and the Mackenzie Valley in the Northwest Territories. Large residences were built in communities such as Inuvik, Yellowknife, Whitehorse, Churchill, and eventually Iqaluit (formerly Frobisher Bay). This expansion was undertaken despite reports that recommended against the establishment of residential schools, since they would not provide children with the skills necessary to live in the North, skills they otherwise would have acquired in their home communities.¹⁴⁹ The creation of the large hostels was accompanied by the opening of what were termed “small hostels” in the smaller and more remote communities of the eastern Arctic and the western Northwest Territories.
Honouring the Truth, Reconciling for the Future:
Summary of the Final Report of the Truth and Reconciliation Commission of Canada https://web-trc.ca/

A couple of final notes: one can easily see (visualize) from this chart the domination of Liberal Party rule during the 20th century. Second, how many of you knew that there had been a couple of coalition governments in the early 20th century?

Here is the R code for the final chart:

gg.res.schools <- ggplot(data=dat) + 
  labs(title = "Canadian Residential Schools \u2013 1867-1999",
       subtitle="(Number of Schools in Operation & Federal Party in Power)", 
       y = ("Number of Schools"), x = " ") +
  geom_line(aes(x=Row.Rescale, y=Column.Rescale), color='black', lwd=0.75)  +
  scale_y_continuous(expand = c(0,0), limits=c(0,100)) +
  scale_x_continuous(limits=c(1866,2000)) + 
  geom_rect(data=pm.df,
            mapping=aes(xmin=Date_Begin.1, xmax=Date_End.1, 
                        ymin=rep(0,25), ymax=rep(100,25), fill=Government)) +
              scale_fill_manual(values = alpha(c("blue", "red", "green", "yellow"), .6)) +
  theme_bw() +
  theme(legend.title=element_blank(),
        plot.title = element_text(hjust = 0.5, size=16),
        plot.subtitle = element_text(hjust= 0.5, size=13),
        axis.text.y = element_text(size = 8))

gg.res.final.plot <- gg.res.schools + geom_line(aes(x=Row.Rescale, y=Column.Rescale), color='black', lwd=0.75, data=dat)

Data Visualization #4–Bar plots with widely-dispersed data

A common issue when trying to plot numerical data is the problem of outliers. When working with data the term outliers is often used in the statistical sense, referring to certain data values that are “far away” from the rest of the data (in statistics, this usually means data values that are a number of standard deviations away from the rest of the data). This can be especially problematic when using common bar plots, particularly when the minimum and maximum values are so far apart that it leads to difficulty representing all of the values visually.

For an example of this in real life, let’s go back to our British Columbia provincial electoral map data. As I demonstrated in my first data visualization, area-based (rather than population-, or voter-based) maps are often misleading. The primary reason for this is that the electoral districts are not nearly the same size and don’t have the same numbers of residents. In British Columbia, a large province (almost one million square kilometres in area), this is not a surprise, especially because of the manner in which the relatively small population (just over five million) is haphazardly-dispersed across the province.

We can easily calculate the population density of each of BC’s 87 provincial electoral districts, using data about district population size and calculating the area of each district from the geographic data we used to create the maps in the first data visualization post.

Here is a summary of the data (the variable is Pop.Den.km2):

(s1<-summary(bc_final_final$Pop.Den.km2))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    0.101     9.402   355.269  1587.483  2375.926 12616.797

The “Min.” and “Max.” are the minimum and maximum values, respectively, of the population density (persons per square kilometre) of BC’s 87 provincial electoral districts. We see a dramatic difference between the maximum and minimum values. In fact,

paste("The most densely-populated district is ", round(s1[6]/s1[1],0), "times as dense as the least densely-populated district.")

[1] "The most densely-populated district is 124551 times as dense as the least densely-populated district."

That is astounding, and if one were to simply plot these values on a bar chart, one would immediately recognize the difficulty with representing these data accurately. Let’s use a horizontal bar chart to demonstrate:

Here, we see that the larger numbers are so large, and the smaller numbers so comparatively small, that the lowest two dozen, or so, districts do not even seem to register. (When I first plotted this, I thought that I had made some sort of mistake and that the values at the bottom were missing. It turns out that the value represented by a single pixel was larger than the values of the districts at the bottom of the bar plot.)
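The single-pixel point can be checked with a little arithmetic on the summary values above. The plot width of 1000 pixels is a hypothetical figure (the real width depends on the graphics device), and the variable names are mine.

```r
max_density <- 12616.797  # "Max." from the summary above
q1_density  <- 9.402      # "1st Qu." from the summary above
min_density <- 0.101      # "Min." from the summary above
plot_width_px <- 1000     # hypothetical width of the plotting area, in pixels

# Density represented by one pixel of bar length when the longest bar spans the full width
density_per_px <- max_density / plot_width_px

# Bar lengths, in pixels, for the first-quartile value and the least dense district
q1_bar_px  <- q1_density / density_per_px
min_bar_px <- min_density / density_per_px

round(c(per_pixel = density_per_px, q1 = q1_bar_px, min = min_bar_px), 3)
```

Under that assumption one pixel stands for about 12.6 persons per square kilometre, which is more than the first-quartile density. So roughly the bottom quarter of the 87 districts would get bars shorter than a single pixel, in line with the two dozen or so that seem to vanish.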

This is obviously an issue–we don’t want to lose valuable information. There are alternative plots we could use, but we want to keep the information (political party) embodied in the various colours of the bar plot, so we’d like to find a bar plot solution. We’ll describe and assess two potential solutions in the next post in the series.

Data Visualization #3–Cartograms as an alternative to standard area-based electoral maps

In my first post of this series I explained at length why basic geographically-based electoral maps are not very good at conveying the phenomena of interest (see that post for more detail), and alluded to the increased use by political geographers and political scientists of alternative methods of “mapping” the required information that convey more clearly the message(s) contained in the data.

Let’s examine this further using the map above. This map shows the results of the Canadian federal (national) election of October 2019. The respective proportions of total area “won” by each political party as depicted in the map above are not easily translated into either the relative vote share of the parties, or the relative number of seats won. Someone ignorant about Canadian federal politics would see a relatively similar total amount of red, blue, and orange, and assume that these parties had relatively equal support across the country. The sizes (land mass), and populations of, federal electoral districts in Canada vary drastically and, as a result, these maps are not a good gauge of voter support for political parties.

Since this problem is widespread, political scientists and political geographers have attempted to find solutions to it. One increasingly-common approach has been to use what are called cartograms. Cartograms are maps in which the elements (in this case, electoral districts) are usually transformed in such a way as to maintain their connections to neighbours (contiguous cartograms), but to either increase or decrease the area of the specific electoral district in order to match it to a common variable. A variable often used in the transformation of electoral maps is population size. Thus, in a completed cartogram, the size of the electoral districts is not the actual land mass of the electoral district, but is proportional to the population of the electoral district (sometimes the number of voters, or the size of the electorate is used instead of population). It’s no surprise, then, that cartograms are also called “value-by-area” maps.
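To give a rough sense of how much distortion this involves, the sketch below compares each district's share of land area with its share of population. The three districts and their numbers are invented for illustration. It only shows the target that a cartogram works toward, not the algorithm that any particular cartogram package uses to get there.

```r
# Hypothetical districts: land area (km^2) and population; values are illustrative
districts <- data.frame(
  name = c("Urban", "Suburban", "Rural"),
  area = c(50, 500, 50000),
  pop  = c(110000, 100000, 90000)
)

# Share of total land area versus share of total population
area_share <- districts$area / sum(districts$area)
pop_share  <- districts$pop / sum(districts$pop)

# Target area for each district if the map's total area is held fixed
districts$target_area <- pop_share * sum(districts$area)

# Factor by which each district must grow (> 1) or shrink (< 1)
districts$scale_factor <- pop_share / area_share

print(districts)
```

In this toy example the urban district has to grow several hundredfold while the rural district shrinks to under a third of its original area, even though their populations are similar.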

Cartograms are used by geographers and social scientists to depict a wide variety of phenomena. Here are some examples. The first one is a global cartogram for which the size of the area in each country is equivalent to total public health spending by that country. We can easily see that most of the world’s spending on public health occurs in the rich countries of the global north.

Here’s one more, depicting the global share of organic agriculture, by country.

Below, I have created a cartogram that has transformed the standard electoral map of the 2019 Canadian federal election into one in which the size of the electoral districts is mostly proportional to their populations. By “mostly” I mean that they’re not perfectly proportional: the difference in sizes between the largest and smallest districts is so large that the algorithm eventually stabilizes without creating completely equal-sized electoral districts.

This map more accurately conveys the nature of political partisan support (at least as it relates to the winning of electoral districts) across the country during the 2019 election, and provides visual evidence for the reality of an election in which the Liberal Party (red) won a plurality of the seats in the federal parliament (House of Commons). Because urban districts are much smaller than rural districts, the strength of Liberal Party support in Canada’s two largest cities–Toronto and Montreal–is obfuscated by the traditional area-based electoral map, but becomes evident in this cartogram.

The next map in this series will analyze another approach to geographically-based electoral maps–the hexagon map.

Here’s the R code for the cartogram above. Here, the original R-spatial data object–can_sf–is the base for the calculation of the cartogram data.

## Here is the code to generate the cartogram object:

library(cartogram)
can_carto_sf = cartogram_cont(can_sf, "Population_2016", itermax=50)

## Now, the map, using ggplot2
library(ggplot2)

gg.can.can.carto <- ggplot(data = can_carto_sf) +
  geom_sf(aes(fill = partywinner_2019), col="black", lwd=0.075) + 
  scale_fill_manual(values=c("#33B2CC","#1A4782","#3D9B35","#D71920","#F37021","#2B2D2F"),name ="Party (2019)") +
  labs(title = "Cartogram of Canadian Federal Election Results \u2013 October 2019",
       subtitle = "(by Political Party and Electoral District)") +
  theme_void() + 
  theme(legend.title=element_blank(),
        legend.text = element_text(size = 16),
        plot.title = element_text(hjust = 0.5, size=20, vjust=2, face="bold"),
        plot.subtitle = element_text(hjust=0.5, size=18, vjust=2, face="bold"),
        legend.position = "bottom",
        plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"),
        legend.box.margin = margin(0,0,30,0),
        legend.key.size = unit(0.75, "cm"),
        panel.border = element_rect(colour = "black", fill=NA, size=1.5))
