geom_bar() – Clouds, Clocks, and Sitting at Tables

A common issue when trying to plot numerical data is the problem of outliers. When working with data the term outliers is often used in the statistical sense, referring to data certain data values that are “far way” from the rest of the data (in statistics, this usually means data values that are a number of standard deviations away from the rest of the data). This can be especially problematic when using common bar plots, especially when the minimum and maximum values are so far apart that it leads to difficulty representing all of the values visually.

For an example of this in real life, let’s have go back to our British Columbia provincial electoral map data. As I demonstrated in my first data visualization, area-based (rather than population-, or voter-based) maps are often misleading. The primary reason for this is that the electoral districts are not nearly the same size and don’t have the same numbers of residents. In British Columbia, a large province, (almost one million square kilometres in area) this is not a surprise, especially because of the manner in which the relatively small population (just over five million) is haphazardly-dispersed across the province.

We can easily calculate the population density of each of BC’s 87 provincial electoral districts, using data about district population size and calculating the area of each district from geographic we used to create the maps in the first data visualization post.

Here is a summary of the data (the variable is Pop.Den.km2):

(s1<-summary(bc_final_final$Pop.Den.km2))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    0.101     9.402   355.269  1587.483  2375.926 12616.797

The “Min.” and “Max.” are the minimum, and maximum value, respectively, of the population density (persons per square kilometre) of BC’s 87 provincial electoral districts. We see a dramatic difference between the maximum and minimum values. In fact,

paste("The most densely-populated district is ", round(s1[6]/s1[1],0), "times as dense as the least densely-populated district.")

[1] "The most densely-populated district is 124551 times as dense as the least densely-populated district."

That is astounding, and if one were to simply plot these values on a bar chart, one would immediately recognize the difficulty with representing these data accurately. Let’s use a horizontal bar chart to demonstrate:

Here, we see that the larger numbers and so large, and the smaller numbers so comparatively small, that the lowest two dozen, or so, districts do not even seem to register. (When I first plotted this, I thought that I had made some sort of mistake and that the values at the bottom were missing. It turns out that the value represented by a single pixel was larger than the values of the districts at the bottom of the bar plot.)

This is obviously an issue–we don’t want to lose valuable information. There are alternative plots we could use, but we want to keep the information (political party) embodied in the various colours of the bar plot, so we’d like to find a bar plot solution. We’ll describe and assess two potential solutions in the next post in the series.

	rogercaiazza on Data Visualization # 12—Using…
	Kevin on Stephen Harper says voting is…
	braiden24 on ‘Game-changing’ ne…
	braiden24 on Canadian Minister Aglukkaq…
	bodabodame on US Midterm Election Results an…

Tag: geom_bar()

Data Visualization #4–Bar plots with widely-dispersed data