Chapter 3 Describing Data using plots

3.1 Seminar

In the last seminar, we looked at getting to know our data with simple descriptors of central tendency and dispersion. We then started to see how plots, through visualisation, can take data description further. This week, we continue talking about plots and get to grips with how to create them in R using ggplot().

3.1.1 Loading Dataset in CSV Format

In this seminar, we load a file in comma-separated values format (.csv). The load() function from last week works only for the native R file format. To load our CSV file, we use the read.csv() function.

Our dataset is the same one we used in the first week, containing data about London Boroughs:

#Load a dataset in a csv file
pop <- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/census-historic-population-borough.csv")
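You may wonder where the dots in names such as Area.Code come from. By default, read.csv() converts column headers into syntactically valid R names. The sketch below uses a tiny inline CSV to show this; the header spellings in it are made up for illustration and are not taken from the real file.

```r
# A tiny CSV held in a string; the header spellings are hypothetical.
csv_text <- "Area Code,Area Name,Persons-1801
00AA,City of London,129000
00AB,Barking and Dagenham,3000"

# text = lets read.csv() parse a string instead of a file or URL.
toy <- read.csv(text = csv_text)

# Spaces and hyphens in the headers have been replaced with dots.
names(toy)

# check.names = FALSE keeps the headers exactly as written.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
raw_names
```

Names containing spaces have to be wrapped in backticks whenever you use them, which is why the default conversion is usually the more convenient choice.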

Go ahead and (1) check the dimensions of pop, (2) list the names of its variables, and (3) print the first six rows of the dataset.

# the dimensions: rows (observations) and columns (variables) 
dim(pop)

[1] 33 24

# the variable names
names(pop)

 [1] "Area.Code"    "Area.Name"    "Persons.1801" "Persons.1811" "Persons.1821"
 [6] "Persons.1831" "Persons.1841" "Persons.1851" "Persons.1861" "Persons.1871"
[11] "Persons.1881" "Persons.1891" "Persons.1901" "Persons.1911" "Persons.1921"
[16] "Persons.1931" "Persons.1939" "Persons.1951" "Persons.1961" "Persons.1971"
[21] "Persons.1981" "Persons.1991" "Persons.2001" "Persons.2011"

# top 6 rows of the data
head(pop)

  Area.Code            Area.Name Persons.1801 Persons.1811 Persons.1821
1      00AA       City of London       129000       121000       125000
2      00AB Barking and Dagenham         3000         4000         5000
3      00AC               Barnet         8000         9000        11000
4      00AD               Bexley         5000         6000         7000
5      00AE                Brent         2000         2000         3000
6      00AF              Bromley         8000         9000        11000
  Persons.1831 Persons.1841 Persons.1851 Persons.1861 Persons.1871 Persons.1881
1       123000       124000       128000       112000        75000        51000
2         6000         7000         8000         8000        10000        13000
3        13000        14000        15000        20000        29000        41000
4         9000        11000        12000        15000        22000        29000
5         3000         5000         5000         6000        19000        31000
6        12000        14000        16000        22000        42000        62000
  Persons.1891 Persons.1901 Persons.1911 Persons.1921 Persons.1931 Persons.1939
1        38000        27000        20000        14000        11000         9000
2        19000        27000        39000        44000       138000       184000
3        58000        76000       118000       147000       231000       296000
4        37000        54000        60000        76000        95000       179000
5        65000       120000       166000       184000       251000       310000
6        83000       100000       116000       127000       165000       237000
  Persons.1951 Persons.1961 Persons.1971 Persons.1981 Persons.1991 Persons.2001
1         5000         4767         4000         5864         4230         7181
2       189000       177092       161000       149786       140728       163944
3       320000       318373       307000       293436       284106       314565
4       205000       209893       217000       215233       211404       218301
5       311000       295893       281000       253275       227903       263466
6       268000       294440       305000       296539       282920       295535
  Persons.2011
1         7375
2       185911
3       356386
4       231997
5       311215
6       309392

3.1.2 Plotting data with R

Tools to create high quality plots have become one of R’s greatest assets. This is a relatively recent development since the software has traditionally been focused on the statistics rather than visualisation. The standard installation of R has base graphics functionality built in to produce very simple plots. For example, we can plot the relationship between the populations of the London Boroughs in 1811 and 1911:

# left of the comma is the x-axis, right is the y-axis. Also note how we are using the $ operator to select the columns of the data frame we want.

plot(pop$Persons.1811,pop$Persons.1911)

You should see a very simple scatter graph.
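As a small sketch of the same idea, the code below rebuilds the first three boroughs from the head(pop) output by hand, checks that $ returns the same column as the [[ ]] operator, and adds readable axis labels to the base plot. The object name toy and the label wording are our own choices.

```r
# First three boroughs from the head(pop) output above.
toy <- data.frame(
  Persons.1811 = c(121000, 4000, 9000),
  Persons.1911 = c(20000, 39000, 118000)
)

# $ and [[ ]] both pull out a single column as a plain vector.
same_column <- identical(toy$Persons.1811, toy[["Persons.1811"]])

# xlab, ylab and main replace the default labels with readable ones.
plot(
  toy$Persons.1811, toy$Persons.1911,
  xlab = "Population in 1811",
  ylab = "Population in 1911",
  main = "Three London Boroughs, 1811 vs 1911"
)
```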

3.1.3 ggplot2

A different method of creating plots in R requires the ggplot2 package, from the tidyverse. There are many hundreds of packages in R each designed for a specific purpose. These are not installed automatically, so each one has to be downloaded and then we need to tell R to use it.

We need to use ggplot2, but we will also want to use some functions from other parts of the tidyverse (such as the filter() function you have used before). So, we need to make sure tidyverse is installed:

#When you hit enter R will ask you to select a mirror to download the package contents from. It doesn't really matter which one you choose, I tend to pick the UK based ones.

install.packages("tidyverse")

The install.packages step only needs to be performed once. You don’t need to install a package every time you want to use it. However, each time you open R and wish to use a package you need to use the library() command to tell R that it will be required.

library("tidyverse")
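If you share a script with others, one common pattern is to install a package only when it is missing. The helper below is a hypothetical convenience function of our own (it is not part of R or the tidyverse), built on requireNamespace() and install.packages().

```r
# Hypothetical helper: install a package only when it is missing.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is now available.
  requireNamespace(pkg, quietly = TRUE)
}

# Demonstrated on "stats", which ships with R, so nothing is downloaded here.
# In the seminar you would call install_if_missing("tidyverse") instead.
available <- install_if_missing("stats")
available
```

You still need library() afterwards: the helper only makes sure the package is installed, it does not attach it.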

ggplot2 is an implementation of the ‘Grammar of Graphics’ (Wilkinson 2005) - a general scheme for data visualisation that breaks up graphs into semantic components such as scales and layers. ggplot2 can serve as a replacement for the base graphics in R and contains a number of default options that match good visualisation practice. This is an increasingly popular way to visualise data in R, because it is both more flexible and more powerful than the base plot approach.

While the instructions below take you through the approach step-by-step, you are encouraged to deviate from them (trying different colours for example) to get a better understanding of what we are doing. For further help, ggplot2 is one of the best documented packages in R and has an extensive website. Good examples of graphs can also be found on the R Cookbook website. We’ll name our graphs using ‘gg’ plus some word indicating their content, but remember that these names are arbitrary.

gg_pops <- ggplot(data = pop, mapping = aes(Persons.1811, Persons.1911))

What you have just done is set up a ggplot object in which you say where the input data should come from (the data = argument; in this case, the pop object). The column names within the aes() brackets refer to the parts of that data frame you wish to use (the variables Persons.1811 and Persons.1911), specified by the mapping = argument. aes is short for ‘aesthetics that vary’, which is a complicated way of saying the data variables used in the plot. In practice, these arguments are used so frequently that it is quite rare to see data = and mapping = typed out like this. As so many people use ggplot, and essentially every ggplot starts with a first line like this, the code gg_pops <- ggplot(pop, aes(Persons.1811, Persons.1911)) is perfectly clear and legible to most R users.
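To convince yourself that the long and short forms really are the same, here is a minimal sketch on three rows copied from the head(pop) output. It loads ggplot2 directly rather than the whole tidyverse, and it uses layer_data(), a ggplot2 helper that returns the values a layer would draw. The object names are our own.

```r
library(ggplot2)

# First three boroughs from the head(pop) output above.
toy <- data.frame(
  Persons.1811 = c(121000, 4000, 9000),
  Persons.1911 = c(20000, 39000, 118000)
)

# Fully spelled-out arguments.
long_form <- ggplot(data = toy, mapping = aes(x = Persons.1811, y = Persons.1911))

# Positional shorthand: data first, then aes() with x and y in that order.
short_form <- ggplot(toy, aes(Persons.1811, Persons.1911))

# layer_data() needs at least one layer, so we add points to both.
long_xy <- layer_data(long_form + geom_point())[, c("x", "y")]
short_xy <- layer_data(short_form + geom_point())[, c("x", "y")]

# Both forms place the points at the same coordinates.
identical(long_xy, short_xy)
```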

If you just type gg_pops and hit enter, R will not plot any data, just axes. This is because you have not told ggplot what you want to do with the data. We do this by adding so-called ‘geoms’, in this case geom_point(), to create a scatter plot.

gg_pops + geom_point()

You can already see that this plot is looking a bit nicer than the one we created with the base plot() function used above. Within the geom_point() brackets you can alter the appearance of the points in the plot. Try something like gg_pops + geom_point(colour = "red", size = 2) and also experiment with your own colours/sizes. If you want to colour the points according to another variable, you can do this by adding the desired variable to an aes() section inside geom_point(). Here we will indicate the size of the population in 2011 as well as the relationship between the size of the population in 1811 and 1911.

gg_pops + geom_point(aes(colour = Persons.2011), size = 2)

You will notice that ggplot has also created a key that shows the values associated with each colour. In this slightly contrived example it is also possible to resize each of the points according to the Persons.2011 variable.

gg_pops + geom_point(aes(size = Persons.2011))
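The difference between setting an appearance (outside aes()) and mapping it to a variable (inside aes()) is easy to mix up. The sketch below, on six rows copied from the head(pop) output, inspects the colours ggplot2 would actually draw in each case. It assumes ggplot2 is installed; the object names are our own.

```r
library(ggplot2)

# Six boroughs copied from the head(pop) output above.
toy <- data.frame(
  Persons.1811 = c(121000, 4000, 9000, 6000, 2000, 9000),
  Persons.1911 = c(20000, 39000, 118000, 60000, 166000, 116000),
  Persons.2011 = c(7375, 185911, 356386, 231997, 311215, 309392)
)

base_plot <- ggplot(toy, aes(Persons.1811, Persons.1911))

# Setting: outside aes(), every point gets the same colour and size.
fixed <- base_plot + geom_point(colour = "red", size = 2)

# Mapping: inside aes(), colour is computed from the data for each row.
mapped <- base_plot + geom_point(aes(colour = Persons.2011), size = 2)

# layer_data() shows the colours that would be drawn.
fixed_colours <- unique(layer_data(fixed)$colour)
mapped_colours <- unique(layer_data(mapped)$colour)

fixed_colours
mapped_colours
```

Only the mapped version produces a key, because only there does colour carry information about the data.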

The real power of ggplot2 lies in its ability to build a plot up as a series of layers. This is done by stringing plot functions (geoms) together with the + sign. In this case we can add a text layer to the plot using geom_text().

gg_pops + geom_point(aes(size = Persons.2011)) + geom_text(size = 2, 
   colour = "red", aes(label = Area.Name))

This idea of layers (or geoms) is quite different from the standard plot functions in R, but you will find that each of the functions does a lot of clever stuff to make plotting much easier (see the ggplot2 documentation for a full list). The above code adds London Borough labels to the plot over the points they correspond to. This isn’t perfect since many of the labels overlap, but they serve as a useful illustration of the layers. To make things a little easier, the plot can be saved as a PDF using the ggsave() command. When saving, the plot can be enlarged to help make the labels more legible.

ggsave("first.ggplot.pdf", scale=2)

ggsave() only works with plots that were created with ggplot, and by default it saves the most recent plot you displayed. Within the brackets you should give a file name for the plot. This needs to include the file format (in this case .pdf; you could also save the plot as a .jpg file). The file will be saved to your working directory (or, in RStudio Cloud, your project). The scale argument controls how many times bigger you want the exported plot to be than it currently is in the plot window. Once executed, you should be able to see a PDF file in your working directory.
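If you would rather not depend on whichever plot happened to be displayed last, you can hand ggsave() a stored plot through its plot = argument and give the size explicitly. The sketch below is one way to do this; the file name, the temporary folder and the dimensions are arbitrary choices for illustration.

```r
library(ggplot2)

# First three boroughs from the head(pop) output above.
toy <- data.frame(
  Persons.1811 = c(121000, 4000, 9000),
  Persons.1911 = c(20000, 39000, 118000)
)

# Store the finished plot in an object so it can be passed to ggsave().
toy_plot <- ggplot(toy, aes(Persons.1811, Persons.1911)) + geom_point()

# Write to a temporary folder so the example leaves your project untouched.
out_file <- file.path(tempdir(), "toy_scatter.pdf")

# width and height are in inches here; scale = 2 doubles both.
ggsave(out_file, plot = toy_plot, width = 6, height = 4, units = "in", scale = 2)

file.exists(out_file)
```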

3.1.4 Histograms

For the rest of this tutorial we will change our dataset to one containing the number of assault incidents that ambulances have been called to in London between 2009 and 2011. It is in the same data format (CSV) as our London population file so we use the read.csv() command.

#read in the ambulance_assault datafile

assaults <- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/ambulance_assault.csv")

#Check that the data have been loaded in correctly by viewing the top 6 rows with the head() command.

head(assaults)

  Bor_Code     WardName WardCode assault_09_11
1     00AA   Aldersgate   00AAFA            10
2     00AA      Aldgate   00AAFB             0
3     00AA    Bassishaw   00AAFC             0
4     00AA Billingsgate   00AAFD             0
5     00AA  Bishopsgate   00AAFE           188
6     00AA Bread Street   00AAFF             0

#To get a sense of how large the data frame is, look at how many rows you have

nrow(assaults)

[1] 649

You will notice that the data table has 4 columns and 649 rows. The column headings are abbreviations of the following:

Bor_Code: Borough Code. London has 32 Boroughs (such as Camden, Islington, Westminster etc) plus the City of London at the centre. These codes are used as a quick way of referring to them from official data sources.
WardName: Boroughs can be broken into much smaller areas known as Wards. These are electoral districts and have existed in London for centuries.
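WardCode: A statistical code for the Wards above.
WardType: A classification that groups wards based on similar characteristics. This column does not appear in the version of the file loaded above, which has only the four columns shown by head().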
assault_09_11: The number of assault incidents requiring an ambulance between 2009 and 2011 for each Ward.

Through plotting we can provide graphical representations of the data to support the statistics above. A frequency distribution plot in the form of a histogram could be informative here. This can be done very easily in ggplot.

gg_assaults <- ggplot(assaults, aes(x = assault_09_11))

The ggplot(assaults, aes(x = assault_09_11)) section means: create a generic plot object (which we call gg_assaults) from the assaults object, using the assault_09_11 column as the data for the x-axis. Remember that data variables are supplied as aesthetic parameters, so assault_09_11 appears in the aes() brackets.

Histograms provide a nice way of graphically summarising a dataset. To create the histogram you need to add the relevant ggplot2 command (geom).

gg_assaults + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The height of each bar (the y-axis) shows the count of the datapoints, and the width of each bar is the range of values included. If you want the bars to be thinner (to represent a narrower range of values and capture some more of the variation in the distribution) you can adjust the binwidth. Binwidth controls the size of ‘bins’ that the data are split up into. We will discuss this in more detail later in the course, but put simply, the bigger the bin (larger binwidth) the more data it can hold. Try:

gg_assaults + geom_histogram(binwidth = 10)
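To see what a binwidth does to the counts, here is a rough base R sketch. The twelve values are invented ward counts, and bin_counts() is a hypothetical helper of our own. It illustrates the idea of binning only; ggplot2 positions its bin edges slightly differently by default.

```r
# Twelve made-up ward counts, including one very large value.
x <- c(0, 0, 3, 8, 12, 15, 27, 41, 44, 58, 95, 188)

# Hypothetical helper: count how many values fall into bins of a given width.
bin_counts <- function(values, binwidth) {
  # Round the top edge up so that the largest value falls inside the last bin.
  upper <- ceiling((max(values) + 1) / binwidth) * binwidth
  breaks <- seq(0, upper, by = binwidth)
  # right = FALSE makes each bin include its lower edge but not its upper edge.
  table(cut(values, breaks = breaks, right = FALSE))
}

narrow <- bin_counts(x, 10)  # many small bins
wide <- bin_counts(x, 50)    # a few large bins

narrow
wide
```

With a binwidth of 10 the values are spread over 19 bins, most of them nearly empty. With a binwidth of 50 there are only 4 bins, and the first one holds 9 of the 12 values.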

You can also overlay a density distribution over the top of the histogram. This will be discussed in more detail later in the term, but think of the plotted line as a summary of the underlying histogram. For this we need to produce a second plot object that says we wish to use the density distribution as the y variable.

#after_stat(density) is the current spelling of the older ..density.. notation
gg_assaults_dens <- ggplot(assaults, aes(x = assault_09_11, y = after_stat(density)))

gg_assaults_dens + geom_histogram() + geom_density(fill = NA, colour = "red")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
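Putting density rather than count on the y-axis rescales the bars so that their areas add up to 1, which is what lets the density line sit on the same scale. A small base R check of this, on invented values and with an arbitrary bin width of 20:

```r
# Twelve made-up ward counts.
x <- c(0, 0, 3, 8, 12, 15, 27, 41, 44, 58, 95, 188)

# plot = FALSE returns the histogram's numbers without drawing it.
h <- hist(x, breaks = seq(0, 200, by = 20), plot = FALSE)

# Area of each bar = height (density) times width; the areas sum to 1.
bar_area <- sum(h$density * diff(h$breaks))
bar_area

# The raw counts, by contrast, sum to the number of observations.
sum(h$counts)
```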

This plot has provided a good impression of the overall distribution, but it would be interesting to see characteristics of the data within each of the Boroughs. We can do this since each Borough in the assaults object is made up of multiple wards. To see what I mean, we can select all the wards that fall within the Borough of Camden, which has the code 00AG (if you want to see which Borough each code corresponds to, and learn a little more about the statistical geography of England and Wales, then see here).

camden <- filter(assaults, Bor_Code == "00AG")

We are subsetting the assaults object, but instead of telling R what column names or numbers we require, we are requesting all rows whose value in the Bor_Code column is 00AG. 00AG is a text string, so it needs to go in speech marks “”, and we need to use two equals signs == in R to mean “is equal to”. A single equals sign = is another way of assigning objects (it works the same way as <- but is much less widely used for this purpose because it is also used when parameterising functions/assigning arguments).

What we are doing here, then, is telling filter() the dataset we want to filter, and then the variable we want it to filter by, and the particular value of that variable we want it to keep. The camden object therefore only includes the wards that fall within Camden, along with their assault counts, as indicated by the borough code.
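Here is a small sketch of what the == test does behind the scenes, on a made-up data frame (the counts and the 00AH rows are invented). It also shows a base R equivalent of filter() for comparison. It assumes the dplyr package, which is part of the tidyverse.

```r
library(dplyr)

# Made-up wards in three boroughs; the counts are hypothetical.
toy <- data.frame(
  Bor_Code = c("00AA", "00AA", "00AG", "00AG", "00AG", "00AH"),
  assault_09_11 = c(10, 188, 42, 97, 65, 30)
)

# == compares every row with "00AG" and returns TRUE or FALSE for each.
is_camden <- toy$Bor_Code == "00AG"
is_camden

# filter() keeps the rows where the test is TRUE.
camden_toy <- filter(toy, Bor_Code == "00AG")

# The same subset using base R square brackets.
camden_base <- toy[is_camden, ]

nrow(camden_toy)
```

Writing Bor_Code = "00AG" with a single equals sign inside filter() stops with an error, because = is read as naming an argument rather than as a comparison.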

So to produce Camden’s frequency distribution the code above needs to be replicated using the camden object in the place of assaults.

gg_camden <- ggplot(camden, aes(x = assault_09_11))

gg_camden + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#We can also add a title using the ggtitle() option

gg_camden + geom_histogram() + ggtitle("Camden Assaults")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see, this looks a little different from the distribution of the entire dataset. This is largely because we have relatively few rows of data in the camden object (use nrow(camden) to find out just how many). Nevertheless it would be interesting to see the data distributions for each of the London Boroughs. It is a chance to use the facet_wrap() function in R. This brilliant function lets you create a whole load of graphs at once.

#note that we are back to using the gg_assaults ggplot object since we need all our data for this. This code may generate a large number of warning messages relating to the plot binwidth, don't worry about them.

gg_assaults + geom_histogram() + facet_wrap(~ Bor_Code)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The facet_wrap() part of the code simply needs the name of the column you would like to use to subset the data into individual plots. Before the column name a tilde ~ is used as shorthand for “by”, so using the function we are asking R to facet the assaults data into lots of smaller plots based on the Bor_Code column.
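If the tilde looks odd to you, ggplot2 also accepts the column wrapped in vars(). The sketch below, on invented data for three boroughs, builds the same faceted histogram both ways and counts the panels each one produces. The binwidth of 50 and the object names are arbitrary choices.

```r
library(ggplot2)

# Made-up wards: four per borough, with hypothetical counts.
toy <- data.frame(
  Bor_Code = rep(c("00AA", "00AG", "00AH"), each = 4),
  assault_09_11 = c(10, 0, 188, 0, 42, 97, 65, 120, 30, 55, 12, 80)
)

base_hist <- ggplot(toy, aes(x = assault_09_11)) + geom_histogram(binwidth = 50)

# Formula style: read the tilde as "by".
by_formula <- base_hist + facet_wrap(~ Bor_Code)

# vars() style: the same request written as a function call.
by_vars <- base_hist + facet_wrap(vars(Bor_Code))

# layer_data() records which panel each bar belongs to.
panels_formula <- length(unique(layer_data(by_formula)$PANEL))
panels_vars <- length(unique(layer_data(by_vars)$PANEL))

c(panels_formula, panels_vars)
```

Both versions give one panel per borough code, so three panels for this toy data.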

3.1.5 Box and Whisker plots

In addition to histograms, a type of plot that shows the core characteristics of the distribution of values within a dataset, and includes some of the summary() information we generated earlier, is a box and whisker plot (boxplot for short). These too can be easily produced in R.

The diagram below illustrates the components of a box and whisker plot.

We can create a third plot object for this from the assaults object:

#note that the `assault_09_11` column is now y and not x and that we have specified x=1. This aligns the plot to the x-axis (any single number would work)

gg_box <- ggplot(assaults, aes(x = 1, y = assault_09_11))

And then convert it to a boxplot using the geom_boxplot() command.

gg_box + geom_boxplot()
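To connect the picture to numbers, the sketch below works out the pieces of a boxplot by hand for nine invented values, using the common rule that points more than 1.5 times the interquartile range beyond the box are drawn as outliers. We believe this matches the geom_boxplot() defaults, but treat it as an illustration rather than a description of ggplot2's internals.

```r
# Nine made-up ward counts, including one unusually large value.
x <- c(0, 2, 5, 7, 9, 12, 15, 20, 95)

# The box: lower quartile, median and upper quartile.
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)

# Interquartile range: the height of the box.
iqr <- quartiles[3] - quartiles[1]

# Fences: anything beyond these counts as an outlier.
lower_fence <- quartiles[1] - 1.5 * iqr
upper_fence <- quartiles[3] + 1.5 * iqr
is_outlier <- x < lower_fence | x > upper_fence

# The whiskers reach to the most extreme values that are not outliers.
whiskers <- range(x[!is_outlier])
outliers <- x[is_outlier]

quartiles
whiskers
outliers
```

For these values the box runs from 5 to 15 with the median at 9, the whiskers reach from 0 to 20, and 95 is plotted as a separate outlier point.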

If we are just interested in Camden then we can use the camden object created above in the code.

gg_camden_box <- ggplot(camden, aes(x = 1, y = assault_09_11))
gg_camden_box + geom_boxplot()

#If you prefer you can flip the plot 90 degrees so that it reads from left to right.

gg_camden_box + geom_boxplot() + coord_flip()

You can see that Camden looks a little different from the boxplot of the entire dataset. It would therefore be useful to compare the distributions of data within each of the Boroughs in a single plot, as we did with the frequency distributions above. ggplot makes this very easy: we just need to change the x = parameter to the Borough code column (Bor_Code).

gg_compare <- ggplot(assaults, aes(x = Bor_Code, y = assault_09_11))

gg_compare + geom_boxplot() + coord_flip()
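When you compare many boxes side by side, it can help to have the medians as numbers too. A minimal base R sketch using tapply() on invented data for three boroughs (the counts are hypothetical):

```r
# Made-up wards: four per borough, with hypothetical counts.
toy <- data.frame(
  Bor_Code = rep(c("00AA", "00AG", "00AH"), each = 4),
  assault_09_11 = c(10, 0, 188, 0, 42, 97, 65, 120, 30, 55, 12, 80)
)

# tapply() splits the counts by borough and applies median() to each group.
medians <- tapply(toy$assault_09_11, toy$Bor_Code, median)

# Sort so the borough with the highest median comes first.
sort(medians, decreasing = TRUE)
```

Each median is the value marked by the thick line inside the corresponding box.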

3.1.6 Exercises

Load the ambulance_assault dataset.
Use the facet_wrap() help file to learn how to create the plots with facets by Borough, this time with the graphs arranged into 4 columns.
Add a title to the graphs using the ggtitle() layer. If you need help, try ?ggtitle.
Save your graphs using the ggsave() function.
Now, using the census-historic-population-borough.csv dataset used to produce the scatter plots of London’s population, create boxplots for the years 1801 to 1851. You will have to subset your data for these years and create a new object that can be used to plot.
What do you observe? Describe your findings.