Chapter 3 Describing Data using plots
3.1 Seminar
In the last seminar, we looked at getting to know our data with simple descriptors of central tendency and dispersion. In the lecture videos, we introduced plots and showed that, through visualisation, you can take data description further. This week, we continue talking about plots and get to grips with how to create them in R using ggplot().
3.1.1 Loading Dataset in CSV Format
In this seminar, we load a file in comma-separated format (.csv). The load() function from last week works only for the native R file format. To load our csv file, we use the read.csv() function.
Our dataset contains data about London Boroughs:
# load a dataset in a csv file
pop <- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/census-historic-population-borough.csv")
Go ahead and (1) check the dimensions of pop, (2) list the names of the variables in the dataset, and (3) print the first six rows of the dataset.
# the dimensions: rows (observations) and columns (variables)
dim(pop)
[1] 33 24
# the variable names
names(pop)
[1] "Area.Code" "Area.Name" "Persons.1801" "Persons.1811" "Persons.1821"
[6] "Persons.1831" "Persons.1841" "Persons.1851" "Persons.1861" "Persons.1871"
[11] "Persons.1881" "Persons.1891" "Persons.1901" "Persons.1911" "Persons.1921"
[16] "Persons.1931" "Persons.1939" "Persons.1951" "Persons.1961" "Persons.1971"
[21] "Persons.1981" "Persons.1991" "Persons.2001" "Persons.2011"
# top 6 rows of the data
head(pop)
Area.Code Area.Name Persons.1801 Persons.1811 Persons.1821
1 00AA City of London 129000 121000 125000
2 00AB Barking and Dagenham 3000 4000 5000
3 00AC Barnet 8000 9000 11000
4 00AD Bexley 5000 6000 7000
5 00AE Brent 2000 2000 3000
6 00AF Bromley 8000 9000 11000
Persons.1831 Persons.1841 Persons.1851 Persons.1861 Persons.1871 Persons.1881
1 123000 124000 128000 112000 75000 51000
2 6000 7000 8000 8000 10000 13000
3 13000 14000 15000 20000 29000 41000
4 9000 11000 12000 15000 22000 29000
5 3000 5000 5000 6000 19000 31000
6 12000 14000 16000 22000 42000 62000
Persons.1891 Persons.1901 Persons.1911 Persons.1921 Persons.1931 Persons.1939
1 38000 27000 20000 14000 11000 9000
2 19000 27000 39000 44000 138000 184000
3 58000 76000 118000 147000 231000 296000
4 37000 54000 60000 76000 95000 179000
5 65000 120000 166000 184000 251000 310000
6 83000 100000 116000 127000 165000 237000
Persons.1951 Persons.1961 Persons.1971 Persons.1981 Persons.1991 Persons.2001
1 5000 4767 4000 5864 4230 7181
2 189000 177092 161000 149786 140728 163944
3 320000 318373 307000 293436 284106 314565
4 205000 209893 217000 215233 211404 218301
5 311000 295893 281000 253275 227903 263466
6 268000 294440 305000 296539 282920 295535
Persons.2011
1 7375
2 185911
3 356386
4 231997
5 311215
6 309392
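The same load-and-inspect routine can be tried without an internet connection. The sketch below writes a small data frame to a temporary csv file and reads it back in. The object names (toy, toy_loaded, csv_path) and all values are invented for illustration and are not part of the London dataset.

```r
# a made-up data frame standing in for a real dataset (values are invented)
toy <- data.frame(
  area  = c("A", "B", "C"),
  y1801 = c(3000, 8000, 5000),
  y1811 = c(4000, 9000, 6000)
)

# write it to a temporary csv file, then read it back with read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_loaded <- read.csv(csv_path)

dim(toy_loaded)    # number of rows, then number of columns
names(toy_loaded)  # the variable names
head(toy_loaded)   # the first six rows (here there are only three)
```

read.csv() accepts either a local file path, as here, or a URL, as in the seminar code.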
3.1.2 Plotting data with R
Tools to create high quality plots have become one of R’s greatest assets. The standard installation of R has graphic functionality built in to produce very simple plots. For example, we can plot the relationship between the populations of the London boroughs in 1811 and 1911:
# left of the comma is the x-axis, right is the y-axis
# note how we use the $ operator to select the columns of the data frame we want
plot(pop$Persons.1811, pop$Persons.1911)
You should see a very simple scatter graph.
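Base plot() also accepts arguments for axis labels and a title. The following is a small sketch that uses the cars dataset that ships with R, rather than pop, so that it runs on its own. The label text is our own choice.

```r
# cars is a built-in data frame with two columns: speed and dist
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Base R scatter plot")
```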
3.1.3 ggplot2
A different, more flexible method of creating plots in R requires the ggplot2 package, from the tidyverse (more on this next week). There are thousands of packages for R, each designed for a specific purpose. These are not installed automatically, so each one has to be downloaded and then we need to tell R to use it. We have already installed ggplot2 for you, so you just need to call it into your working environment using the library() command, which tells R that it will be required.
library("ggplot2")
ggplot2 is an implementation of the ‘Grammar of Graphics’ (Wilkinson 2005), a general scheme for data visualisation that breaks up graphs into semantic components such as scales and layers. ggplot2 can serve as a replacement for the base graphics in R and contains a number of default options that match good visualisation practice. This is an increasingly popular way to visualise data in R, because it is both more flexible and more powerful than the base plot() approach.
While the instructions below take you through the approach step-by-step, you are encouraged to deviate from them (trying different colours, for example) to get a better understanding of what we are doing. For further help, ggplot2 is one of the best documented packages in R and has an extensive website. Good examples of graphs can also be found on the R Cookbook website. We’ll name our graphs using ‘gg’ plus some word indicating their content, but remember that these names are arbitrary.
gg_pops <- ggplot(data = pop,
                  mapping = aes(x = Persons.1811, y = Persons.1911))
What you have just done is set up a ggplot object in which you say where you want the input data to come from (the data = argument); in this case it is the pop object. The column headings within the aes() brackets refer to the parts of that data frame you wish to use (the variables Persons.1811 on the x axis and Persons.1911 on the y axis), specified by the mapping = argument. aes is short for ‘aesthetics that vary’, which is a complicated way of saying the data variables used in the plot. In practice, these arguments are used so frequently that it is quite rare to see data = and mapping = typed out like this. As so many people use ggplot, and essentially every ggplot starts with a first line like this, the code gg_pops <- ggplot(pop, aes(Persons.1811, Persons.1911)) is perfectly clear and legible to most R users.
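To see that the long and short forms really are interchangeable, here is a minimal sketch using mtcars, a dataset that ships with R, in place of pop. The object names p_long and p_short are our own.

```r
library(ggplot2)

# mtcars ships with R; wt is car weight and mpg is fuel economy
p_long  <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))
p_short <- ggplot(mtcars, aes(wt, mpg))

# both objects produce the same scatter plot once a geom is added
p_long + geom_point()
p_short + geom_point()
```

The short form relies on argument order: data comes first, then the mapping, and within aes() the first two arguments are x and y.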
If you just type gg_pops and hit enter, R will not plot any data, just axes. This is because you have not told ggplot what you want to do with the data. We do this by adding so-called ‘geoms’, in this case geom_point(), to create a scatter plot.
gg_pops + geom_point()
You can already see that this plot is looking a bit nicer than the one we created with the base plot() function used above. Within the geom_point() brackets you can alter the appearance of the points in the plot. Try something like gg_pops + geom_point(colour = "red", size = 2) and also experiment with your own colours and sizes. If you want to colour the points according to another variable, you can do this by placing the desired variable inside an aes() call within geom_point(). Here we will indicate the size of the population in 2011 as well as the relationship between the size of the population in 1811 and 1911.
gg_pops + geom_point(aes(colour = Persons.2011))
You will notice that ggplot has also created a key that shows the values associated with each colour. In this slightly contrived example it is also possible to resize each of the points according to the Persons.2011 variable.
gg_pops + geom_point(aes(size = Persons.2011))
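The difference between setting an appearance (outside aes()) and mapping it to a variable (inside aes()) can be sketched as follows. The example uses the built-in mtcars dataset rather than pop so that it runs on its own, and the object names are invented for illustration.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# set: every point gets the same fixed colour and size
fixed <- base_plot + geom_point(colour = "red", size = 2)

# map: colour varies with the values of the hp column, and a key is added
mapped <- base_plot + geom_point(aes(colour = hp))

fixed
mapped
```

A quick rule of thumb: a column name belongs inside aes(), while a fixed value such as "red" belongs outside it.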
The real power of ggplot2 lies in its ability to build a plot up as a series of layers. This is done by stringing plot functions (geoms) together with the + sign. In this case we can add a text layer to the plot using geom_text().
gg_pops +
  geom_point(aes(size = Persons.2011)) +
  geom_text(size = 2,
            colour = "red",
            aes(label = Area.Name))
This idea of layers (or geoms) is quite different from the standard plot functions in R, but you will find that each of the functions does a lot of clever stuff to make plotting much easier (see the ggplot2 documentation for a full list). The above code adds London Borough labels to the plot over the points they correspond to. This isn’t perfect since many of the labels overlap, but they serve as a useful illustration of the layers. To make things a little easier, the plot can be saved as a PDF using the ggsave() command. When saving, the plot can be enlarged to help make the labels more legible.
ggsave("first.ggplot.pdf", scale = 2)
ggsave() only works with plots that were created with ggplot. Within the brackets you should create a file name for the plot. This needs to include the file format (in this case .pdf, but you could also save the plot as a .jpg file). The file will be saved to your working directory (or, in RStudio Cloud, your project). The scale argument controls how many times bigger you want the exported plot to be than it currently is in the plot window. Once executed you should be able to see a PDF file in your working directory.
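If no plot is named, ggsave() saves the last plot that was displayed. A slightly more explicit sketch passes the plot object and the output size directly. It uses the built-in mtcars dataset and a temporary file path, both chosen here only so the example is self-contained.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# a temporary file path is used here; in practice you would choose your own name
out_file <- tempfile(fileext = ".pdf")

# plot = names the plot to save; width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 4)

# check that the file was written
file.exists(out_file)
```

Naming the plot explicitly avoids saving the wrong figure when several plots have been drawn in a session.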
3.1.4 Histograms
For the rest of this tutorial we will change our dataset to one containing all of Donald Trump’s tweets from the 2016 election up to 2018. It is in Rds format so we load it using readRDS().
# read in the trump datafile
trump_twitter <- readRDS(url("https://github.com/QMUL-SPIR/Public_files/blob/master/datasets/trump_twitter.rds?raw=true"))
#Check that the data have been loaded in correctly by viewing the top 6 rows with the head() command.
head(trump_twitter)
text
1 Happy Birthday @DonaldJTrumpJr!\nhttps://t.co/uRxyCD3hBz
2 Russians are playing @CNN and @NBCNews for such fools - funny to watch, they don't have a clue! @FoxNews totally gets it!
3 Join @AmerIcan32, founded by Hall of Fame legend @JimBrownNFL32 on 1/19/2017 in Washington, D.C.… https://t.co/9WJZ8iTCQV
4 Great move on delay (by V. Putin) - I always knew he was very smart!
5 My Administration will follow two simple rules: https://t.co/ZWk0j4H8Qy
6 'Economists say Trump delivered hope' https://t.co/SjGBgglIuQ
created_at retweet_count favorite_count
1 2016-12-31 13:58:12 9529 55601
2 2016-12-30 17:18:18 23213 84254
3 2016-12-30 14:46:55 7366 25336
4 2016-12-30 14:41:33 34415 97669
5 2016-12-29 09:54:21 11330 45609
6 2016-12-28 17:06:28 13919 51857
#To get a sense of how large the data frame is, look at how many rows you have
nrow(trump_twitter)
[1] 2470
You will notice that the data table has 4 columns and 2470 rows. The variables are the text content of the tweet, when it was posted, how many times it was retweeted, and how many times it was favourited.
We could describe these data as we did last week by, for example, calculating the mean number of times Trump is retweeted and the standard deviation of the retweet counts. We could also look at this distribution visually, in the form of a histogram. This can be done very easily in ggplot.
gg_retweets <- ggplot(trump_twitter, aes(x = retweet_count))
The ggplot(trump_twitter, aes(x = retweet_count)) section means create a generic plot object (called gg_retweets) from the trump_twitter object using the retweet_count column as the data for the x axis. Remember the data variables are required as aesthetics parameters, so retweet_count appears in the aes() brackets.
To create the histogram you need to add the relevant geom:
gg_retweets + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The height of each bar shows the count of the datapoints and the width of each bar is the value range of datapoints included. If you want the bars to be thinner (to represent a narrower range of values and capture some more of the variation in the distribution) you can adjust the binwidth. Binwidth controls the size of ‘bins’ that the data are split up into. We will discuss this in more detail later in the course, but put simply, the bigger the bin (larger binwidth) the more data it can hold. Try:
gg_retweets + geom_histogram(binwidth = 100)
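The effect of binwidth can be checked by counting the bars that ggplot2 computes. The sketch below uses the built-in faithful dataset instead of the tweets so that it runs without a download. The object names and the two binwidth values are arbitrary choices for illustration.

```r
library(ggplot2)

# faithful ships with R; waiting is the time between eruptions in minutes
gg_wait <- ggplot(faithful, aes(x = waiting))

narrow <- gg_wait + geom_histogram(binwidth = 1)   # many thin bars
wide   <- gg_wait + geom_histogram(binwidth = 10)  # a few wide bars

# layer_data() returns the data ggplot2 computed for a layer: one row per bin
nrow(layer_data(narrow))
nrow(layer_data(wide))
```

Whatever the binwidth, every observation lands in exactly one bin, so the bar heights always add up to the number of rows in the data.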
You can also overlay a density distribution over the top of the histogram. This will be discussed in more detail later in the term, but think of the plotted line as a summary of the underlying histogram. For this we need to produce a second plot object that says we wish to use the density distribution as the y variable.
gg_trump_dens <- ggplot(trump_twitter, aes(x = retweet_count, y = ..density..))
gg_trump_dens + geom_histogram() + geom_density(fill = NA, colour = "red")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
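On the density scale, the bar heights are rescaled so that the total area of the bars is one, which is what lets the histogram and the density curve share a y axis. The sketch below checks this on the built-in faithful dataset. It uses after_stat(density), the spelling that recent ggplot2 versions (3.4.0 onwards) prefer over ..density..; the object names are our own.

```r
library(ggplot2)

# after_stat(density) plays the same role as ..density..
gg_dens <- ggplot(faithful, aes(x = waiting, y = after_stat(density))) +
  geom_histogram(binwidth = 5)

# one row per bar: density is the bar height, xmin and xmax are its edges
bars <- layer_data(gg_dens)

# total area of the bars (height times width, summed over all bars)
sum(bars$density * (bars$xmax - bars$xmin))
```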
3.1.5 Box and Whisker plots
In addition to histograms, a type of plot that shows the core characteristics of the distribution of values within a dataset is a box and whisker plot (boxplot for short). These too can be easily produced in R.
The diagram below illustrates the components of a box and whisker plot.
We can create a third plot object for this from the input object:
# note that the retweet_count column is now y and not x, and that we have specified x = 1
# x = 1 places the single box at one position on the x-axis (any single number would work)
gg_box <- ggplot(trump_twitter, aes(x = 1, y = retweet_count))
And then convert it to a boxplot using the geom_boxplot() command.
gg_box + geom_boxplot()
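The components of a boxplot can also be computed by hand. The sketch below assumes the convention that geom_boxplot() uses by default: the box runs from the first to the third quartile, the whiskers reach the most extreme observations within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual point. The values in x are made up.

```r
# made-up values with one unusually large observation
x <- c(2, 4, 5, 7, 8, 9, 11, 12, 30)

# the box: first quartile, median, third quartile
quartiles <- unname(quantile(x, c(0.25, 0.5, 0.75)))
q1  <- quartiles[1]
med <- quartiles[2]
q3  <- quartiles[3]
iqr <- q3 - q1

# fences: the furthest a whisker is allowed to reach
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# whiskers stop at the most extreme observations inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

# anything outside the fences is an outlier
outliers <- x[x < lower_fence | x > upper_fence]

c(lower_whisker, q1, med, q3, upper_whisker)
outliers
```

Here the box runs from 5 to 11 with the median at 8, the whiskers reach 2 and 12, and 30 is flagged as an outlier.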
3.1.6 Facets
These plot examples provide good overall descriptions of the variable we’re interested in: how often Trump is retweeted. But sometimes we want to know if such a variable differs according to another variable. To see this, take the following simple example. The heights object contains the reported heights of a group of students.
# read in the heights datafile
heights <- readRDS(url("https://github.com/QMUL-SPIR/Public_files/blob/master/datasets/heights.rds?raw=true"))
We can plot this as a histogram:
# histogram
gg_height <- ggplot(data = heights, aes(x = height)) +
  geom_histogram()
gg_height
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We could also plot this as a boxplot:
# boxplot
gg_height_box <- ggplot(data = heights, aes(x = 1, y = height)) +
  geom_boxplot()
gg_height_box
However, we also have data on the students’ reported sex. We know that, on average, men tend to be taller than women. So, we can look at their heights separately using facet_wrap(), which splits the plot into different ‘facets’:
# histogram facets
gg_height + facet_wrap(~ sex)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The variable after the ‘tilde’ (~) is used to create the different ‘facets’ or panes. This reveals, indeed, that male students are more clustered around taller heights, while female students tend to be less tall on average. This also shows us that we seem to have far more male students in the dataset.
We can do the same thing with virtually any plot, as long as we have a variable to facet by. This also works for boxplots:
# boxplot facets
gg_height_box + facet_wrap(~ sex)
But also, with boxplots, it might be clearer to compare within the same plot, by setting sex as the x axis variable:
# boxplot comparison
gg_height_box <- ggplot(data = heights, aes(x = sex, y = height)) +
  geom_boxplot()
gg_height_box
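Faceting works the same way for any variable with a handful of distinct values. As a self-contained sketch, the code below facets a histogram of fuel economy in the built-in mtcars dataset by number of cylinders. The dataset, the binwidth and the name gg_mpg are choices made for this example only.

```r
library(ggplot2)

# one panel per distinct value of cyl (number of cylinders: 4, 6 or 8)
gg_mpg <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ cyl)

gg_mpg

# the PANEL column of the computed data records which facet each bar is in
length(unique(layer_data(gg_mpg)$PANEL))
```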
3.1.7 Exercises
Note: some of these exercises might be slightly tricky this week. Just have a go at them, and make sure to look at the solutions when they are posted on Monday.
- Load this dataset about penguins:
# read in the penguins datafile
penguins <- readRDS(url("https://github.com/QMUL-SPIR/Public_files/blob/master/datasets/penguins.rds?raw=true"))
- Inspect the dataframe using head(), names(), and dim().
- Create a histogram of penguin flipper length.
- How does the distribution of flipper length vary by species? Take your histogram and add a facet_wrap() layer to see this.
- Add a title to the faceted plot using the ggtitle() layer. If you need help, try ?ggtitle.
- Create a scatterplot to show the relationship between bill depth and bill length. How would you describe this relationship?
- How does this relationship change when we break it down into species? Rewrite your scatterplot code so that the colour of the points varies by species.
- Save this plot using ggsave().
- What do you observe? Describe your findings.