Chapter 1 Introduction to R and data manipulation

1.1 Seminar

In this seminar session, we introduce working with R. We illustrate some basic functionality and help you familiarise yourself with the look and feel of RStudio. Measures of central tendency and dispersion are easy to calculate in R. We focus on introducing the logic of R first and then describe how central tendency and dispersion are calculated at the end of the seminar.

1.1.1 Getting Started

Install R and RStudio on your computer by downloading them from the following sources:

1.1.2 RStudio

Let’s get acquainted with R. When you start RStudio for the first time, you’ll see three panes:

1.1.3 Console

The Console in RStudio is the simplest way to interact with R. You can type some code at the Console and when you press ENTER, R will run that code. Depending on what you type, you may see some output in the Console or if you make a mistake, you may get a warning or an error message.

Let’s familiarize ourselves with the console by using R as a simple calculator:

2 + 4
[1] 6

Now that we know how to use the + sign for addition, let’s try some other mathematical operations such as subtraction (-), multiplication (*), and division (/).

10 - 4
[1] 6
5 * 3
[1] 15
7 / 2
[1] 3.5
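
A few further operators follow the same pattern. The lines below are a small sketch with arbitrary numbers, showing how round brackets change the order of operations, exponentiation with ^, integer division with %/% and the remainder with %%:

```r
# Multiplication is carried out before addition
2 + 4 * 3      # 14

# Round brackets change the order of operations
(2 + 4) * 3    # 18

# Exponentiation: 2 to the power of 3
2^3            # 8

# Integer division and remainder
7 %/% 2        # 3
7 %% 2         # 1
```
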
You can use the cursor or arrow keys on your keyboard to edit your code at the console:
- Use the UP and DOWN keys to re-run something without typing it again
- Use the LEFT and RIGHT keys to edit

Take a few minutes to play around at the console and try different things out. Don’t worry if you make a mistake, you can’t break anything easily!

1.1.4 Functions

Functions are a set of instructions that carry out a specific task. Functions often require some input and generate some output. For example, instead of using the + operator for addition, we can use the sum function to add two or more numbers.

sum(1, 4, 10)
[1] 15

In the example above, 1, 4, 10 are the inputs and 15 is the output. A function call always requires parentheses, or round brackets (). Inputs to the function are called arguments and go inside the brackets. The output of a function is displayed on the screen, but we also have the option of saving the result. More on this later.
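
As a further illustration, here is a minimal sketch of calling functions with one or several arguments. The functions sqrt, round and sum are part of base R; the numbers are arbitrary examples.

```r
# One argument: the square root of 16
sqrt(16)                      # 4

# Two arguments: the number to round and the number of decimal places
round(3.14159, digits = 2)    # 3.14

# The output of one function can be the input of another
sum(sqrt(16), round(2.6))     # 4 + 3 = 7
```
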

1.1.5 Getting Help

Another useful function in R is help, which we can use to display online documentation. For example, if we wanted to know how to use the sum function, we could type help(sum) and look at the online documentation.

help(sum)

The question mark ? can also be used as a shortcut to access online help.

?sum

Use the toolbar button shown in the picture above to expand and display the help in a new window.

Help pages for functions in R follow a consistent layout and generally include these sections:

Description   A brief description of the function
Usage         The complete syntax or grammar including all arguments (inputs)
Arguments     Explanation of each argument
Details       Any relevant details about the function and its arguments
Value         The output value of the function
Examples      Example of how to use the function
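
If you only need a quick reminder of a function's arguments rather than the full help page, the base R function args prints them at the console. A small sketch:

```r
# Show the arguments that sum accepts
args(sum)

# Show the arguments of round, including the default value of digits
args(round)
```

What is printed corresponds to the Usage section of the help page.
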

1.1.6 The Assignment Operator

Now we know how to provide inputs to a function using parentheses, or round brackets (), but what about the output of a function?

We use the assignment operator <- for creating or updating objects. If we wanted to save the result of sum(1, 4, 10), we would do the following:

myresult <- sum(1, 4, 10)

The line above creates a new object called myresult in our environment and saves the result of the sum(1, 4, 10) in it. To see what’s in myresult, just type it at the console:

myresult
[1] 15

Take a look at the Environment pane in RStudio and you’ll see myresult there.

To delete all objects from the environment, you can use the broom button as shown in the picture above.

We called our object myresult but we can call it anything as long as we follow a few simple rules. Object names can contain upper or lower case letters (A-Z, a-z), numbers (0-9), underscores (_) or dots (.), and should start with a letter (R also accepts a leading dot, provided it is not followed by a number). Choose names that are descriptive and easy to type.

Good Object Names    Bad Object Names
result               a
myresult             x1
my.result            this.name.is.just.too.long
my_result
data1
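
The following sketch shows how an object can be updated and removed again. The object names total and doubled_total are made up for illustration; ls and rm are base R functions that list and delete objects.

```r
# Create an object, then overwrite it with a new value
total <- sum(1, 4, 10)   # total is 15
total <- total + 5       # total is now 20

# Use an existing object to create another one
doubled_total <- total * 2
doubled_total            # 40

# List the objects in the environment, then delete one of them
ls()
rm(doubled_total)
```

After rm(doubled_total), typing doubled_total at the console produces an error because the object no longer exists.
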

1.1.7 Sequences

We often need to create sequences when manipulating data. For instance, you might want to perform an operation on the first 10 rows of a dataset, so you need a way to select the range you’re interested in.

There are two ways to create a sequence. Let’s try to create a sequence of numbers from 1 to 10 using the two methods:

  1. Using the colon : operator. If you’re familiar with spreadsheets then you might’ve already used : to select cells, for example A1:A20. In R, you can use the : to create a sequence in a similar fashion:
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
  2. Using the seq function we get the exact same result:
seq(from = 1, to = 10)
 [1]  1  2  3  4  5  6  7  8  9 10

The seq function has a number of options which control how the sequence is generated. For example, to create a sequence from 0 to 100 in increments of 5, we can use the optional by argument. Notice how we wrote by = 5 as the third argument. It is common practice to name an argument when it is optional. The arguments from and to come first in the argument list and can be matched by position, so we can write seq(0, 100, by = 5) instead of seq(from = 0, to = 100, by = 5). Both are valid ways of achieving the same outcome. You can code whichever way you like, but we recommend writing code that is easy for your future self and others to read and understand.

seq(from = 0, to = 100, by = 5)
 [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90
[20]  95 100

Another common use of the seq function is to create a sequence of a specific length. Here, we create a sequence from 0 to 100 with length 9, i.e., the result is a vector with 9 elements.

seq(from = 0, to = 100, length.out =  9)
[1]   0.0  12.5  25.0  37.5  50.0  62.5  75.0  87.5 100.0
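
The by argument does not have to be a positive whole number. As a small sketch with arbitrary values (countdown and quarters are example names), a negative step counts down and a fractional step produces decimals:

```r
# Count down from 10 to 1
countdown <- seq(from = 10, to = 1, by = -1)
countdown

# Steps of 0.25 between 0 and 1
quarters <- seq(from = 0, to = 1, by = 0.25)
quarters

# The colon operator can also count down
10:1
```
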

Now it’s your turn:

  • Create a sequence of odd numbers between 0 and 100 and save it in an object called odd_numbers
odd_numbers <- seq(1, 100, 2)
  • Next, display odd_numbers on the console to verify that you did it correctly
odd_numbers
 [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
[26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99
  • What do the numbers in square brackets [ ] mean? Look at the number of values displayed in each line to find out the answer.

  • Use the length function to find out how many values are in the object odd_numbers.
    • HINT: Try help(length) and look at the examples section at the end of the help screen.
length(odd_numbers)
[1] 50
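
The introduction mentioned measures of central tendency and dispersion. As a brief sketch, the base R functions mean, median, var and sd can be applied to the odd_numbers object from the exercise (it is re-created here so that the snippet runs on its own):

```r
# Re-create the sequence of odd numbers from the exercise
odd_numbers <- seq(1, 100, 2)

# Central tendency
mean(odd_numbers)     # 50
median(odd_numbers)   # 50

# Dispersion: variance and standard deviation
var(odd_numbers)      # 850
sd(odd_numbers)       # the square root of the variance
```
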

1.1.8 Scripts

The Console is great for simple tasks but if you’re working on a project you would most likely want to save your work in some sort of document or file. Scripts in R are just plain text files that contain R code. You can edit a script just like you would edit a file in any word processing or note-taking application.

Create a new script using the menu or the toolbar button as shown below.

Once you’ve created a script, it is generally a good idea to give it a meaningful name and save it immediately. For our first session save your script as seminar1.R

Familiarize yourself with the script window in RStudio, and especially the two buttons labeled Run and Source.

There are a few different ways to run your code from a script.

One line at a time   Place the cursor on the line you want to run and hit CTRL-ENTER or use the Run button
Multiple lines       Select the lines you want to run and hit CTRL-ENTER or use the Run button
Entire script        Use the Source button

1.1.9 Working with Data

In the previous section, R may have seemed fairly labour-intensive. We had to enter all our data manually and each line of code had to be written into the command line. Fortunately this isn’t routinely the case, as we can use scripts to keep track of our code. Type the following into the script:

#This is my first R script
My.Data<- data.frame(0:10, 20:30)
print(My.Data)
   X0.10 X20.30
1      0     20
2      1     21
3      2     22
4      3     23
5      4     24
6      5     25
7      6     26
8      7     27
9      8     28
10     9     29
11    10     30

In the script window, if you highlight all the code you have written and press the “Run” button at the top of the script window, you will see that the code is sent to the command line and that the text after the # (known as a ‘comment’) is ignored. From now on, to run a command, type your code in the script window, ensure the cursor is on the correct line and press CTRL + ENTER or use the Run button. If you have an error, edit the line in the script and run the code again. The My.Data object is a data frame in need of some sensible column headings. You can add these by typing:

#Add column names
colnames(My.Data)<- c("X", "Y")

#Print My.Data object to check names were added successfully
print(My.Data)
    X  Y
1   0 20
2   1 21
3   2 22
4   3 23
5   4 24
6   5 25
7   6 26
8   7 27
9   8 28
10  9 29
11 10 30
#You can also print by just typing the object name
My.Data
    X  Y
1   0 20
2   1 21
3   2 22
4   3 23
5   4 24
6   5 25
7   6 26
8   7 27
9   8 28
10  9 29
11 10 30
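
Column names can also be supplied when the data frame is created, which avoids the automatically generated headings. A minimal sketch (the object name my_data is just an example):

```r
# Name the columns directly inside data.frame()
my_data <- data.frame(X = 0:10, Y = 20:30)

# Check the column names and the first few rows
colnames(my_data)
head(my_data, n = 3)
```
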

Until now we have generated the data used in the examples above. One of R’s great strengths is its ability to load in data from almost any file format. Comma Separated Value (CSV) files are our preferred choice. These can be thought of as stripped down Excel spreadsheets. They are an extremely simple format so they are easily machine readable and can therefore be easily read in and written out of R. Since we are now reading and writing files it is good practice to tell R what your working directory is.

The working directory is the folder on the computer where you wish to store the data files you are working with. You can create a folder called “POL252” for example. If you are using RStudio, on the lower right of the screen is a window with a “Files” tab. If you click on this tab you can then navigate to the folder you wish to use. You can then click on the “More” button and then “Set as Working Directory”. You should then see some code similar to the below appear in the command line. It is also possible to type the code in manually.

#Set the working directory. The bit between the "" needs to specify the path to the folder you wish to use (you will see my file path below).  You may need to create the folder first.

setwd("~/POL252") # Note the single / (\\ will also work).

Note that, as we are mostly using RStudio Cloud on this course, there shouldn’t be any need to set a working directory. Instead, each week automatically has its own project directory which you will access when working on the relevant seminar. This is still worth knowing, though, as if you use R beyond this module then you will likely want the RStudio programme installed directly on your computer, in which case you’ll need to get to grips with working directories.
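
To check which folder R is currently using, base R provides getwd, and list.files shows what that folder contains. A short sketch:

```r
# Print the current working directory
getwd()

# List the files in the working directory
list.files()

# Check whether a folder exists before switching to it
# ("~/POL252" is the example folder from above and may not exist on your machine)
dir.exists("~/POL252")
```
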

One way of opening data is by using the read.csv() function. The file can be saved on your computer (preferably in your working directory), in your RStudio Cloud project directory, or even online on a separate website. When we want to download a data file from a website, we put the URL between quotation marks (“”). We are going to load a dataset that shows London’s historic population for each of its Boroughs.

#Load a dataset in a csv file
pop <- read.csv("https://data.london.gov.uk/download/historic-census-population/2c7867e5-3682-4fdd-8b9d-c63e289b92a6/census-historic-population-borough.csv")

To view the object type, we use:

class(pop)
[1] "data.frame"

We now know that the type is data frame, which will have implications for our future seminars. We can also look at the contents of the object. Data frames can be too large to show in the console, so we usually look at the beginning of the dataset - head() - or the end - tail().

head(pop)
  Area.Code            Area.Name Persons.1801 Persons.1811 Persons.1821
1      00AA       City of London       129000       121000       125000
2      00AB Barking and Dagenham         3000         4000         5000
3      00AC               Barnet         8000         9000        11000
4      00AD               Bexley         5000         6000         7000
5      00AE                Brent         2000         2000         3000
6      00AF              Bromley         8000         9000        11000
  Persons.1831 Persons.1841 Persons.1851 Persons.1861 Persons.1871 Persons.1881
1       123000       124000       128000       112000        75000        51000
2         6000         7000         8000         8000        10000        13000
3        13000        14000        15000        20000        29000        41000
4         9000        11000        12000        15000        22000        29000
5         3000         5000         5000         6000        19000        31000
6        12000        14000        16000        22000        42000        62000
  Persons.1891 Persons.1901 Persons.1911 Persons.1921 Persons.1931 Persons.1939
1        38000        27000        20000        14000        11000         9000
2        19000        27000        39000        44000       138000       184000
3        58000        76000       118000       147000       231000       296000
4        37000        54000        60000        76000        95000       179000
5        65000       120000       166000       184000       251000       310000
6        83000       100000       116000       127000       165000       237000
  Persons.1951 Persons.1961 Persons.1971 Persons.1981 Persons.1991 Persons.2001
1         5000         4767         4000         5864         4230         7181
2       189000       177092       161000       149786       140728       163944
3       320000       318373       307000       293436       284106       314565
4       205000       209893       217000       215233       211404       218301
5       311000       295893       281000       253275       227903       263466
6       268000       294440       305000       296539       282920       295535
  Persons.2011
1         7375
2       185911
3       356386
4       231997
5       311215
6       309392
tail(pop)
   Area.Code      Area.Name Persons.1801 Persons.1811 Persons.1821 Persons.1831
31      00BH Waltham Forest         8000         9000        11000        11000
32      00BJ     Wandsworth        13000        16000        19000        23000
33      00BK    Westminster       231000       254000       298000       354000
34      UKI1   Inner London       939000      1109000      1349000      1624000
35      UKI2   Outer London       162000       193000       224000       256000
36         H Greater London      1097000      1303000      1573000      1878000
   Persons.1841 Persons.1851 Persons.1861 Persons.1871 Persons.1881
31        12000        13000        17000        28000        54000
32        28000        36000        52000        99000       173000
33       402000       462000       512000       524000       513000
34      1904000      2308000      2745000      3244000      3906000
35       302000       342000       445000       602000       817000
36      2207000      2651000      3188000      3841000      4713000
   Persons.1891 Persons.1901 Persons.1911 Persons.1921 Persons.1931
31       112000       198000       258000       267000       283000
32       251000       319000       369000       380000       388000
33       481000       460000       421000       390000       372000
34      4432000      4898000      5002000      4978000      4898000
35      1142000      1609000      2160000      2408000      3213000
36      5572000      6510000      7162000      7387000      8110000
   Persons.1939 Persons.1951 Persons.1961 Persons.1971 Persons.1981
31       286000       275000       248591       235000       215947
32       358000       331000       335451       302000       254898
33       347000       300000       271703       240000       191098
34      4441000      3680000      3492881      3031000      2497978
35      4176000      4513000      4504213      4422000      4215187
36      8615000      8197000      7997094      7452000      6713165
   Persons.1991 Persons.2001 Persons.2011
31       203343       218335       258249
32       239162       260379       306995
33       177743       181284       219396
34      2343133      2766065      3231901
35      4050435      4405992      4942040
36      6393568      7172057      8173941

R has a number of useful functions for getting to know the file you have loaded. We can use these to find out how many columns (variables) and rows (cases) the data frame (dataset) contains.

#Get the number of columns

ncol(pop)
[1] 24
#Get the number of rows

nrow(pop)
[1] 36
#List the column headings

names(pop)
 [1] "Area.Code"    "Area.Name"    "Persons.1801" "Persons.1811" "Persons.1821"
 [6] "Persons.1831" "Persons.1841" "Persons.1851" "Persons.1861" "Persons.1871"
[11] "Persons.1881" "Persons.1891" "Persons.1901" "Persons.1911" "Persons.1921"
[16] "Persons.1931" "Persons.1939" "Persons.1951" "Persons.1961" "Persons.1971"
[21] "Persons.1981" "Persons.1991" "Persons.2001" "Persons.2011"
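
Two further base R functions are handy for a first look at a data frame: dim returns the number of rows and columns in one step, and str prints the type of each column. The sketch below uses a small made-up data frame (my_data) rather than pop so that it runs without downloading anything:

```r
# A small example data frame (not the London data)
my_data <- data.frame(X = 0:10, Y = 20:30)

# Number of rows and columns in one call
dim(my_data)     # 11 rows, 2 columns

# Compact overview: column names, types and the first values
str(my_data)

# head() takes an optional n argument for the number of rows to show
head(my_data, n = 3)
```
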

Given the number of columns in the pop data frame, selecting only the columns of interest would make it easier to handle. In R there are two ways of doing this. The first uses the $ symbol to select columns by name and then creates a new data frame object.

#Select the columns containing the Borough names and the 2011 population

pop.2011 <- data.frame(pop$Area.Name, pop$Persons.2011)

head(pop.2011)
         pop.Area.Name pop.Persons.2011
1       City of London             7375
2 Barking and Dagenham           185911
3               Barnet           356386
4               Bexley           231997
5                Brent           311215
6              Bromley           309392

A second approach to selecting particular data is to use [Row, Column].

#Select the 1st row of the second column

pop[1,2]
[1] City of London
36 Levels: Barking and Dagenham Barnet Bexley Brent Bromley ... Westminster
#Select the first 5 rows of the first column

pop[1:5,1]
[1] 00AA 00AB 00AC 00AD 00AE
36 Levels: 00AA 00AB 00AC 00AD 00AE 00AF 00AG 00AH 00AJ 00AK 00AL 00AM ... UKI2
#Select the first 5 rows of columns 8 to 11

pop[1:5, 8:11]
  Persons.1851 Persons.1861 Persons.1871 Persons.1881
1       128000       112000        75000        51000
2         8000         8000        10000        13000
3        15000        20000        29000        41000
4        12000        15000        22000        29000
5         5000         6000        19000        31000
#Assign the previous selection to a new object

pop.subset <- pop[1:5, 8:11]
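
Columns can also be selected by name inside the square brackets, which is often easier to read than column numbers. The sketch below uses a three-row copy of pop, with values taken from the head(pop) output shown earlier; pop_small is a made-up name used so that the snippet runs on its own.

```r
# First three rows of the London data, copied from the head(pop) output
pop_small <- data.frame(
  Area.Name    = c("City of London", "Barking and Dagenham", "Barnet"),
  Persons.1911 = c(20000, 39000, 118000),
  Persons.2011 = c(7375, 185911, 356386)
)

# Rows 1 to 2 of the columns Area.Name and Persons.2011
pop_small[1:2, c("Area.Name", "Persons.2011")]

# Leaving the row position empty selects all rows
pop_small[, "Persons.2011"]

# Leaving the column position empty selects all columns
pop_small[3, ]
```
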

In the code above, note how the colon : is used to specify a range of values. We used the same technique to create the My.Data object earlier. The ability to select particular columns means we can see how the population of London’s Boroughs has changed over the past century.

# Within the brackets you can add additional columns to the data frame so long as they are separated by commas

PopChange <- data.frame(pop$Area.Name, pop$Persons.2011-pop$Persons.1911)

If you type head(PopChange) you will see that the population change column (created to the right of the comma above) has a very long name. This can be changed using the names() or colnames() function.

colnames(PopChange)<- c("Borough", "Change_1911_2011")
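
Once the change column exists, the base R function order can sort the boroughs by it. This is a sketch on a three-row copy of pop (the 1911 and 2011 values are taken from the head(pop) output shown earlier); pop_small, pop_change and pop_sorted are illustrative names, and the column names are set when the data frame is created.

```r
# First three rows of the London data, copied from the head(pop) output
pop_small <- data.frame(
  Area.Name    = c("City of London", "Barking and Dagenham", "Barnet"),
  Persons.1911 = c(20000, 39000, 118000),
  Persons.2011 = c(7375, 185911, 356386)
)

# Population change between 1911 and 2011
pop_change <- data.frame(
  Borough = pop_small$Area.Name,
  Change_1911_2011 = pop_small$Persons.2011 - pop_small$Persons.1911
)

# Sort the rows from the largest to the smallest change
pop_sorted <- pop_change[order(pop_change$Change_1911_2011, decreasing = TRUE), ]
pop_sorted
```
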

Since we have done some new analysis and created additional information it would be good to save the PopChange object to our working directory. This is done using the code below. Within the brackets we put the name of the R object we wish to save on the left of the comma and the file name on the right of the comma (this needs to be in inverted commas). Remember to put “.csv” after since this is the file format we are saving in.

write.csv(PopChange, "Population_Change_1911_2011.csv")
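
By default write.csv also writes the row numbers as an extra first column. If that is not wanted, the row.names argument can be set to FALSE. The sketch below writes a small made-up data frame to a temporary file (created with tempfile, so nothing is saved in your working directory) and reads it back in to check the result; example_data, csv_path and check are illustrative names.

```r
# A small example data frame (illustrative values only)
example_data <- data.frame(Borough = c("A", "B", "C"),
                           Change = c(10, -5, 20))

# Write to a temporary CSV file without the row numbers
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Read the file back in and look at the column names
check <- read.csv(csv_path)
names(check)
```
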

1.1.10 Exercises

  1. Create a script and call it assignment01. Save your script.
  2. Download this cheat-sheet and go over it. You won’t understand most of it right away, but it will become a useful resource. Look at it often.
  3. Calculate the square root of 1369 using the sqrt() function.
  4. Square the number 13 using the ^ operator.
  5. What is the result of summing all numbers from 1 to 100?

  6. Using the London borough data, create a CSV file that contains the following columns:

  • The names of the London Boroughs

  • Population change between 1811 and 1911

  • Population change between 1911 and 1961

  • Population change between 1961 and 2011

  7. Which Boroughs had the most population growth during the 19th Century, and which had the least?

  8. You may have noticed that there is an additional column in the pop data frame called Borough-Type. This indicates if a Borough is in inner (1) or outer (2) London. Is this variable ordinal or nominal?

  9. Save your R script by pressing the Save button in the script window.