4.2 Solutions
- Load the quality of government dataset.
  qog <- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/QoG2012.csv")
- Rename the variable wdi_gdpc to gdpc using rename() from the tidyverse.
- Delete all rows with missing values on gdpc.
- Inspect former_col and delete rows with missing values on it.
- Turn former_col into a factor variable with appropriate labels.
- Subset the dataset so that it only has the gdpc, undp_hdi, and former_col variables and sort the dataset by undp_hdi using arrange(), all in one ‘pipe’.
- Plot the distribution of gdpc conditional on former colony status, using theme_minimal(), labs() and scale_colour_discrete() to make sure your plot is clear and easy to understand.
- Compute the probability that a country is richer than 55,000 per capita.
- Compute the conditional expectation of wealth for a country that is not a former colony.
- What is the probability that a former colony is 2 standard deviations below the mean wealth level?
- Compute the probability that any country is in the wealth interval from 20,000 to 50,000.
4.2.0.1 Exercise 1
Load the quality of government dataset.
qog <- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/QoG2012.csv")
4.2.0.2 Exercise 2
Rename the variable wdi_gdpc to gdpc using rename() from the tidyverse.
library(tidyverse)
qog <- qog %>%
  rename(gdpc = wdi_gdpc)
4.2.0.3 Exercise 3
Delete all rows with missing values on gdpc.
# drop all rows with missing values on gdpc
qog <- qog %>%
  drop_na(gdpc)
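Before dropping rows, it can help to count how many values are missing. The following is a minimal base R sketch on a small made-up data frame; `toy` and its values are hypothetical and stand in for the QoG data.

```r
# toy data frame standing in for qog (hypothetical values)
toy <- data.frame(gdpc = c(1200, NA, 45000, 8000, NA))

# number of missing values on gdpc
n_missing <- sum(is.na(toy$gdpc))

# base R equivalent of drop_na(gdpc): keep rows where gdpc is observed
toy_complete <- toy[!is.na(toy$gdpc), , drop = FALSE]

n_missing
nrow(toy_complete)
```

Comparing the row count before and after is a quick way to confirm that only the rows with missing values were removed.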
4.2.0.4 Exercise 4
Inspect former_col and delete rows with missing values on it.
# we inspect the variable and check for missings
summary(qog$former_col)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 1.0000 0.6348 1.0000 1.0000
# no missing data!
4.2.0.5 Exercise 5
Turn former_col into a factor variable with appropriate labels.
# we check the current storage type
str(qog$former_col)
int [1:178] 0 0 1 1 1 0 1 0 0 1 ...
# it's numeric (or 'integer'), so we change it to nominal
qog$former_col <- factor(qog$former_col,
                         levels = c(0, 1),
                         labels = c("not colonised", "former colony"))
# or we could pipe it
# qog <- qog %>%
# mutate(former_col = factor(former_col,
# levels = c(0,1),
# labels = c("not colonised", "former colony")))
# let's check the results
table(qog$former_col)
not colonised former colony
65 113
4.2.0.6 Exercise 6
Subset the dataset so that it only has the gdpc, undp_hdi, and former_col variables and sort the dataset by undp_hdi using arrange(), all in one ‘pipe’.
qog <- qog %>%
  select(gdpc,
         undp_hdi,
         former_col) %>%
  arrange(undp_hdi) # sort by the human development index, as the task asks
4.2.0.7 Exercise 7
Plot the distribution of gdpc conditional on former colony status, using theme_minimal(), labs() and scale_colour_discrete() to make sure your plot is clear and easy to understand.
gg_gdpc <- ggplot(qog,
                  aes(gdpc,
                      group = former_col)) +
  geom_density(aes(colour = factor(former_col))) +
  labs(x = "GDP per capita", # clearer x axis label
       y = "Density", # clearer y axis label
       title = "Distribution of GDP conditional on former colony status") + # plot title
  scale_color_discrete(name = "Former colony", # change legend title
                       labels = c("Never colonised", # change legend labels
                                  "Colonised")) +
  theme_minimal()
gg_gdpc
4.2.0.8 Exercise 8
Compute the probability that a country is richer than 55,000 per capita.
# get the empirical cumulative distribution of wealth
dist_wealth <- ecdf(qog$gdpc)
# probability
1 - dist_wealth(55000)
[1] 0.01123596
The probability is approximately 0.01. Put differently, about 1 percent of countries are richer than 55,000 US dollars per capita.
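The empirical CDF evaluated at a threshold is the share of observations at or below it, so one minus that value is the share above it. A minimal sketch on made-up numbers (`gdpc_toy` is hypothetical, not the QoG data) shows that this matches a direct count:

```r
# toy GDP per capita values (hypothetical)
gdpc_toy <- c(500, 3000, 12000, 26000, 41000, 56000, 70000, 9000)

# the ECDF at a threshold is the share of values <= threshold
F_toy <- ecdf(gdpc_toy)
p_ecdf <- 1 - F_toy(55000)

# the same upper-tail share, computed directly as a proportion
p_direct <- mean(gdpc_toy > 55000)

p_ecdf
p_direct
```

Both approaches give the same answer; the ecdf() route is handy when several thresholds need to be evaluated on the same variable.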
4.2.0.9 Exercise 9
Compute the conditional expectation of wealth for a country that is not a former colony.
The conditional expectation is the mean of wealth among all countries that are not former colonies.
# dataset of just non-former colonies
non_colonies <- qog %>%
  filter(former_col == "not colonised")
# take mean of gdp
mean(non_colonies$gdpc)
[1] 16415.39
The conditional expectation of wealth for non-former colonies is approximately 16,415 US dollars per capita.
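A conditional expectation is just a group mean, so the means for both groups can also be computed in one step. The sketch below uses base R's tapply() on a small made-up data frame; `toy` and its values are hypothetical.

```r
# toy data with the same factor coding as former_col (hypothetical values)
toy <- data.frame(
  gdpc = c(30000, 20000, 4000, 6000, 2000),
  former_col = factor(c(0, 0, 1, 1, 1),
                      levels = c(0, 1),
                      labels = c("not colonised", "former colony"))
)

# mean of gdpc within each level of former_col
cond_means <- tapply(toy$gdpc, toy$former_col, mean)
cond_means
```

Filtering first, as in the solution, gives the mean for one group; tapply() returns a named vector with one conditional mean per group.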
4.2.0.10 Exercise 10
What is the probability that a former colony is 2 standard deviations below the mean wealth level?
We first find out what the standard deviation and mean of wealth are in the conditional distribution of wealth for former colonies.
# dataset of just former colonies
colonies <- qog %>%
  filter(former_col == "former colony")
# standard deviation of wealth for former colonies
sd_wealth_cols <- sd(colonies$gdpc)
sd_wealth_cols
[1] 9783.914
# but what is the mean?
mean_wealth_cols <- mean(colonies$gdpc)
mean_wealth_cols
[1] 6599.714
The standard deviation is greater than the mean, so two standard deviations below the mean is a negative value: 6,599.71 - 2 × 9,783.91 ≈ -12,968. Former colonies evidently differ a great deal: some do poorly and some do extremely well. Negative GDP per capita does not exist, so the probability that a former colony has a GDP two standard deviations below the mean is 0.
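The threshold implied by this reasoning can be checked numerically. In this minimal sketch the mean and standard deviation are copied from the output above, while `gdpc_toy` is a made-up vector used only to illustrate that an empirical CDF of non-negative values is 0 at a negative threshold.

```r
# summary statistics copied from the output above
mean_cols <- 6599.714
sd_cols <- 9783.914

# value two standard deviations below the mean
threshold <- mean_cols - 2 * sd_cols
threshold

# hypothetical non-negative GDP values, for illustration only
gdpc_toy <- c(300, 1500, 4000, 9000, 35000)

# share of toy observations at or below the (negative) threshold
p_below <- ecdf(gdpc_toy)(threshold)
p_below
```

With the real data the same check would be ecdf(colonies$gdpc) evaluated at the threshold, assuming the colonies object created above.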
4.2.0.11 Exercise 11
Compute the probability that any country is in the wealth interval from 20,000 to 50,000.
We compute the cumulative probabilities that a country has at most 20,000 and at most 50,000 and then take the difference.
# get the empirical cumulative distribution of wealth
dist_wealth_2 <- ecdf(qog$gdpc)
# cumulative probability of 20,000
p1 <- dist_wealth_2(20000)
# cumulative probability of 50,000
p2 <- dist_wealth_2(50000)
# probability of country in the interval
p2 - p1
[1] 0.1741573
The probability is approximately 0.17, so we expect about 17% of countries to fall in this interval.
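The difference of two cumulative probabilities covers the half-open interval above 20,000 and up to and including 50,000. A minimal sketch on made-up values (`gdpc_toy` is hypothetical) confirms that it equals the share of observations falling in that interval:

```r
# toy GDP per capita values (hypothetical)
gdpc_toy <- c(500, 3000, 12000, 20000, 26000, 41000, 50000, 70000)

# difference of cumulative probabilities, as in the solution
F_toy <- ecdf(gdpc_toy)
p_diff <- F_toy(50000) - F_toy(20000)

# direct share of values with 20,000 < gdpc <= 50,000
p_direct <- mean(gdpc_toy > 20000 & gdpc_toy <= 50000)

p_diff
p_direct
```

Note that a country with exactly 20,000 is excluded and one with exactly 50,000 is included, because the ECDF counts values at or below each cut-off.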