6.2 Solutions

6.2.1 Exercise 1

Using better_model, where we included the square of GDP/capita, what is the effect of: a. an increase of GDP/capita from 5000 to 15000? b. an increase of GDP/capita from 25000 to 35000?

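# load the packages used in these solutions
library(tidyverse) # rename(), drop_na(), ggplot()
library(texreg)    # screenreg()

# load world data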
a <- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/QoG2012.csv")

# rename variables
a <- rename(a,
            human_development = undp_hdi,
            institutions_quality = wbgi_cce,
            gdp_capita = wdi_gdpc)

# drop missings
a <- drop_na(a,
             human_development, institutions_quality, gdp_capita)

# create factor again
a$former_col <- factor(a$former_col, labels = c("never colonies", "ex colonies"))

# re-run better model
better_model <- lm(human_development ~ poly(gdp_capita, 2), data = a)

For a., we make two predictions: one where GDP/capita is 5000 and one where it is 15000.

y_hat1 <- predict(better_model, newdata = data.frame(gdp_capita = 5000))
# predicted quality of life if gdp/capita is 5000
y_hat1

        1 
0.6443723

y_hat2 <- predict(better_model, newdata = data.frame(gdp_capita = 15000))
# predicted quality of life if gdp/capita is 15000
y_hat2

        1 
0.8272318

The effect of raising gdp/capita from 5000 to 15000 is the difference between our two predictions (called the first difference).

y_hat2 - y_hat1

        1 
0.1828595

According to our model, the quality of life improves by 0.18 when we raise GDP/capita from 5000 to 15000. Given that the human development index has a theoretical range from 0 to 1, this effect is extremely large.

For b. we go through the same procedure.

y_hat1 <- predict(better_model, newdata = data.frame(gdp_capita = 25000))
y_hat2 <- predict(better_model, newdata = data.frame(gdp_capita = 35000))
y_hat2 - y_hat1

         1 
0.04116257

The quality of life improves by only 0.04 when we increase GDP/capita from 25000 to 35000. Although the increase in wealth is 10 000 US$ in both scenarios, the effect is much larger if the society is not already rich.
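
The two-prediction procedure can be wrapped into a small helper. The following is a minimal sketch on simulated data rather than the QoG data; the helper name first_difference() and the simulated data frame sim are hypothetical and serve only as an illustration.

```r
# simulated data (an assumption for illustration, not the QoG data)
set.seed(123)
sim <- data.frame(gdp_capita = runif(200, min = 500, max = 40000))
sim$human_development <- 0.3 + 0.12 * log(sim$gdp_capita / 500) +
  rnorm(200, sd = 0.02)

# quadratic model, specified like better_model
sim_model <- lm(human_development ~ poly(gdp_capita, 2), data = sim)

# hypothetical helper: difference between the predictions at two GDP levels
first_difference <- function(model, from, to) {
  y_hat <- predict(model, newdata = data.frame(gdp_capita = c(from, to)))
  unname(y_hat[2] - y_hat[1])
}

fd_poor <- first_difference(sim_model, from = 5000, to = 15000)
fd_rich <- first_difference(sim_model, from = 25000, to = 35000)
c(poor = fd_poor, rich = fd_rich)
```

Because the fitted curve is concave, the first difference at low GDP levels should come out larger than the one at high GDP levels, which is the pattern we found above.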

6.2.2 Exercise 2

You can see that the curve in our quadratic plot curves down when countries become very rich. Speculate whether that result makes sense and what the reason for this might be.

The downward curve does not make sense because it does not reflect a relationship that we actually observe in our data. The decline in life quality is an artefact of the functional form: a quadratic that flattens out as GDP rises has to slope down at some point. We would not want to conclude that increasing wealth at some point leads to a decline in the quality of life.
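
To see why a quadratic must eventually turn, note that a parabola $b_0 + b_1 x + b_2 x^2$ with $b_2 < 0$ reaches its maximum at $x = -b_1 / (2 b_2)$ and declines afterwards. The sketch below illustrates this on simulated data (not the QoG data). The variable names gdp_k and hdi are hypothetical, and GDP/capita is measured in thousands here, purely as an assumption to keep the raw polynomial terms on a comparable scale.

```r
# simulated data (an assumption for illustration, not the QoG data)
set.seed(123)
sim <- data.frame(gdp_k = runif(200, min = 0.5, max = 40)) # GDP/capita in 1000s
sim$hdi <- 0.3 + 0.12 * log(sim$gdp_k / 0.5) + rnorm(200, sd = 0.02)

# raw = TRUE returns coefficients on gdp_k and gdp_k^2 directly
quad <- lm(hdi ~ poly(gdp_k, 2, raw = TRUE), data = sim)
b <- unname(coef(quad))

# GDP level at which the fitted parabola peaks and starts to decline
turning_point <- -b[2] / (2 * b[3])
turning_point
```

The simulated relationship never declines, yet the fitted quadratic still has a turning point. The same logic applies to the down-curve in our plot.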

6.2.3 Exercise 3

Raise GDP/capita to the highest power that significantly improves model fit, using poly(). a. Does your new model solve the potentially artificial down-curve for rich countries? b. Does the new model improve upon the old model? c. Plot the new model.

To answer this question, we raise the power of GDP/capita by one at a time and compare model fit until adding another power no longer improves model fit.

# power of 3
m.p3 <- lm(human_development ~ poly(gdp_capita, 3), data = a)
# compare cubic with quadratic using f test
anova(better_model, m.p3) # p < 0.05, so cubic is better

Analysis of Variance Table

Model 1: human_development ~ poly(gdp_capita, 2)
Model 2: human_development ~ poly(gdp_capita, 3)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    169 1.8600                                  
2    168 1.4414  1   0.41852 48.779 6.378e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# power of 4
m.p4 <- lm(human_development ~ poly(gdp_capita, 4), data = a)
# compare models using f test
anova(m.p3, m.p4) # p < 0.05, so new model is better

Analysis of Variance Table

Model 1: human_development ~ poly(gdp_capita, 3)
Model 2: human_development ~ poly(gdp_capita, 4)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    168 1.4414                                  
2    167 1.2653  1   0.17612 23.244 3.191e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# power of 5
m.p5 <- lm(human_development ~ poly(gdp_capita, 5), data = a)
# compare models using f test
anova(m.p4, m.p5) # p < 0.05, so new model is better

Analysis of Variance Table

Model 1: human_development ~ poly(gdp_capita, 4)
Model 2: human_development ~ poly(gdp_capita, 5)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    167 1.2653                                  
2    166 1.0193  1   0.24597 40.056 2.213e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# power of 6
m.p6 <- lm(human_development ~ poly(gdp_capita, 6), data = a)
# compare models using f test
anova(m.p5, m.p6) # p < 0.05, so new model is better

Analysis of Variance Table

Model 1: human_development ~ poly(gdp_capita, 5)
Model 2: human_development ~ poly(gdp_capita, 6)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    166 1.01935                                  
2    165 0.93283  1  0.086524 15.305 0.0001335 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# power of 7
m.p7 <- lm(human_development ~ poly(gdp_capita, 7), data = a)
# compare models using f test
anova(m.p6, m.p7) # p < 0.05, so new model is better

Analysis of Variance Table

Model 1: human_development ~ poly(gdp_capita, 6)
Model 2: human_development ~ poly(gdp_capita, 7)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    165 0.93283                                  
2    164 0.87032  1  0.062509 11.779 0.0007582 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# power of 8
m.p8 <- lm(human_development ~ poly(gdp_capita, 8), data = a)
# compare models using f test
anova(m.p7, m.p8) # p > 0.05, so the 8th power does not significantly improve fit

Analysis of Variance Table

Model 1: human_development ~ poly(gdp_capita, 7)
Model 2: human_development ~ poly(gdp_capita, 8)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    164 0.87032                           
2    163 0.86650  1 0.0038174 0.7181  0.398

The result is that the 7th degree polynomial is the highest that significantly improves model fit. We had to add powers of GDP manually to find the answer. For those of you who are interested, there is a quicker, programmatic way to solve this problem by writing a loop. We show you how to do so below. Play around with this if you are interested, but you will not be required to be able to do this (we will not test you on this).

# the initial model to compare to
comparison.model <- better_model
p <- 0.05 # initial value, so that the loop runs at least once
power <- 2 # the initial power

# loop until p is larger than 0.05
while(p <= 0.05){
  # raise the power by 1
  power <- power + 1
  # fit the new model with the power raised up by 1
  current.model <- lm(human_development ~ poly(gdp_capita, power), data = a)
  # run the f-test
  f <- anova(comparison.model, current.model)
  # extract p value
  p <- f$`Pr(>F)`[2]
  # comparison model becomes the current model if current model is better
  if (p <= 0.05) comparison.model <- current.model
}
screenreg(comparison.model)


====================================
                          Model 1   
------------------------------------
(Intercept)                 0.70 ***
                           (0.01)   
poly(gdp_capita, power)1    1.66 ***
                           (0.07)   
poly(gdp_capita, power)2   -1.00 ***
                           (0.07)   
poly(gdp_capita, power)3    0.65 ***
                           (0.07)   
poly(gdp_capita, power)4   -0.42 ***
                           (0.07)   
poly(gdp_capita, power)5    0.50 ***
                           (0.07)   
poly(gdp_capita, power)6   -0.29 ***
                           (0.07)   
poly(gdp_capita, power)7    0.25 ***
                           (0.07)   
------------------------------------
R^2                         0.84    
Adj. R^2                    0.84    
Num. obs.                 172       
====================================
*** p < 0.001; ** p < 0.01; * p < 0.05
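
Each F-test above compares two nested models that differ by a single term. In that situation the F-test is equivalent to the t-test on the added term in the larger model (F = t²), so the p-value of the highest-order term in a regression table carries the same information. The sketch below checks this on simulated data; the variables x and y are hypothetical.

```r
# simulated data (an assumption for illustration, not the QoG data)
set.seed(42)
sim <- data.frame(x = runif(150, min = 0, max = 10))
sim$y <- sin(sim$x / 2) + rnorm(150, sd = 0.2)

m3 <- lm(y ~ poly(x, 3), data = sim)
m4 <- lm(y ~ poly(x, 4), data = sim)

# p-value from the F-test comparing the cubic and the quartic model
p_anova <- anova(m3, m4)$`Pr(>F)`[2]

# p-value of the 4th-degree term (row 5: intercept plus four powers)
p_ttest <- summary(m4)$coefficients[5, "Pr(>|t|)"]

all.equal(p_anova, p_ttest)
```

This also explains why the stars on the 7th-degree term in the table above agree with the result of anova(m.p6, m.p7).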

a. Does your new model solve the potentially artificial down-curve for rich countries?
b. Does the new model improve upon the old model?
c. Plot the new model.

We plot the polynomial to answer a). To do so, we vary GDP/capita from its minimum to its maximum. These are the GDP values that we plot on the x-axis. We use the predict() function to predict outcomes ($\hat{Y}$).

# our sequence of 100 GDP/capita values
gdp_seq <- seq(from = 226, to = 63686, length.out = 100)

# we set our covariate values (here we only have one covariate: GDP/capita)
x <- data.frame(gdp_capita = gdp_seq)

# we predict the outcome (human development index) for each of the 100 GDP levels
y_hat <- predict(m.p7, newdata = x)
x$y_hat <- y_hat

# plot
ggplot(data = a,
       mapping = aes(x = gdp_capita,
                     y = human_development)) +
  geom_point() +
  geom_line(data = x, mapping = aes(x = gdp_capita, y = y_hat))

The model fit improves when we fit a 7th degree polynomial to the data. A 7th degree polynomial is very flexible, so it can fit the points well. However, it is very important to remember that we have a sample of data. This sample is subject to sampling variability. That means our sample contains some idiosyncratic aspects that do not reflect the systematic pattern between GDP/capita and the human development index. We call the systematic pattern the “signal” and the random idiosyncratic part “noise”.

Our 7th degree polynomial is too flexible. It fits the data in our sample too well. We almost certainly fit our model not just to the signal but also to the noise. We want to be parsimonious with our use of polynomials. Without advanced statistics, the general advice is to stay clear of higher degree polynomials. In published articles you often see a quadratic term. You may see a cubic term. Anything above is unusual.
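
One way to make the signal and noise argument concrete is to fit polynomials of increasing degree to one half of a data set and evaluate them on the other half. The following is a minimal sketch on simulated data (the variables x and y and the 50/50 split are assumptions for illustration, not part of the exercise). The in-sample residual sum of squares can only go down as we add powers, whereas the error on the held-out half need not.

```r
# simulated data (an assumption for illustration, not the QoG data)
set.seed(2024)
n <- 200
sim <- data.frame(x = runif(n, min = 0, max = 10))
sim$y <- log(1 + sim$x) + rnorm(n, sd = 0.3)

# split into a training half and a held-out half
train <- sim[1:100, ]
test  <- sim[101:200, ]

degrees   <- 1:8
rss_train <- numeric(length(degrees))
mse_test  <- numeric(length(degrees))

for (d in degrees) {
  fit <- lm(y ~ poly(x, d), data = train)
  # fit to the data the model has seen
  rss_train[d] <- sum(residuals(fit)^2)
  # prediction error on data the model has not seen
  mse_test[d] <- mean((test$y - predict(fit, newdata = test))^2)
}

data.frame(degree = degrees, rss_train, mse_test)
```

Comparing the two columns shows whether additional powers pick up signal (both columns improve) or mostly noise (only the in-sample column improves).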

6.2.4 Exercise 4

Run a model on the human development index (hdi), interacting an independent judiciary (h_j) and control of corruption (corruption_control). What is the effect of control of corruption: a. In countries without an independent judiciary? b. When there is an independent judiciary? c. Illustrate your results. d. Does the interaction improve model fit?

m1 <- lm(human_development ~ institutions_quality * h_j, data = a)
screenreg(m1)


====================================
                          Model 1   
------------------------------------
(Intercept)                 0.67 ***
                           (0.02)   
institutions_quality        0.10 ***
                           (0.02)   
h_j                         0.05 *  
                           (0.03)   
institutions_quality:h_j    0.01    
                           (0.03)   
------------------------------------
R^2                         0.48    
Adj. R^2                    0.47    
Num. obs.                 158       
====================================
*** p < 0.001; ** p < 0.01; * p < 0.05

What is the effect of quality of institutions in countries without an independent judiciary?

The effect of institutions quality is $\beta_1 = 0.10$.

What is the effect of quality of institutions when there is an independent judiciary?

The effect of institutions quality is $\beta_1 + \beta_3 = 0.10 + 0.01 = 0.11$.
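
Both answers follow from writing out the interaction model. As a sketch, let $X$ denote the quality of institutions and $D$ the independent judiciary dummy (these two symbols are introduced here for illustration; the $\beta$s are numbered in the order of the regression table):

$$
\hat{Y} = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 (X \times D)
$$

$$
\frac{\partial \hat{Y}}{\partial X} = \beta_1 + \beta_3 D =
\begin{cases}
\beta_1 = 0.10 & \text{if } D = 0 \\
\beta_1 + \beta_3 = 0.10 + 0.01 = 0.11 & \text{if } D = 1
\end{cases}
$$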

Illustrate your results.

# vary institutions quality
summary(a$institutions_quality)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.69953 -0.81039 -0.28942 -0.01987  0.54041  2.44565

# sequence of quality of institutions
inst.qual <- seq(-1.7, 2.4, length.out = 100)

# set covariates when free judiciary is 0
x1 <- data.frame(institutions_quality = inst.qual, h_j = 0)

# set covariates when free judiciary is 1
x2 <- data.frame(institutions_quality = inst.qual, h_j = 1)

# predictions
y_hat1 <- predict(m1, newdata = x1)
x1$y_hat <- y_hat1
y_hat2 <- predict(m1, newdata = x2)
x2$y_hat <- y_hat2

# free judiciary
a$h_j <- factor(a$h_j, c(0, 1), c("controlled judiciary", "independent judiciary"))

ggplot(data = a,
       mapping = aes(x = institutions_quality,
                     y = human_development)) +
  geom_point() +
  geom_line(data = x1, 
            mapping = aes(x = institutions_quality, y = y_hat),
            colour = "red") +
  geom_line(data = x2, 
            mapping = aes(x = institutions_quality, y = y_hat),
            colour = "blue")

The effect of the quality of institutions does not seem to be conditional on whether a country has a controlled or an independent judiciary. The interaction term is not statistically significant and we can see that the slopes of the two lines are quite similar. We would not interpret the effect of the quality of institutions as conditional. It is substantively similar in both groups.

m1 <- lm(human_development ~ institutions_quality * h_j, data = a)
screenreg(m1)


=========================================================
                                               Model 1   
---------------------------------------------------------
(Intercept)                                      0.67 ***
                                                (0.02)   
institutions_quality                             0.10 ***
                                                (0.02)   
h_jindependent judiciary                         0.05 *  
                                                (0.03)   
institutions_quality:h_jindependent judiciary    0.01    
                                                (0.03)   
---------------------------------------------------------
R^2                                              0.48    
Adj. R^2                                         0.47    
Num. obs.                                      158       
=========================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
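
The slopes of the two lines in the plot can also be read off the coefficients directly. The following is a minimal sketch on simulated data (not the QoG data; the data frame sim and the true coefficients used to generate it are assumptions). It computes the slope in each group from the coefficients and confirms that predictions one unit apart give the same numbers. Note that h_j is kept numeric here, so the coefficient names differ from the table above.

```r
# simulated data (an assumption for illustration, not the QoG data)
set.seed(7)
sim <- data.frame(
  institutions_quality = rnorm(150),
  h_j = rbinom(150, size = 1, prob = 0.5)
)
sim$human_development <- 0.65 + 0.10 * sim$institutions_quality +
  0.05 * sim$h_j + rnorm(150, sd = 0.1)

m_sim <- lm(human_development ~ institutions_quality * h_j, data = sim)
b <- coef(m_sim)

# slope of institutions quality in each group, from the coefficients
slope_h0 <- unname(b["institutions_quality"])
slope_h1 <- unname(b["institutions_quality"] + b["institutions_quality:h_j"])

# the same slopes recovered from predictions one unit apart
grid <- expand.grid(institutions_quality = c(0, 1), h_j = c(0, 1))
grid$y_hat <- predict(m_sim, newdata = grid)
slopes_from_pred <- tapply(grid$y_hat, grid$h_j, diff)

c(h_j_0 = slope_h0, h_j_1 = slope_h1)
```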

Does the interaction improve model fit?

m_no_interaction <- lm(human_development ~ institutions_quality + h_j, data = a)
anova(m_no_interaction, m1)

Analysis of Variance Table

Model 1: human_development ~ institutions_quality + h_j
Model 2: human_development ~ institutions_quality * h_j
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    155 2.8102                           
2    154 2.8061  1 0.0041704 0.2289  0.633

The F-test confirms that the interaction does not significantly improve model fit (p = 0.63). We fail to reject the null hypothesis that the interaction model explains the quality of life no better than the model without the interaction.
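
The F statistic in the table can be reproduced by hand from the reported sums of squares. The sketch below copies the numbers from the ANOVA output above (the variable names are our own). Up to rounding, it should return the F value of 0.2289 and the p-value of 0.633.

```r
# values copied from the ANOVA table above
rss_full <- 2.8061    # RSS of the interaction model
df_full  <- 154       # residual degrees of freedom of the interaction model
ss_diff  <- 0.0041704 # reduction in RSS from adding the interaction
df_diff  <- 1         # number of added parameters

# F = (reduction in RSS per added parameter) / (residual variance of full model)
f_stat  <- (ss_diff / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = df_diff, df2 = df_full, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```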