Chapter 5 Multiple linear regression models (I)

5.1 Seminar – in class exercises

Today we will again be using texreg, so if you haven’t installed it, make sure you do so! Otherwise, let’s load that and, of course, the tidyverse.

install.packages("texreg") # install only once
library(texreg) # load in the beginning of every R session
library(tidyverse) # also load tidyverse!

To get a deeper understanding of what the tidyverse has to offer, see the information in Going further with the tidyverse: dplyr, magrittr and ggplot2.

5.1.1 Loading, Understanding and Cleaning our Data

Today, we load the full standard (cross-sectional) dataset from the Quality of Government Institute (this is a newer version than the one we have used previously). This is a great data source for comparative political science research. The codebook is available from their main website. You can also find time-series datasets on this page.

The dataset is saved on the Queen Mary GitHub account, so let’s load it from there:

# load dataset 
world_data <- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/qog_20.csv")

# check the dimensions of the dataset
dim(world_data)
[1]  194 1758

The dataset contains many variables. We will select a subset of variables that we want to work with. We are interested in political stability. Specifically, we want to find out what predicts the level of political stability. Therefore, political_stability is our dependent variable (also called response variable, left-hand-side variable, explained/predicted variable). We will also select a variable that identifies each row (observation) in the dataset uniquely: cname which is the name of the country. Potential predictors (independent variables, right-hand-side variables, covariates) are:

  1. lp_lat_abst is the distance to the equator which we rename into latitude
  2. ti_cpi is Transparency International’s Corruptions Perceptions Index, renamed to institutions_quality (larger values mean better quality institutions, i.e. less corruption)
  3. bti_ds is a score out of 10 for ‘democracy status’ – i.e. the extent to which democratic norms are followed in the country’s political system, renamed to democracy (larger values mean more democratic)

Our dependent variable:

  • wbgi_pve which we rename into political_stability (larger values mean more stability)

We know from previous weeks that there are multiple ways to subset datasets. Here we use select(), to subset and rename the columns.

world_data <- select(
  world_data, # dataset
  country = cname, # keep these variables but change their names
  political_stability = wbgi_pve,
  latitude = lp_lat_abst,
  democracy = bti_ds,
  institutions_quality = ti_cpi
  )

head(world_data)
              country political_stability  latitude democracy
1         Afghanistan          -2.6710539 0.3666667  3.016667
2             Albania           0.3446447 0.4555556  7.050000
3             Algeria          -1.0975257 0.3111111  4.750000
4             Andorra           1.4134194 0.4700000        NA
5              Angola          -0.3158988 0.1366667  4.200000
6 Antigua and Barbuda           0.8786072        NA        NA
  institutions_quality
1                   15
2                   39
3                   34
4                   NA
5                   18
6                   NA

These two steps could have been combined in one ‘pipe’ – to see how, take a look at Going further with the tidyverse: dplyr, magrittr and ggplot2.

The function summary() lets you summarize data sets. We will look at the dataset now. When the dataset is small in the sense that you have few variables (columns) then this is a very good way to get a good overview. It gives you an idea about the level of measurement of the variables and the scale. country, for example, is a character variable as opposed to a number. Countries do not have any order, so the level of measurement is categorical.

If you think about the next variable, political stability, and how one could measure it you know there is an order implicit in the measurement: more or less stability. From there, what you need to know is whether the more or less is ordinal or interval scaled. Checking political_stability you see a range from roughly -3 to 1.5. The variable is numerical and has decimal places. This tells you that the variable is at least interval scaled. You will not see ordinally scaled variables with decimal places. Examine the summaries of the other variables and determine their level of measurement.

summary(world_data)
   country          political_stability    latitude        democracy    
 Length:194         Min.   :-2.916323   Min.   :0.0000   Min.   :1.433  
 Class :character   1st Qu.:-0.642414   1st Qu.:0.1111   1st Qu.:3.725  
 Mode  :character   Median :-0.003028   Median :0.2222   Median :5.700  
                    Mean   :-0.068998   Mean   :0.2572   Mean   :5.551  
                    3rd Qu.: 0.730046   3rd Qu.:0.3778   3rd Qu.:7.112  
                    Max.   : 1.519183   Max.   :0.7222   Max.   :9.950  
                                        NA's   :41       NA's   :58     
 institutions_quality
 Min.   :10.00       
 1st Qu.:29.00       
 Median :38.00       
 Mean   :42.72       
 3rd Qu.:56.75       
 Max.   :90.00       
 NA's   :16          

The variables latitude, institutions_quality, and democracy have 41, 16 and 58 missing values respectively, each marked as NA. Missing values could cause trouble because operations including an NA will produce NA as a result (e.g.: 1 + NA = NA). We will drop these missing values from our data set using the drop_na() function we have used before.

world_data <- drop_na(world_data)

Generally, we want to make sure we drop missing values only from variables that we care about. Here, because there are only missing data on the variables that we care about, we can run drop_na() on the whole dataset and achieve what we want. However, we could also target the specific variables, as below.

world_data <- drop_na(world_data, # dataset
                      latitude, # variables
                      institutions_quality,
                      democracy)

If we summarise the data, we can now see that we have no NAs.

summary(world_data)
   country          political_stability    latitude        democracy    
 Length:106         Min.   :-2.9163     Min.   :0.0000   Min.   :1.433  
 Class :character   1st Qu.:-0.9013     1st Qu.:0.1111   1st Qu.:3.788  
 Mode  :character   Median :-0.3855     Median :0.1944   Median :5.492  
                    Mean   :-0.4505     Mean   :0.2110   Mean   :5.476  
                    3rd Qu.: 0.1426     3rd Qu.:0.3092   3rd Qu.:6.862  
                    Max.   : 1.4958     Max.   :0.5778   Max.   :9.950  
 institutions_quality
 Min.   :10.00       
 1st Qu.:27.00       
 Median :34.50       
 Mean   :35.49       
 3rd Qu.:41.00       
 Max.   :84.00       

The range of the variable latitude is between 0 and 1 (it’s actually 0 to 0.5777778 here but it could be up to 1). The codebook clarifies that the latitude of a country’s capital has been divided by 90 to get a variable that ranges from 0 to 1. This would make interpretation difficult. When interpreting the effect of such a variable a unit change (a change of 1) covers the entire range or put differently, it is a change from a country at the equator to a country at one of the poles.

We therefore multiply by 90 again. This will turn the units of the latitude variable into degrees again which makes interpretation easier.

# transform latitude variable
world_data$latitude <- world_data$latitude * 90

5.1.2 Estimating a Bivariate Regression

Is there a correlation between the distance of a country to the equator and the level of political stability? Both political stability (dependent variable) and distance to the equator (independent variable) are continuous. Therefore, we will get an idea about the relationship using a scatter plot.

p1 <- ggplot(data = world_data, # dataset to work from
             mapping = aes(x = latitude, # x axis
                           y = political_stability)) + # y axis
  geom_point() + # make it a scatterplot
  labs(x = 'Latitude', # x axis label
       y = 'Political stability') + # y axis label
  theme_bw() # clean background

p1

Looking at the cloud of points suggests that there might be a positive relationship: increases in our independent variable latitude appear to be associated with increases in the dependent variable political_stability (the further from the equator, the more stable).

We can fit a line of best fit through the points. To do this we can simply ask ggplot() to add a ‘smooth’ line which is estimated through the linear model of interest (y ~ x, or political_stability ~ latitude).

# add the line
p1 <- p1 + 
  geom_smooth(method = "lm", # 'lm' = 'linear model' - it automatically used your x and y values
              se = FALSE) # don't show confidence interval around the line

p1
`geom_smooth()` using formula = 'y ~ x'

The blue line represents the ‘line of best fit’ or regression line. It is the line that minimises the squared distance of all the points from the line itself - the ‘least squares’ line. It looks like this line shows a very slight positive association between latitude and political stability. The further away from the equator a country is, the higher its level of political stability. But is this actually a systematic relationship? Is there a statistically significant association?

We can assess this by running a regression model (the same one R is running behind the scenes to fit the line above) and looking at its output. We fit such a model, like last week, using lm() and how its output using screenreg().

# bivariate model
latitude_model <- lm(political_stability ~ latitude, data = world_data)

screenreg(latitude_model)

======================
             Model 1  
----------------------
(Intercept)   -0.53 **
              (0.16)  
latitude       0.00   
              (0.01)  
----------------------
R^2            0.00   
Adj. R^2      -0.01   
Num. obs.    106      
======================
*** p < 0.001; ** p < 0.01; * p < 0.05

Thinking back to last week, how can we interpret this regression ouput?

  • The coefficient for the variable latitude (\(\beta_1\)) indicates that a one-unit increase in a country’s latitude is associated with approximately 0.00 increase in the measure of political stability, on average.
  • The coefficient for the (intercept) term (\(\beta_0\)) indicates that the average level of political stability for a country with a latitude of 0 is \(-0.53\) (where latitude = 0 is a country positioned at the equator)
  • The \(R^2\) of the model is approximately 0.00. This implies that none of the variation in the dependent variable (political stability) is explained by the independent variable (latitude) in the model.

So… we don’t seem to have done a very good job of predicting political stability here. Latitude seems to have very little to do with it. That’s OK! Failing to reject the null hypothesis of no relationship between our variables is not actually a failure. It’s always good to remember that finding no relationship can be as informative as finding one, and lots of the best social science research is work that demonstrates there is likely to be no relationship between a predictor and an outcome.

But, remember, we removed a lot of cases with missing values earlier. There might be a lot of cases we removed that actually have no missing data on the latitude variable. What if, when looking at all the data, there actually is a significant association?

world_data_2 <- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/qog_20.csv")


world_data_2 <- 
  select(world_data_2,
         country = cname,
         political_stability = wbgi_pve,
         latitude = lp_lat_abst,
         democracy = bti_ds,
         institutions_quality = ti_cpi)


world_data_2 <- drop_na(world_data_2, latitude) # only drop missings on predictor of interest


world_data_2 <- mutate(world_data_2, latitude = latitude*90) # put latitude on correct scale again

latitude_model_2 <- lm(political_stability ~ latitude, data = world_data_2)

screenreg(latitude_model_2)

=======================
             Model 1   
-----------------------
(Intercept)   -0.48 ***
              (0.13)   
latitude       0.02 ***
              (0.00)   
-----------------------
R^2            0.09    
Adj. R^2       0.08    
Num. obs.    153       
=======================
*** p < 0.001; ** p < 0.01; * p < 0.05

Now, we see that there does appear to be a highly statistically significant association here. Each one unit increase in latitude increases political stability by about 0.02, a relationship which is statistically significant at the .001 level. This shows you the importance of making sure you remove NAs in an appropriate way for your analysis. Because we are only working with a bivariate relationship here, we only need our independent and dependent variables to be free of missing data. When we start running multiple regression models, though, we need to make sure all of the predictors are free from missing data. Or at least, we need to know how much missingness there is.

5.1.3 Multivariate Regression

The regression above suggests that there is a significant association between these variables. We might be able to see why this is the case, but we probably don’t think that it tells a complete picture. At the least, it is unlikely that there is a simple causal link between being further away from the equator and having higher political stability. We could therefore consider other variables to include in our model.

Perhaps it’s not that hypothetically moving further away from the equator would bring a country more political stability, but instead that those countries closer to the equator happen to tend to be less democratic, and this lower level of democracy makes them less politically stable. We can therefore ask if latitude is still associated with political stability while ‘controlling’ for level of democracy. That is, once we know how democratic a country is, does knowing about its latitude tell us any more about its political stability?

We can assess this using a multivariate, or multiple, regression model. To specify a multiple linear regression model, the only thing we need to change is what we pass to the formula argument of the lm() function. In particular, if we wish to add additional explanatory variables, the formula argument will take the following form:

dependent.variable ~ independent.variable.1 + independent.variable.2 ... independent.variable.k

where k indicates the total number of independent variables we would like to include in the model. In the example here, our model would therefore look like the following:

# model with more explanatory variables
multi_model_1 <- lm(political_stability ~ 
                      latitude + democracy,
                    data = world_data_2)

screenreg(multi_model_1)

=======================
             Model 1   
-----------------------
(Intercept)   -1.77 ***
              (0.25)   
latitude       0.00    
              (0.01)   
democracy      0.24 ***
              (0.04)   
-----------------------
R^2            0.26    
Adj. R^2       0.25    
Num. obs.    106       
=======================
*** p < 0.001; ** p < 0.01; * p < 0.05

Remember, political_stability is our dependent variable, as before, and now we have two independent variables: latitude and democracy. Again, just as with the bivariate model, we can view the summarised output of the regression by using screenreg(). It looks like we might be right. When accounting for democracy (which appears to be highly associated with political stability), latitude is not significantly associated with political stability. As we now have two models (a simple regression model, and a multiple regression model), we can join them together using the list() function, and then put all of that inside screenreg().

screenreg(list(latitude_model_2, multi_model_1))

===================================
             Model 1     Model 2   
-----------------------------------
(Intercept)   -0.48 ***   -1.77 ***
              (0.13)      (0.25)   
latitude       0.02 ***    0.00    
              (0.00)      (0.01)   
democracy                  0.24 ***
                          (0.04)   
-----------------------------------
R^2            0.09        0.26    
Adj. R^2       0.08        0.25    
Num. obs.    153         106       
===================================
*** p < 0.001; ** p < 0.01; * p < 0.05

But… look at the Ns of the two models. Notice that when we ran the multivariate model, we used the version of the dataset where only the missing values on latitude were removed. But lm() removed the NAs from democracy, too, by default. This means that the data used here, like the first time we ran the bivariate latitude model, has a large amount of cases removed. The models are therefore not comparable. It could also be this that is once again bringing about the null relationship between latitude and political stability, rather than our modelling strategy. This sort of thing is why it is so important to think about missing data and its potential implications.

Let’s try an example where missing data doesn’t complicate things quite so much. Loading the data from scratch again, we subset to include variables on political stability, the functioning of government, the rule of law, and freedom of expression:

  1. wbgi_pve - Political Stability: measures perceptions of the likelihood of political instability and/or politically motivated violence, including terrorism.
  2. fh_fog - Functioning of Government: measures the policy-making power of elected legislature, including extent of corruption as well as accountability and transparency. 0 (worst) - 12 (best) scale.
  3. fh_rol - Rule of Law: measures the independence of the judiciary, prevalence of rule of law in civil and criminal matters, civic control over police, protection from injustice, etc. 0 (worst) - 16 (best) scale.
  4. fh_feb - Freedom of Expression: measures the freedom and independence of media and cultural expressions, religious groups, academia, etc. 0 (worst) - 16 (best) scale.
qog <- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/qog_20.csv")


qog <-
  select(qog,
         country = cname,
         political_stability = wbgi_pve,
         gov_func = fh_fog,
         rule_law = fh_rol,
         free_exp = fh_feb)

summary(qog)
   country          political_stability    gov_func         rule_law     
 Length:194         Min.   :-2.916323   Min.   : 0.000   Min.   : 0.000  
 Class :character   1st Qu.:-0.642414   1st Qu.: 3.000   1st Qu.: 4.000  
 Mode  :character   Median :-0.003028   Median : 7.000   Median : 8.000  
                    Mean   :-0.068998   Mean   : 6.356   Mean   : 8.098  
                    3rd Qu.: 0.730046   3rd Qu.:10.000   3rd Qu.:12.750  
                    Max.   : 1.519183   Max.   :12.000   Max.   :16.000  
    free_exp    
 Min.   : 0.00  
 1st Qu.: 7.00  
 Median :12.00  
 Mean   :10.89  
 3rd Qu.:15.00  
 Max.   :16.00  

These variables have no missing data: these measures are available for every country in the dataset. We also give this dataset a different name, qog, to prevent confusion.

We are interested here in how democratic indicators are related to political stability. We might hypothesise that more democratic countries tend to be more politically stable. For example, those countries where individuals have greater freedom of expression might have higher political stability because people do not feel the need to resort to more radical, violent forms of expression. Let’s check this.

# visualise relationship
p2 <- ggplot(data = qog, # dataset to work from
             mapping = aes(x = free_exp, # x axis
                           y = political_stability, # y axis
                           group = free_exp)) + # group data on x axis
  geom_boxplot() + # make it a boxplot
  labs(x = "Freedom of expression", # x axis label
       y = "Political Stability") + # y axis label
  theme_bw()

p2

It does indeed appear that higher levels of freedom of expression are associated with higher average levels of political stability. To understand boxplots better, see [Going further with ggplot2]. We can also get a simple correlation coefficient for this relationship, and test for statistical significance.

# correlation
cor(x = qog$free_exp, 
    y = qog$political_stability, 
    use = "complete.obs")
[1] 0.6131883
# statistically significant correlation?
cor.test(x = qog$free_exp, 
         y = qog$political_stability)

    Pearson's product-moment correlation

data:  qog$free_exp and qog$political_stability
t = 10.756, df = 192, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5169708 0.6941044
sample estimates:
      cor 
0.6131883 

There does indeed appear to be a significant association. We can quantify this association more usefully using a regression model.

# regression model
exp_mod <- lm(political_stability ~ free_exp, data = qog)

# summarise results in a table
screenreg(exp_mod)

=======================
             Model 1   
-----------------------
(Intercept)   -1.50 ***
              (0.14)   
free_exp       0.13 ***
              (0.01)   
-----------------------
R^2            0.38    
Adj. R^2       0.37    
Num. obs.    194       
=======================
*** p < 0.001; ** p < 0.01; * p < 0.05

It looks like countries with greater freedom of expression are more politically stable. For each one unit increase in freedom of expression, we expect political stability to increase by approximately 0.13 units, and this relationship is highly statistically significant (p < 0.001). This model explains about 38% of the variance in political stability.

Yet, it could be the case that, even if freedom of expression has a causal effect on political stability, there is more to it than this. Perhaps we are not getting a good estimate of the causal relationship, because we aren’t accounting for confounding. Maybe freedom of expression is guaranteed by the rule of law, and it is the rule of law which brings about political stability. This would mean that rule of law confounds the relationship between freedom of expression and political stability. Given that political stability is defined here as the absence of political violence, the rule of law seems likely to have an effect on its prevalence. If we think that this is likely to be the case, we can run a multiple regression model to control for the rule of law in estimating the effect of freedom of expression on political stability.

# multiple regression model
confound_mod <- lm(political_stability ~ free_exp + rule_law, data = qog)

screenreg(confound_mod)

=======================
             Model 1   
-----------------------
(Intercept)   -1.02 ***
              (0.13)   
free_exp      -0.07 ** 
              (0.02)   
rule_law       0.21 ***
              (0.02)   
-----------------------
R^2            0.59    
Adj. R^2       0.59    
Num. obs.    194       
=======================
*** p < 0.001; ** p < 0.01; * p < 0.05

The results of this are very interesting. First of all, we can see that in this model, both predictors are statistically significantly associated with political stability. We also account for almost 60% of the variance in political stability here.

Moving beyond this, though, we can also see that freedom of expression is now negatively associated with political stability. This tells us that when we know about a country’s degree of rule of law, greater levels of freedom of expression actually lead to less political stability. Were we to set rule of law at a fixed level, varying freedom of expression independently of this would produce a negative effect on political stability.

We can think about this in substantive terms. Imagine fixing the rule of law to a set level in all countries. The judiciaries are all of the same independence, civic control over police is even everywhere, etc. Now, if we allow some of those countries a greater level of freedom of expression than others, those with greater levels of freedom of expression might actually experience more political instability because radical political expressions are not immediately shut down and revolutionary movements are allowed to grow. Remember that in this case, those places with greater freedom of expression do not also have greater rule of law, because we are controlling for this relationship in our model.

We might, similarly, assert theoretically that the functioning of government would have a similar effect on the relationship between freedom of expression and political stability. We can control for this, too, by simply adding it in.

# regression model
confound_mod_2 <- lm(political_stability ~ free_exp + rule_law + gov_func, data = qog)

screenreg(confound_mod_2)

=======================
             Model 1   
-----------------------
(Intercept)   -1.02 ***
              (0.13)   
free_exp      -0.07 ** 
              (0.02)   
rule_law       0.22 ***
              (0.03)   
gov_func      -0.01    
              (0.04)   
-----------------------
R^2            0.59    
Adj. R^2       0.59    
Num. obs.    194       
=======================
*** p < 0.001; ** p < 0.01; * p < 0.05

We see that this changes very little. The estimates for freedom of expression and rule of law remain unchanged, as does the amount of variance explained. (This will actually have increased very slightly, but not at the two decimal place level.) This does not mean, however, that we should remove functioning of government from our model. This depends on your goal, of course, but generally variables should be included for theoretical reasons and not because of relationships we observe empirically in the data.

What we have done here is take a principled approach to multiple regression. Unfortunately, it is fairly common to include more and more variables in regression models simply to ‘control’ for everything. But this is bad practice. We had theoretical reasons to introduce additional variables, and doing so allowed us to come up with potentially interesting, useful inferences about how political stability might vary across democracies. We could add more variables to these models, but we should not just do this for the sake of it. If nothing else, interpreting models becomes more difficult when they become more complex.

5.1.4 Joint Significance Test (F-statistic)

Whenever you add variables to your model, you will explain more of the variance in the dependent variable. That means, using your data, your model will better predict outcomes. We would like to know whether the difference (the added explanatory power) is statistically significant. The null hypothesis is that the added explanatory power is zero and the p-value gives us the probability of observing such a difference as the one we actually computed assuming that null hypothesis (no difference) is true.

The F-test is a joint hypothesis test that lets us compute that p-value. Two conditions must be fulfilled to run an F-test:

Conditions for F-test model comparison
Both models must be estimated from the same sample! If your added variables contain lots of missing values and therefore your n (number of observations) are reduced substantially, you are not estimating from the same sample.
The models must be nested. That means, the model with more variables must contain all of the variables that are also in the model with fewer variables.

We specify two models: a restricted model and an unrestricted model. The restricted model is the one with fewer variables. The unrestricted model is the one including the extra variables. We say restricted model because we are “restricting” it to NOT depend on the extra variables. Once we estimated those two models we compare the residual sum of squares (RSS). The RSS is the sum over the squared deviations from the regression line and that is the unexplained error. The restricted model (fewer variables) is always expected to have a larger RSS than the unrestricted model. Notice that this is same as saying: the restricted model (fewer variables) has less explanatory power.

We test whether the reduction in the RSS is statistically significant using a distribution called “F distribution”. If it is, the added variables are jointly (but not necessarily individually) significant. You do not need to know how to calculate p-values from the F distribution, as we can use the anova() function in R to do this for us.

anova(exp_mod, confound_mod_2)
Analysis of Variance Table

Model 1: political_stability ~ free_exp
Model 2: political_stability ~ free_exp + rule_law + gov_func
  Res.Df     RSS Df Sum of Sq     F    Pr(>F)    
1    192 116.324                                 
2    190  75.623  2    40.701 51.13 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see from the output, the p-value here is very small, which means that we can reject the null hypothesis that the unrestricted model has no more explanatory power than the restricted model.

5.1.5 Predicting outcome conditional on institutional quality

Just as we did with the simple regression model last week, we can use the fitted model object to calculate the fitted values of our dependent variable for different values of our explanatory variables. To do so, we again use the predict() function.

We proceed in three steps.

  1. We set the values of the covariates for which we would like to produce fitted values.
    • You will need to set covariate values for every explanatory variable that you included in your model.
    • As we are primarily studying the effect of freedom of expression on political stability, it is mainly the level of this variable that we are most interested in.
    • Therefore, we will calculate fitted values over the range of free_exp, while setting the values of rule_law and gov_func to their median values (they are ordinal so we shouldn’t use the mean).
  2. We calculate the fitted values.
  3. We report the results (here we will produce a plot).

For step one, the following code produces a data.frame of new covariate values for which we would like to calculate a fitted value from our model:

# set the values for the explanatory variables
data_for_fitted_values <- data.frame(free_exp = seq(from = 0, to = 16, by = 1),
                                     # all possible levels of freedom of expression
                                     rule_law = median(qog$rule_law),
                                     # median level of rule of law
                                     gov_func = median(qog$gov_func)
                                     # median level of functioning of govt
                                     )

head(data_for_fitted_values)
  free_exp rule_law gov_func
1        0        8        7
2        1        8        7
3        2        8        7
4        3        8        7
5        4        8        7
6        5        8        7

As you can see, this has produced a data.frame in which every observation has a different value of free_exp but the same value for rule_law and gov_func.

We can now calculate the fitted values for each of these combinations of our explanatory variables by passing the data_for_fitted_values object to the newdata argument of the predict() function.

# calculate the fitted values
fitted_values <- predict(confound_mod_2, # model to use
                         newdata = data_for_fitted_values) # new data with fixed variable levels

# save the fitted values as a new variable in the data_for_fitted_values object
data_for_fitted_values$fitted_values <- fitted_values 

head(data_for_fitted_values)
  free_exp rule_law gov_func fitted_values
1        0        8        7     0.6359869
2        1        8        7     0.5686409
3        2        8        7     0.5012949
4        3        8        7     0.4339490
5        4        8        7     0.3666030
6        5        8        7     0.2992571

Now, for each of our explanatory variable combinations, we have the corresponding fitted values as calculated from our estimated regression.

Finally, we can plot these values:

p3 <- ggplot(data = data_for_fitted_values, # dataset to work from
             mapping = aes(x = free_exp, # x axis
                           y = fitted_values)) + # y axis
  geom_line() + # geom_line() will produce a line plot, rather than the scatter plot
  labs(x =  "Freedom of expression", # x axis label
       y = "Fitted value for political stability") + # y axis label
  theme_bw() # clean background

p3

We can see here exactly what we would expect from our regression models. Once we control for the effect of the rule of law, freedom of expression actually has a negative effect on political stability. There is a clear negative relationship here.

However, our plot as it is definitely makes this relationship look a bit too clear and a bit too certain. It is better to try to show the uncertainty of analyses like this, and this is what is usually done in real quantitative research. We can do this by calculating the confidence interval for our fitted values and plotting this too.

# calculate the fitted values
fitted_values <- predict(confound_mod_2, # model to base it on
                         newdata = data_for_fitted_values, # dataset with fixed variable levels
                         interval = 'confidence' # get confidence interval
                         )

# attach the fitted values and their confidence intervals to the dataset
data_for_fitted_values <- cbind(data_for_fitted_values, 
                                fitted_values) # combine original dataset with fitted values and upper and lower bounds of confidence interval

# plot 
p4 <- ggplot(data_for_fitted_values, # dataset to work from
             aes(x = free_exp, # x axis
                 y = fit)) + # y axis
  geom_pointrange(aes(ymin = lwr, # add estimate and intervals for each level of free_exp
                      ymax = upr)) + # ymin is the lower bound, ymax upper bound
  geom_line() + # add line through all fitted values
  geom_ribbon(aes(ymin = lwr, # add shaded area to highlight confidence interval
                  ymax = upr), # ymin is the lower bound, ymax upper bound
                  alpha = 0.2) + # make the shaded area very transparent
  labs(title = "Fitted values and 95% CIs of political stability by freedom of expression", # plot title
       subtitle = "Estimated when controlling for functioning of government and rule of law", # plot subtitle
       x = "Freedom of expression", # x axis label
       y = "Fitted value for political stability") + # y axis label
  theme_bw() # clean background

p4

This type of plot is very common in published research, because it gives us a clear idea of the nature of the relationship and the uncertainty in our estimate of it. The points show each fitted value, and the range of the vertical lines (and the shaded area) show the uncertainty of these estimates as a 95% confidence interval.

As well as getting a full range of predictions in this way, we can also ask our model for predictions for certain specific combinations of levels of our predictors. All we have to do is change the data we pass to the predict function. Say we wanted to know what level of political stability is predicted when freedom of expression is high at 14, rule of law is low at 2, and functioning of government is also low at 2. We can simply specify this in the predict() call. The result tells us what level of political stability we would expect, based on our model, for these levels of our predictors.

# calculate the fitted values
fitted_values_2 <- predict(confound_mod_2, 
                           newdata = data.frame(
                             free_exp = 14,
                             rule_law = 2,
                             gov_func = 2),
                           interval = 'confidence')

head(fitted_values_2)
        fit       lwr      upr
1 -1.552916 -1.969072 -1.13676

We see here that political stability is very low. Let’s try another combination of values.

# calculate the fitted values
fitted_values_3 <- predict(confound_mod_2,
                           newdata = data.frame(
                             free_exp = 6,         
                             rule_law = 12,      
                             gov_func = 8), 
                           interval = 'confidence')

head(fitted_values_3)
       fit       lwr      upr
1 1.088947 0.7074011 1.470493

Here, unsurprisingly, political stability is much higher.

5.1.6 Additional Resources

5.1.7 Homework exercises

  1. Pretend that you are writing a research paper. Select one dependent variable that you want to explain with a statistical model. You might want to consult the QoG codebook on the website. Run a simple regression (with one explanatory variable) on that dependent variable and justify your choice of explanatory variable. This will require you to think about the outcome that you are trying to explain, and what you think will be a good predictor for that outcome.
  2. Add an additional explanatory variable to your model and justify the choice again. See how the results change. Carry out the F-test to check whether the new model explains significantly more variance than the old model.
  3. Calculate fitted values for your dependent variable whilst varying at least one of your continuous explanatory variables, and plot the result.