10.2 Solutions
<- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/chechen.csv")
chechen
<- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/who_voted_nazi_in_1932.csv")
nazi
<- read.csv("https://raw.githubusercontent.com/QMUL-SPIR/Public_files/master/datasets/MA_Schools.csv") ma
10.2.0.1 Exercise 1
Run a bivariate regression model with diff as the dependent variable and treat as the predictor.
# bivariate model
m1 <- lm(diff ~ treat, data = chechen)
screenreg(m1)
======================
Model 1
----------------------
(Intercept) -0.10
(0.12)
treat -0.52 **
(0.17)
----------------------
R^2 0.03
Adj. R^2 0.02
Num. obs. 318
======================
*** p < 0.001; ** p < 0.01; * p < 0.05
The effect of Russian artillery shelling is significant and negative. Therefore, it seems that indiscriminate violence in this case reduces insurgent violence. There are many potential confounders, however, that we will control for in the next exercise.
The interpretation of the magnitude of the effect is as follows: the dependent variable is the change in the number of insurgent attacks, and the independent variable is binary. Villages that were bombarded by Russian artillery therefore saw, on average, about half an attack (0.52) fewer than villages that were not.
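Since the predictor is binary, the bivariate slope is simply the difference in group means. A quick check (assuming treat is coded 0/1):

# difference in mean change in attacks between shelled and non-shelled villages
mean(chechen$diff[chechen$treat == 1]) - mean(chechen$diff[chechen$treat == 0])
# reproduces the treat coefficient of about -0.52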
10.2.0.2 Exercise 2
Control for additional variables. Discuss your choice of variables and refer to omitted variable bias.
We control for the number of attacks before the village was bombarded (pret). We reason that villages that already saw more insurgent activity were more likely to be bombarded, and might see more insurgent activity after the bombardment as well. Furthermore, we control for logged elevation (lelev), because attacks may be more likely in remote, mountainous terrain and it may have been harder for the Russians to control such areas. By a similar argument we control for groznyy and lpop2000: urban areas may see more insurgent activity and are harder to control. We also control for swept, to account for the possibility that previous sweeps in the area reduced the likelihood of subsequent shelling and of insurgent activity. Finally, we include district fixed effects, factor(disno), to absorb unobserved differences between districts. Leaving out any such variable that is correlated with both treat and diff would cause omitted variable bias in our estimate of the shelling effect.
# multivariate model
m2 <- lm(diff ~
           treat +
           pret +
           lelev +
           groznyy +
           lpop2000 +
           swept +
           factor(disno),
         data = chechen)
screenreg(list(m1, m2))
======================================
Model 1 Model 2
--------------------------------------
(Intercept) -0.10 -0.93
(0.12) (1.64)
treat -0.52 ** -0.39 *
(0.17) (0.16)
pret -0.28 ***
(0.03)
lelev 0.07
(0.24)
groznyy 2.03 **
(0.67)
lpop2000 0.13
(0.07)
swept -0.02
(0.19)
factor(disno)3 -0.14
(0.52)
factor(disno)4 0.44
(0.66)
factor(disno)5 -0.09
(1.11)
factor(disno)6 0.08
(0.59)
factor(disno)7 -0.37
(1.41)
factor(disno)8 -0.74
(0.78)
factor(disno)9 -0.14
(0.68)
factor(disno)10 -0.49
(0.51)
factor(disno)11 -0.22
(0.60)
factor(disno)14 0.04
(0.53)
factor(disno)15 -0.14
(0.56)
--------------------------------------
R^2 0.03 0.32
Adj. R^2 0.02 0.28
Num. obs. 318 318
======================================
*** p < 0.001; ** p < 0.01; * p < 0.05
10.2.0.3 Exercise 3
Discuss the improvement in model fit and use the appropriate test statistic.
# f-test
anova(m1, m2)
Analysis of Variance Table
Model 1: diff ~ treat
Model 2: diff ~ treat + pret + lelev + groznyy + lpop2000 + swept + factor(disno)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 316 751.99
2 300 526.42 16 225.57 8.0342 5.841e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We increase adjusted R^2 substantially. While R^2 never decreases when we add covariates, adjusted R^2 can. We use the F test to check whether the added covariates are jointly significant - that is, whether the improvement in the explanatory power of our model could be due to chance. We reject the null hypothesis that both models do equally well at explaining the outcome: our larger model improves model fit.
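As a sketch of what anova() computes under the hood, the F statistic can be reproduced from the two residual sums of squares:

# manual F test: improvement in fit per added coefficient, relative to the
# residual variance of the full model
rss1 <- sum(residuals(m1)^2)          # RSS of the restricted model
rss2 <- sum(residuals(m2)^2)          # RSS of the full model
q <- m1$df.residual - m2$df.residual  # number of added coefficients (16)
f_stat <- ((rss1 - rss2) / q) / (rss2 / m2$df.residual)
f_stat                                # matches the F value above (~8.03)
pf(f_stat, q, m2$df.residual, lower.tail = FALSE) # corresponding p-value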
10.2.0.4 Exercise 4
Check for non-linearity and, if there is any, solve the problem.
# create residuals variable in the dataset
chechen$residuals <- m2$residuals

# check residual plot against the continuous variable
plot(residuals ~ pret, data = chechen)
We only have one continuous variable in our model that has not already been transformed to deal with non-linearity: the number of insurgent attacks before the shelling, pret. The residual plot does not reveal a non-linear relationship.
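Had the plot shown a curved pattern, one common remedy would be a polynomial term. Purely for illustration (it is not needed for these data), such a model could look like this:

# illustration only: a quadratic in pret would relax the linearity assumption
m2.quad <- lm(diff ~ treat + poly(pret, 2) + lelev + groznyy + lpop2000 +
                swept + factor(disno), data = chechen)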
10.2.0.5 Exercise 5
Check for heteroskedasticity and, if there is any, solve the problem.
library(lmtest) # contains the BP test function
bptest(m2) # null hypothesis is constant error variance, which we reject
studentized Breusch-Pagan test
data: m2
BP = 78.61, df = 17, p-value = 6.764e-10
library(sandwich) # contains the sandwich estimator for robust SEs

# estimate model 2 again with corrected standard errors
m3 <- coeftest(m2, vcov. = vcovHC(m2, type = "HC1"))
# compare models
screenreg(list(m2,m3))
======================================
Model 1 Model 2
--------------------------------------
(Intercept) -0.93 -0.93
(1.64) (1.25)
treat -0.39 * -0.39 *
(0.16) (0.15)
pret -0.28 *** -0.28 ***
(0.03) (0.04)
lelev 0.07 0.07
(0.24) (0.20)
groznyy 2.03 ** 2.03 *
(0.67) (1.02)
lpop2000 0.13 0.13 *
(0.07) (0.05)
swept -0.02 -0.02
(0.19) (0.23)
factor(disno)3 -0.14 -0.14
(0.52) (0.67)
factor(disno)4 0.44 0.44
(0.66) (0.82)
factor(disno)5 -0.09 -0.09
(1.11) (0.75)
factor(disno)6 0.08 0.08
(0.59) (0.77)
factor(disno)7 -0.37 -0.37
(1.41) (0.67)
factor(disno)8 -0.74 -0.74
(0.78) (0.84)
factor(disno)9 -0.14 -0.14
(0.68) (0.72)
factor(disno)10 -0.49 -0.49
(0.51) (0.70)
factor(disno)11 -0.22 -0.22
(0.60) (0.73)
factor(disno)14 0.04 0.04
(0.53) (0.71)
factor(disno)15 -0.14 -0.14
(0.56) (0.71)
--------------------------------------
R^2 0.32
Adj. R^2 0.28
Num. obs. 318
======================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Heteroskedasticity-robust standard errors do not change our findings. Some of the robust standard errors (e.g., for treat) are actually smaller than those estimated under the constant error variance assumption, while others (e.g., for groznyy) are larger. The effect of treat remains significant either way; to be conservative, we would report the larger of the two standard errors.
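If we also wanted confidence intervals under the same correction, coefci() from lmtest accepts the robust variance matrix as well. A minimal sketch:

# 95% confidence interval for treat using HC1 robust standard errors
coefci(m2, "treat", vcov. = vcovHC(m2, type = "HC1"))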
10.2.0.6 Exercise 6
Thoroughly compare the models, and interpret the outcome in substantive as well as statistical terms - do not forget to talk about model fit. What are the implications of this?
# compare models
screenreg(list(m1,m2,m3))
=================================================
Model 1 Model 2 Model 3
-------------------------------------------------
(Intercept) -0.10 -0.93 -0.93
(0.12) (1.64) (1.25)
treat -0.52 ** -0.39 * -0.39 *
(0.17) (0.16) (0.15)
pret -0.28 *** -0.28 ***
(0.03) (0.04)
lelev 0.07 0.07
(0.24) (0.20)
groznyy 2.03 ** 2.03 *
(0.67) (1.02)
lpop2000 0.13 0.13 *
(0.07) (0.05)
swept -0.02 -0.02
(0.19) (0.23)
factor(disno)3 -0.14 -0.14
(0.52) (0.67)
factor(disno)4 0.44 0.44
(0.66) (0.82)
factor(disno)5 -0.09 -0.09
(1.11) (0.75)
factor(disno)6 0.08 0.08
(0.59) (0.77)
factor(disno)7 -0.37 -0.37
(1.41) (0.67)
factor(disno)8 -0.74 -0.74
(0.78) (0.84)
factor(disno)9 -0.14 -0.14
(0.68) (0.72)
factor(disno)10 -0.49 -0.49
(0.51) (0.70)
factor(disno)11 -0.22 -0.22
(0.60) (0.73)
factor(disno)14 0.04 0.04
(0.53) (0.71)
factor(disno)15 -0.14 -0.14
(0.56) (0.71)
-------------------------------------------------
R^2 0.03 0.32
Adj. R^2 0.02 0.28
Num. obs. 318 318
=================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
The goal of the analysis was to test whether indiscriminate violence leads to more resistance. The variable of interest is, therefore, treat, which indicates whether a Chechen village was bombarded or not. In our initial model, we found a significant and negative effect. This contradicts the theory, because the effect seems to go the other way.
We were concerned about confounding and included a number of variables in the model. We controlled for terrain, geographic areas of the country, urban areas, attacks before the artillery bombardment, and prior sweeps. The change in the coefficient is large: the difference is \(-0.52 - (-0.39) = -0.13\), which is 25% of the original effect. Although the effect decreased substantially when we controlled for confounders, it is still significant. The difference between villages that were bombarded and those that were not is \(0.39\) attacks on average.
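The change in the coefficient can be extracted directly from the model objects:

# difference between the bivariate and the multivariate estimate of treat
coef(m1)["treat"] - coef(m2)["treat"] # about -0.13, i.e. 25% of the original effect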
We are not substantively interested in the control variables. According to the model with the larger standard errors, pret and groznyy are also significant. Insurgent attacks were more numerous in Groznyy. Interestingly, the more attacks there were before the bombardment, the fewer attacks happened afterwards.
Model fit improves substantially, from \(0.03\) in our first model to \(0.32\). Although the second model explains the outcome better, that is not what we care about most: we are interested in the effect of indiscriminate violence. We prefer the larger model because we control for potential confounders, making our estimate less of an overestimate.
10.2.0.7 Exercise 8
Run a bivariate model to test whether blue-collar voters voted for the Nazis.
# summary stats
summary(nazi)
shareself shareblue sharewhite sharedomestic
Min. : 5.799 Min. :11.44 Min. : 2.205 Min. : 3.721
1st Qu.:14.716 1st Qu.:25.26 1st Qu.: 6.579 1st Qu.:14.410
Median :18.196 Median :30.35 Median : 8.751 Median :26.230
Mean :18.589 Mean :30.82 Mean :11.423 Mean :25.443
3rd Qu.:22.186 3rd Qu.:35.87 3rd Qu.:13.577 3rd Qu.:35.250
Max. :31.340 Max. :54.14 Max. :51.864 Max. :52.639
shareunemployed nvoter nazivote sharenazis
Min. : 1.958 Min. : 5969 Min. : 2415 Min. :12.32
1st Qu.: 7.456 1st Qu.: 23404 1st Qu.: 9196 1st Qu.:33.96
Median :11.934 Median : 35203 Median : 15708 Median :41.47
Mean :13.722 Mean : 65118 Mean : 25193 Mean :41.57
3rd Qu.:18.979 3rd Qu.: 67606 3rd Qu.: 27810 3rd Qu.:48.61
Max. :39.678 Max. :932831 Max. :318746 Max. :74.09
m1 <- lm(sharenazis ~ shareblue, data = nazi)
screenreg(m1)
=======================
Model 1
-----------------------
(Intercept) 39.56 ***
(1.66)
shareblue 0.07
(0.05)
-----------------------
R^2 0.00
Adj. R^2 0.00
Num. obs. 681
=======================
*** p < 0.001; ** p < 0.01; * p < 0.05
10.2.0.8 Exercise 9
Our dataset does not contain much information. We only know the shares of different groups in each district: white-collar workers, blue-collar workers, the self-employed, the domestically employed, and the unemployed. We control for these variables.
A model that includes only blue-collar workers omits information about the social make-up of the district that is necessarily related to the share of blue-collar workers. For example, if a district consists mostly of white-collar workers, there are fewer blue-collar workers, all else equal. If we only have blue-collar workers in our model, we do not know whether most other people are unemployed, white-collar workers, and so on.
m2 <- lm(sharenazis ~ shareblue + sharewhite + shareself + sharedomestic, data = nazi)
screenreg(m2)
=========================
Model 1
-------------------------
(Intercept) -2.82
(7.01)
shareblue 0.57 ***
(0.09)
sharewhite 0.31 *
(0.13)
shareself 1.14 ***
(0.19)
sharedomestic 0.08
(0.10)
-------------------------
R^2 0.11
Adj. R^2 0.10
Num. obs. 681
=========================
*** p < 0.001; ** p < 0.01; * p < 0.05
10.2.0.9 Exercise 10
Can you fit a model using shareblue, sharewhite, shareself, sharedomestic, and shareunemployed as predictors? Why or why not?
Such a model would suffer from perfect multicollinearity; whichever variable we entered last into our model would be dropped automatically. If we add up the shares of the five groups, we get 100%. Thus, if we know the shares of four groups in a district, we necessarily know the share of the remaining group.
For example, if shareblue = 30, shareunemployed = 15, sharedomestic = 25, and shareself = 20, the four groups together make up 90% of the district. The remaining 10% must be white-collar workers.
Therefore, adding shareblue, shareunemployed, sharedomestic, and shareself to our model already gives us the information about how many white-collar workers there are in a district. Were we to add the variable sharewhite, we would enter the same information twice - leading to perfect multicollinearity.
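We can demonstrate this in R. Assuming the five shares sum to exactly 100 in every district (they may not, due to rounding), lm() cannot estimate all five coefficients and reports the last variable entered as NA:

# illustration: under perfect multicollinearity R drops the aliased term
m.collinear <- lm(sharenazis ~ shareblue + sharewhite + shareself +
                    sharedomestic + shareunemployed, data = nazi)
summary(m.collinear) # shareunemployed is NA: "not defined because of singularities"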
10.2.0.10 Exercise 11
Check for heteroskedasticity and, if there is any, solve the problem.
# tests on our models
bptest(m1)
studentized Breusch-Pagan test
data: m1
BP = 18.578, df = 1, p-value = 1.631e-05
bptest(m2)
studentized Breusch-Pagan test
data: m2
BP = 95.811, df = 4, p-value < 2.2e-16
# robust SE's
m1.robust <- coeftest(m1, vcov. = vcovHC(m1))
m2.robust <- coeftest(m2, vcov. = vcovHC(m2))
# compare standard errors
screenreg(list(m1, m1.robust, m2, m2.robust),
custom.model.names = c("m1", "m1 robust", "m2", "m2 robust"))
===========================================================
m1 m1 robust m2 m2 robust
-----------------------------------------------------------
(Intercept) 39.56 *** 39.56 *** -2.82 -2.82
(1.66) (1.68) (7.01) (6.28)
shareblue 0.07 0.07 0.57 *** 0.57 ***
(0.05) (0.05) (0.09) (0.08)
sharewhite 0.31 * 0.31 **
(0.13) (0.11)
shareself 1.14 *** 1.14 ***
(0.19) (0.23)
sharedomestic 0.08 0.08
(0.10) (0.11)
-----------------------------------------------------------
R^2 0.00 0.11
Adj. R^2 0.00 0.10
Num. obs. 681 681
===========================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Correcting for heteroskedasticity does not change our results.
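As an aside, texreg can display the robust standard errors and p-values within a single table via its override arguments. A sketch:

# put robust SEs and p-values straight into the screenreg table
rob1 <- coeftest(m1, vcov. = vcovHC(m1))
rob2 <- coeftest(m2, vcov. = vcovHC(m2))
screenreg(list(m1, m2),
          override.se = list(rob1[, 2], rob2[, 2]),
          override.pvalues = list(rob1[, 4], rob2[, 4]))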
10.2.0.11 Exercise 12
Interpret your models.
screenreg(list(m1,m2))
=====================================
Model 1 Model 2
-------------------------------------
(Intercept) 39.56 *** -2.82
(1.66) (7.01)
shareblue 0.07 0.57 ***
(0.05) (0.09)
sharewhite 0.31 *
(0.13)
shareself 1.14 ***
(0.19)
sharedomestic 0.08
(0.10)
-------------------------------------
R^2 0.00 0.11
Adj. R^2 0.00 0.10
Num. obs. 681 681
=====================================
*** p < 0.001; ** p < 0.01; * p < 0.05
A prominent theory suggests that the Nazis received a lot of their support from blue-collar workers. In model 1, we include only the percentage of blue-collar workers in a district. The variable is insignificant, but we are likely suffering from omitted variable bias, because we omit other aspects of the social make-up of the district that are related to the share of blue-collar workers.
We omit one category from our models to avoid perfect multicollinearity. In model 2, it seems that blue-collar workers did support the Nazis - more so than white-collar workers. The Nazis received most of their support from the self-employed, though.
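To get a sense of the magnitudes, we can predict vote shares for hypothetical district compositions (the values below are made up purely for illustration):

# predicted Nazi vote share for two hypothetical districts: one dominated by
# the self-employed, one by blue-collar workers
hypo <- data.frame(shareblue     = c(20, 40),
                   sharewhite    = c(10, 10),
                   shareself     = c(35, 15),
                   sharedomestic = c(20, 20))
predict(m2, newdata = hypo) # the self-employed-heavy profile gets the higher share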
10.2.0.12 Exercise 13
Fit a model to explain the relationship between percentage of English learners and average 8th grade scores.
m1 <- lm(score8 ~ english, data = ma)
screenreg(m1)
=======================
Model 1
-----------------------
(Intercept) 702.99 ***
(1.44)
english -3.64 ***
(0.44)
-----------------------
R^2 0.28
Adj. R^2 0.28
Num. obs. 180
=======================
*** p < 0.001; ** p < 0.01; * p < 0.05
10.2.0.13 Exercise 14
Plot the model and the regression line.
ggplot(ma, aes(english, score8)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 40 rows containing non-finite values (stat_smooth).
Warning: Removed 40 rows containing missing values (geom_point).
10.2.0.14 Exercise 15
Check for correlation between the variables listed above to see if there are other variables that could be included in the model.
ma <- ma %>%
  select(score8, stratio, english, lunch, income)
summary(ma)
score8 stratio english lunch
Min. :641.0 Min. :11.40 Min. : 0.0000 Min. : 0.40
1st Qu.:685.0 1st Qu.:15.80 1st Qu.: 0.0000 1st Qu.: 5.30
Median :698.0 Median :17.10 Median : 0.0000 Median :10.55
Mean :698.4 Mean :17.34 Mean : 1.1177 Mean :15.32
3rd Qu.:712.0 3rd Qu.:19.02 3rd Qu.: 0.8859 3rd Qu.:20.02
Max. :747.0 Max. :27.00 Max. :24.4939 Max. :76.20
NA's :40
income
Min. : 9.686
1st Qu.:15.223
Median :17.128
Mean :18.747
3rd Qu.:20.376
Max. :46.855
# remove NA's
ma <- drop_na(ma)
# check correlation
cor(ma)
score8 stratio english lunch income
score8 1.0000000 -0.3164449 -0.5312991 -0.8337520 0.7765043
stratio -0.3164449 1.0000000 0.2126209 0.2875544 -0.2616973
english -0.5312991 0.2126209 1.0000000 0.6946009 -0.2510956
lunch -0.8337520 0.2875544 0.6946009 1.0000000 -0.5738153
income 0.7765043 -0.2616973 -0.2510956 -0.5738153 1.0000000
10.2.0.15 Exercise 16
Both lunch and income are strongly correlated with the response variable, and lunch is also strongly correlated with the percentage of English learners. We add both to our model.
m2 <- lm(score8 ~ english + lunch + income, data = ma)
10.2.0.16 Exercise 17
Compare the two models. Did the first model suffer from omitted variable bias?
screenreg(list(m1,m2))
===================================
Model 1 Model 2
-----------------------------------
(Intercept) 702.99 *** 678.59 ***
(1.44) (3.45)
english -3.64 *** -0.25
(0.44) (0.31)
lunch -0.72 ***
(0.07)
income 1.70 ***
(0.15)
-----------------------------------
R^2 0.28 0.83
Adj. R^2 0.28 0.83
Num. obs. 180 180
===================================
*** p < 0.001; ** p < 0.01; * p < 0.05
The comparison between models 1 and 2 shows that the children's performance is related to poverty rather than language: once lunch and income are included, the coefficient of english shrinks towards zero and is no longer significant. The two are, however, correlated.
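We can read the overlap directly off the data:

# correlation between english and lunch
cor(ma$english, ma$lunch) # about 0.69, as shown in the correlation matrix above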
10.2.0.17 Exercise 18
Plot the residuals against the fitted values from the second model to visually check for heteroskedasticity.
ma$residuals <- m2$residuals  # mistakes
ma$fitted <- m2$fitted.values # predictions

ggplot(ma, aes(fitted, residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)
The plot looks OK. I would not conclude that the model violates the constant error variance assumption.
10.2.0.18 Exercise 19
Run the Breusch-Pagan Test to see whether the model suffers from heteroskedastic errors.
bptest(m2) # we cannot reject the null -> yay :)
studentized Breusch-Pagan test
data: m2
BP = 2.3913, df = 3, p-value = 0.4953
The Breusch-Pagan test confirms that the constant error variance assumption is not violated.