Chapter 10 Panel Data - Fixed Effects and some Random Effects

10.1 Seminar – in class exercises

We start by downloading the resource curse data, saving it to our working directory (the same folder that contains your R project), and loading the dataset into R. Have a peek at the data using str().

resource <- read.csv("resourcecurse.csv")

str(resource)
'data.frame':   876 obs. of  10 variables:
 $ country     : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ countrycode : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ year        : int  1996 1998 2000 2002 2003 2004 2005 2006 2007 2008 ...
 $ aid         : num  NA NA NA 1.15 1.21 ...
 $ oil         : chr  ".." ".." ".." "0.010029386249079" ...
 $ gdp.capita  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ institutions: num  -2.06 -2.09 -2.13 -1.75 -1.58 ...
 $ polity2     : int  -7 -7 -7 NA NA NA NA NA NA NA ...
 $ population  : int  17822884 18863999 20093756 21979923 23064851 24118979 25070798 25893450 26616792 27294031 ...
 $ mortality   : num  106 104 104 103 104 ...

The oil variable is coded as a character variable (see the str() output above) but it should be numeric. Missing values are coded as “..”. Let’s turn them into proper missing values (NAs) and then convert the variable to numeric. For this we load the naniar package again, which we used in week 2 for dealing with missingness.

library(naniar)

# recode missings
resource <- replace_with_na(resource, # dataset
                         replace = list(oil = "..")) # condition for replacement

# convert to numeric
resource$oil <- as.numeric(resource$oil)
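
It is good practice to check that the conversion worked (a quick sketch):

# sanity check: oil should now be numeric and the ".." entries should be NA
class(resource$oil)
sum(is.na(resource$oil))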

To estimate panel data models, we need to install the plm package. You only need to do this once.

install.packages("plm")
package 'plm' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\ksenia\AppData\Local\Temp\RtmpeGpFvS\downloaded_packages

Every time we want to use the package (i.e., whenever we start a new R session), we load the plm library like so:

library(plm)

We log-transform gdp per capita and population size.

resource$log.gdp <- log(resource$gdp.capita)
resource$log.pop <- log(resource$population)

# we could also use mutate() from dplyr here
library(dplyr)

resource <- mutate(
  resource,
  log.gdp = log(gdp.capita),
  log.pop = log(population)
)

10.1.1 Our data

Variable      Description
country       country name
countrycode   3-letter country abbreviation
year          year of observation
aid           net aid flow (in per cent of GDP)
oil           oil rents (in per cent of GDP)
gdp.capita    GDP per capita in constant 2000 US dollars
institutions  World Governance Indicator index for quality of institutions
polity2       Polity IV project index
population    population size
mortality     mortality rate (per 1,000 live births)

We test the rentier states theory and the resource curse that we discussed in the lecture. It states that rentier capitalism can be a curse at the systemic level. States that extract rents from easily lootable resources instead of taxing their people develop institutions that become unresponsive to their citizens and provide fewer public goods. North and Weingast (academic heroes), for instance, relate the advent of democracy in Britain to the struggle for property rights.

10.1.2 Unit fixed effects (country fixed effects)

In class, our first fixed effects model was called m3. It was the unit fixed effects model. Recall that the unit fixed effects model is the same as including dummy variables for all countries except the baseline country. We therefore control for all potential confounders that vary across countries but are constant over time (e.g., the colonial heritage of a country).

# run fixed effects model
m3 <- plm(
  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality,
  data = resource,
  index = c("country", "year"),
  model = "within",
  effect = "individual"
  )

# model output
summary(m3)
Oneway (individual) effect Within Model

Call:
plm(formula = institutions ~ oil + aid + log.gdp + polity2 + 
    log.pop + mortality, data = resource, effect = "individual", 
    model = "within", index = c("country", "year"))

Unbalanced Panel: n = 58, T = 1-12, N = 672

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-0.387491 -0.058907 -0.001577  0.059838  0.385401 

Coefficients:
             Estimate  Std. Error t-value  Pr(>|t|)    
oil       -0.00219906  0.00140117 -1.5694  0.117064    
aid        0.00209884  0.00098457  2.1317  0.033429 *  
log.gdp    0.19220799  0.03235794  5.9401 4.796e-09 ***
polity2    0.01597425  0.00270244  5.9110 5.668e-09 ***
log.pop   -0.19811667  0.07083938 -2.7967  0.005326 ** 
mortality  0.00809087  0.00155842  5.1917 2.845e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    8.8269
Residual Sum of Squares: 7.361
R-Squared:      0.16608
Adj. R-Squared: 0.079665
F-statistic: 20.1804 on 6 and 608 DF, p-value: < 2.22e-16
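
If you want to convince yourself that the within model is indeed equivalent to including country dummies, you can fit the same specification with lm() and factor(country) and compare the slope estimates (a quick sketch; m3.lsdv is just an illustrative name, and the slopes should match those of m3):

# least squares dummy variable (LSDV) version: country dummies instead of the within transformation
m3.lsdv <- lm(
  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality +
    factor(country),
  data = resource
)

# compare the slope estimates (the country dummies themselves are omitted here)
cbind(within = coef(m3), lsdv = coef(m3.lsdv)[2:7])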

Similar to an F-test, we check whether the country fixed effects explain any variation at all, using the Lagrange Multiplier test.

# check for unit(country) fixed effects
plmtest(m3, effect="individual")

    Lagrange Multiplier Test - (Honda)

data:  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality
normal = 53.306, p-value < 2.2e-16
alternative hypothesis: significant effects

The null hypothesis is that country fixed effects do not have any effect and that would mean, statistically, that we could leave them out. However, in this case we reject the null hypothesis and hence we do need to control for country fixed effects.

10.1.3 Time fixed effects

We now estimate the time fixed effects model to illustrate how this would be done. However, we already know that we need to include country fixed effects, and leaving them out would be a mistake. The time fixed effects model does not include country fixed effects and therefore makes that mistake. Generally, in the time fixed effects model we control for all sources of confounding that vary over time but are constant across the units (the countries), such as technological change (you can argue whether technological change really affects all countries in our sample in the same way). The time fixed effects model includes a dummy variable for every time period except the baseline.

# time fixed effects model
m4 <- plm(
  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality,
  data = resource,
  index = c("country", "year"),
  model = "within",
  effect = "time")

# model output time fixed effects
summary(m4)
Oneway (time) effect Within Model

Call:
plm(formula = institutions ~ oil + aid + log.gdp + polity2 + 
    log.pop + mortality, data = resource, effect = "time", model = "within", 
    index = c("country", "year"))

Unbalanced Panel: n = 58, T = 1-12, N = 672

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-1.101400 -0.301653 -0.012631  0.305685  0.927664 

Coefficients:
            Estimate Std. Error t-value  Pr(>|t|)    
oil       -0.0182239  0.0020133 -9.0516 < 2.2e-16 ***
aid        0.0113730  0.0030714  3.7029 0.0002311 ***
log.gdp    0.4815862  0.0200886 23.9731 < 2.2e-16 ***
polity2    0.0288310  0.0030044  9.5961 < 2.2e-16 ***
log.pop   -0.0453017  0.0101438 -4.4659 9.390e-06 ***
mortality  0.0059840  0.0012486  4.7924 2.041e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    457.29
Residual Sum of Squares: 105.88
R-Squared:      0.76847
Adj. R-Squared: 0.76245
F-statistic: 361.772 on 6 and 654 DF, p-value: < 2.22e-16
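
If you are curious what the estimated year effects look like, you can extract them with fixef() (a sketch; recent versions of plm support extracting time effects this way):

# estimated year (time) fixed effects from the time fixed effects model
fixef(m4, effect = "time")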

Notice that adjusted R^2 is much larger in the time fixed effects model than in the country fixed effects model. That does not mean that the time fixed effects model is better. In fact adjusted R^2 cannot be compared between country fixed effects and time fixed effects models. In the country fixed effects model, adjusted R^2 is the variation in the dependent variable that is explained by our independent variables that vary within countries. It is the explained within-country variation. In a time fixed effects model, adjusted R^2 gives us the explained within-time variation.

The time fixed effects model gives us different results from the country fixed effects model. We don’t like the time fixed effects model here because we already saw from the plmtest() that we need to include country fixed effects. We can, however, check whether we also need to include time fixed effects, or put differently, whether the time fixed effects matter jointly. We do this using plmtest() again.

# test for time fixed effects
plmtest(m4, effect="time")

    Lagrange Multiplier Test - time effects (Honda)

data:  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality
normal = 0.84001, p-value = 0.2005
alternative hypothesis: significant effects

The test comes back insignificant. That means, statistically speaking, we do not need to control for time fixed effects to have a consistent model. The test gives you justification to stick with the country fixed effects model. But we will ignore the test here. In the country fixed effects model, we have 608 degrees of freedom, so we can afford to estimate time fixed effects in addition. There are 12 time periods (indicated by the capital T in the summary output) and you can verify this like so:

# frequency table of year (i.e., number of observations per period)
table(resource$year)

1996 1998 2000 2002 2003 2004 2005 2006 2007 2008 2009 2010 
  73   73   73   73   73   73   73   73   73   73   73   73 
# number of time periods
length(table(resource$year))
[1] 12

With 608 degrees of freedom, we can easily afford to estimate another 11 parameters (one for each year except the baseline year). Having 608 degrees of freedom is like having 608 free observations (that is a lot of information).
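
You can verify the residual degrees of freedom directly (a quick sketch; df.residual() works on plm objects):

# residual degrees of freedom of the country fixed effects model
df.residual(m3)

# by hand: 672 observations - 58 country effects - 6 slope coefficients
672 - 58 - 6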

We do not make a mistake by controlling for potential confounders that vary across countries but are constant over time (unit fixed effects) and for confounders that vary over time but are constant across units (time fixed effects). Therefore, we do both.

10.1.4 Two-way fixed effects

We now estimate the two-way fixed effects model. We control for all confounders that vary across units (countries) but are constant over time and we control for all confounders that vary over time but are constant across units.

# two-way fixed effects model
m5 <-  plm(
  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality,
  data = resource,
  index = c("country", "year"),
  model = "within",
  effect = "twoways"
  )

summary(m5)
Twoways effects Within Model

Call:
plm(formula = institutions ~ oil + aid + log.gdp + polity2 + 
    log.pop + mortality, data = resource, effect = "twoways", 
    model = "within", index = c("country", "year"))

Unbalanced Panel: n = 58, T = 1-12, N = 672

Residuals:
       Min.     1st Qu.      Median     3rd Qu.        Max. 
-0.37098386 -0.05851291  0.00078975  0.06010988  0.44604514 

Coefficients:
             Estimate  Std. Error t-value  Pr(>|t|)    
oil       -0.00160143  0.00153067 -1.0462  0.295880    
aid        0.00277470  0.00099306  2.7941  0.005372 ** 
log.gdp    0.29880737  0.03793901  7.8760 1.604e-14 ***
polity2    0.01594021  0.00266203  5.9880 3.670e-09 ***
log.pop    0.00664606  0.08016389  0.0829  0.933954    
mortality  0.00400393  0.00173133  2.3126  0.021082 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    8.3506
Residual Sum of Squares: 6.9545
R-Squared:      0.16718
Adj. R-Squared: 0.063949
F-statistic: 19.9736 on 6 and 597 DF, p-value: < 2.22e-16

10.1.5 Serial correlation/auto-correlation

In a panel model, we always have serial correlation. Maybe “always” is an overstatement, but only just. Serial correlation means that the value of a variable at time t (say, 2000) in country i (say, Greece) is related to its value at t-1 (in 1999). Anything that is path dependent falls into this category, and institutional quality is surely path dependent. There is a statistical test for auto-correlation, but your default assumption should really be that auto-correlation is present.
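
To get a feel for this, you can correlate institutional quality with its previous observed value within each country (a rough, purely descriptive sketch using dplyr; it is not a substitute for the formal test below):

# correlation of institutions with its previous value within countries
library(dplyr)

lag.check <- resource %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(institutions.lag = dplyr::lag(institutions)) %>%
  ungroup()

cor(lag.check$institutions, lag.check$institutions.lag, use = "complete.obs")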

Let’s carry out the test. The null hypothesis is that we do not have auto-correlation.

# Breusch-Godfrey test
pbgtest(m5)

    Breusch-Godfrey/Wooldridge test for serial correlation in panel models

data:  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality
chisq = 226.99, df = 1, p-value < 2.2e-16
alternative hypothesis: serial correlation in idiosyncratic errors

Clearly, we do have auto-correlation, so we need to correct our standard errors. We need two libraries for this: first sandwich and second lmtest.

library(sandwich)
library(lmtest)

# heteroskedasticity and autocorrelation consistent standard errors
m5.hac <- coeftest(m5, vcov = vcovHC(m5, method = "arellano", type = "HC3"))
m5.hac

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)  
oil       -0.0016014  0.0025913 -0.6180  0.53680  
aid        0.0027747  0.0017274  1.6063  0.10874  
log.gdp    0.2988074  0.1323460  2.2578  0.02432 *
polity2    0.0159402  0.0061832  2.5780  0.01018 *
log.pop    0.0066461  0.3003638  0.0221  0.98235  
mortality  0.0040039  0.0055190  0.7255  0.46844  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference is noticeable: we now fail to reject the null hypothesis for the effect of aid. It would be a mistake not to correct for serial correlation.
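
To see where the correction bites, you can put the conventional and the heteroskedasticity- and autocorrelation-consistent (HAC) standard errors side by side (a quick sketch):

# conventional vs. HAC standard errors for the two-way fixed effects model
round(cbind(
  conventional = sqrt(diag(vcov(m5))),
  hac          = sqrt(diag(vcovHC(m5, method = "arellano", type = "HC3")))
), 4)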

10.1.6 Cross-sectional dependence/ spatial dependence

Spatial dependence is common in panel data sets but, unlike serial correlation, it is not always present. Spatial correlation means that some units that cluster together (usually geographically) are affected by some external shock in the same way. For instance, the Arab Spring affected countries in the MENA region in a similar way.

We test for cross-sectional dependence. If it exists, we need to correct for it. The null hypothesis is that we do not have spatial dependence.

# Pesaran test for cross-sectional dependence
pcdtest(m5)
Warning in pcdres(tres = tres, n = n, w = w, form = paste(deparse(x$formula)),
: Some pairs of individuals (7 percent) do not have any or just one time period
in common and have been omitted from calculation

    Pesaran CD test for cross-sectional dependence in panels

data:  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality
z = -2.2416, p-value = 0.02499
alternative hypothesis: cross-sectional dependence

The test comes back significant. Therefore, we need to adjust our standard errors for serial correlation, heteroskedasticity and spatial dependency.

Some political scientists like to estimate so-called panel-corrected standard errors (PCSE). In fact, Beck and Katz (1995) is one of the most cited political science papers of all time. However, Driscoll and Kraay (1998) propose standard errors that work even better in short panels (where we have few observations per unit). Their standard errors are sometimes called the SCC estimator. We correct for spatial correlation using SCC standard errors.

# Driscoll and Kraay SCC standard errors
m5.scc <- coeftest(m5, vcov = vcovSCC(m5, type = "HC3", cluster = "group"))
m5.scc

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)   
oil       -0.0016014  0.0027086 -0.5912 0.554577   
aid        0.0027747  0.0016836  1.6480 0.099872 . 
log.gdp    0.2988074  0.1332502  2.2425 0.025298 * 
polity2    0.0159402  0.0059858  2.6630 0.007953 **
log.pop    0.0066461  0.3061856  0.0217 0.982690   
mortality  0.0040039  0.0049989  0.8010 0.423468   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is our final model. We find no evidence for hypotheses 1 and 2: both oil and aid are unrelated to institutional quality (note that this is different from what you saw in the lecture; I had an error in the code, and this version is correct).
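
For comparison, you can also place the HAC and the Driscoll-Kraay (SCC) standard errors next to each other (a quick sketch; coeftest() objects are matrices, so column 2 holds the standard errors):

# HAC vs. Driscoll-Kraay (SCC) standard errors
round(cbind(hac = m5.hac[, 2], scc = m5.scc[, 2]), 4)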

10.1.7 The random effects model

We show you the random effects model only because you will often see it applied in political science. However, the model rests on a strong assumption. Recall from the lecture that the random effects model assumes that the time-invariant, unit-specific confounders are unrelated to our regressors. This assumption will almost always be violated. The random effects model is weak from a causal inference standpoint. However, it tends to do well in prediction tasks, where we are interested in predicting outcomes but don’t really care whether X is causally related to Y.

Let’s estimate the random effects model.

# random effects model
ran.effects <- plm(
  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality,
  data = resource,
  index = c("country", "year"),
  model = "random")

# model output
summary(ran.effects)
Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = institutions ~ oil + aid + log.gdp + polity2 + 
    log.pop + mortality, data = resource, model = "random", index = c("country", 
    "year"))

Unbalanced Panel: n = 58, T = 1-12, N = 672

Effects:
                  var std.dev share
idiosyncratic 0.01211 0.11003 0.072
individual    0.15550 0.39434 0.928
theta:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7312  0.9197  0.9197  0.9187  0.9197  0.9197 

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.40994 -0.06965 -0.00097  0.00060  0.07768  0.36508 

Coefficients:
              Estimate Std. Error z-value  Pr(>|z|)    
(Intercept) -1.2461766  0.6244808 -1.9955  0.045984 *  
oil         -0.0045894  0.0014159 -3.2413  0.001190 ** 
aid          0.0017600  0.0010371  1.6971  0.089677 .  
log.gdp      0.3144364  0.0288779 10.8885 < 2.2e-16 ***
polity2      0.0193579  0.0027204  7.1158 1.113e-12 ***
log.pop     -0.0994350  0.0316767 -3.1391  0.001695 ** 
mortality    0.0100485  0.0012924  7.7749 7.549e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    11.915
Residual Sum of Squares: 9.0041
R-Squared:      0.24433
Adj. R-Squared: 0.23751
Chisq: 214.434 on 6 DF, p-value: < 2.22e-16

As mentioned, you will have an extremely hard time convincing anyone of a causal claim based on a random effects model. However, sometimes you cannot estimate a fixed effects model. For instance, if you wish to estimate the effect of the electoral system on some outcome, you have the problem that the electoral system does not vary within countries (countries tend to choose an electoral system and stick with it). That means you cannot estimate a unit fixed effects model. You can, however, estimate the random effects model in that case.

The absolute minimum hurdle that you need to pass to be allowed to use the random effects model is to carry out the Hausman test. The test assesses whether the errors are correlated with the X variables. It thus tests the assumption that the random effects model is based on.

However, we have to caution against the Hausman test! It does not take heteroskedastic errors into account and it does not take serial correlation into account. That’s a big problem. Even if the Hausman test confirms that the random effects model is consistent, it may be wrong. We should always be skeptical of the random effects model (when it is used to make a causal claim).

Let’s run the Hausman test. Its null hypothesis is that the errors and the X’s are uncorrelated and hence the random effects model is consistent.

# hausman test
phtest(m5, ran.effects)

    Hausman Test

data:  institutions ~ oil + aid + log.gdp + polity2 + log.pop + mortality
chisq = 107.47, df = 6, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

The Hausman test rejects the null hypothesis. The random effects model is inconsistent. You now have all the tools to carry out your own analysis. Go ahead and show us whether more guns lead to less crime or not.

10.1.8 Homework exercises

More guns, less crime. This is the claim of an (in)famous book by John Lott. It argues that violent crime rates in the United States decrease when gun ownership restrictions are relaxed. The data used in Lott’s research compares violent crimes, robberies, and murders across 50 states to determine whether the so-called “shall” laws that remove discretion from licence-granting authorities actually decrease crime rates. So far, 41 states have passed these “shall” laws, under which a person applying for a licence to carry a concealed weapon does not have to provide justification or “good cause” for requiring a concealed-weapon permit.

Load the guns.csv dataset directly into R by running the following line:

guns <- read.csv("https://github.com/QMUL-SPIR/Public_files/raw/master/datasets/guns.csv")

The data includes the following variables:

Variable     Description
mur          Murder rate (incidents per 100,000)
shall        = 1 if the state has a shall-carry law in effect in that year, 0 otherwise
incarc rate  Incarceration rate in the state in the previous year (sentenced prisoners per 100,000 residents)
pm1029       Percent of state population that is male, ages 10 to 29
stateid      ID number of the state (Alabama = 1, Alaska = 2, etc.)
year         Year (1977-1999)

  1. Estimate the effect of shall using a simple linear model and interpret it.

  2. Estimate a unit fixed effects model and a random effects model. Are both models consistent? If not, which is the appropriate model? Use a consistent model to estimate the effect of the shall laws on the murder rate.

  3. Think of a theoretical reason to control for time fixed effects (what confounding sources could bias our estimate of the shall laws?). Test for time fixed effects using the appropriate test. If time fixed effects are required, re-estimate the fixed effects model as a two-way fixed effects model and interpret the effect of lax gun laws.

  4. Correct the standard errors to account for heteroskedasticity and serial correlation. Does the conclusion regarding the effect of the shall laws change?

  5. Test for cross-sectional dependence and, if present, use the SCC estimator to correct for heteroskedasticity, serial correlation, and spatial dependence. Does our conclusion regarding the effect of the shall laws change?