Collinearity
Collinearity (or multicollinearity; the two terms are synonyms) describes a situation where a set of variables (such as regressors) has an exact or nearly exact linear relationship. If the columns of the \(\bf{X}\) matrix in a regression are exactly (or perfectly) collinear, then the \({\bf{X'X}}\) matrix formed from them will not be invertible; if they are nearly collinear, it may be difficult to invert the matrix using standard methods on a standard computer because of precision issues. Perfect collinearity and near collinearity have very different sources and require very different approaches, so we will split the topic into two parts.
In most cases, there is a relatively simple workaround for perfect collinearity: when the inversion routine detects that variable \(K\) is collinear with the variables that precede it, it zeros out row and column \(K\) and just continues on. In effect, this removes the variable that (in the order the variables were included in the regression) completes the collinear set. For instance, suppose we fall into the dummy variable trap and include CONSTANT and both the MALE and FEMALE dummies in a regression (rather than either the two dummies alone, or CONSTANT and one of the two dummies):
linreg wage
# constant male female
LINREG will go ahead with that and produce the following output:
Linear Regression - Estimation by Least Squares
Dependent Variable WAGE
Usable Observations 3294
Degrees of Freedom 3292
Centered R^2 0.0317459
R-Bar^2 0.0314517
Uncentered R^2 0.7639932
Mean of Dependent Variable 5.7575850178
Std Error of Dependent Variable 3.2691857840
Standard Error of Estimate 3.2173642756
Sum of Squared Residuals 34076.917047
Regression F(1,3292) 107.9338
Significance Level of F 0.0000000
Log Likelihood -8522.2280
Durbin-Watson Statistic 1.8662
Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. Constant 5.1469238679 0.0812248211 63.36639 0.00000000
2. MALE 1.1660972915 0.1122421588 10.38912 0.00000000
3. FEMALE 0.0000000000 0.0000000000 0.00000 0.00000000
This is exactly the result you would get if you had (properly) left the FEMALE dummy out of the regression. If you ordered the regressors as CONSTANT FEMALE MALE, you would get a FEMALE coefficient with MALE being zeroed out instead. (Note that the degrees of freedom are corrected to subtract only two regressors, not three.)
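For reference, the version with the regressors reordered as just described would be run as:
linreg wage
# constant female male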
The single most common source of perfect collinearity is a single regressor which is zero throughout the sample being used. While this can be due to using a subsample over which a variable (usually a dummy) is zero, it more frequently occurs in non-linear estimation when the PARMSET includes a parameter which doesn't appear in the function being optimized. In the example below, the DELTA variable is mistakenly included on the NONLIN instruction even though it doesn't appear in the NLCONST FRML:
nonlin beta0 delta beta1 beta2
frml nlconst cons = beta0+beta1*inc^beta2
*
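* Preliminary linear regression gives guess values for BETA0 and BETA1; BETA2 starts at 1.0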
linreg cons
# constant inc
compute beta0=%beta(1),beta1=%beta(2),beta2=1.0
nlls(frml=nlconst)
Again, the unnecessary variable shows up in the output with a zero coefficient and a zero standard error:
Nonlinear Least Squares - Estimation by Gauss-Newton
Convergence in 26 Iterations. Final criterion was 0.0000008 <= 0.0000100
Dependent Variable CONS
Quarterly Data From 1960:01 To 2009:04
Usable Observations 200
Degrees of Freedom 197
Centered R^2 0.9987629
R-Bar^2 0.9987503
Uncentered R^2 0.9997774
Mean of Dependent Variable 4906.7400000
Std Error of Dependent Variable 2304.4552354
Standard Error of Estimate 81.4634087
Sum of Squared Residuals 1307348.5306
Regression F(2,197) 79523.7556
Significance Level of F 0.0000000
Log Likelihood -1162.3071
Durbin-Watson Statistic 0.4081
Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. BETA0 299.01962998 48.85249050 6.12087 0.00000000
2. DELTA 0.00000000 0.00000000 0.00000 0.00000000
3. BETA1 0.28931304 0.03433131 8.42709 0.00000000
4. BETA2 1.12408725 0.01243615 90.38867 0.00000000
Sometimes a regression with perfect collinearity is run intentionally, to allow a common parameter list to be shared between two models which use slightly different sets of free parameters. For instance, in this second case, you might have a related model in which DELTA does appear.
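As a hedged sketch of that situation (the FRML NLCONST2 and the extra series GOVT are hypothetical, introduced only for illustration), a related model sharing the same PARMSET might look like:
* Hypothetical related model in which DELTA does appear
frml nlconst2 cons = beta0+beta1*inc^beta2+delta*govt
compute delta=0.0
nlls(frml=nlconst2)
With the common NONLIN list, the first model reports DELTA with a zero coefficient, while the second actually estimates it.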
Perfect collinearity isn't always easily ignored. In a multivariate regression (as done with SUR or NLSYSTEM), most estimation methods depend upon an estimate of the covariance matrix of the residuals. If the number of equations is larger than the number of usable time periods, an unrestricted estimate of the covariance matrix has to be singular, and there is no mechanical way to deal with that because the inverse of the matrix is needed to properly weight the information from the different equations. Even if the number of equations is slightly smaller than the number of usable time periods, the unrestricted covariance matrix can still be singular because the residuals have fewer degrees of freedom (that is, fewer pieces of independent information) than the original data. The only solutions to the singularity problem are:
1. Use fewer equations.
2. Use more data points. Note that the only observations used in one of these regressions are ones for which all the equations have data, so if you have some series with missing data, it's possible that dropping a few equations will also increase the usable sample size.
3. Use a separate estimate of the covariance matrix (input to the instruction using the CV option). For instance, shrinking the off-diagonal elements (multiplying by, say, .8 off the diagonal and 1 on the diagonal) will give you a non-singular matrix; a sketch of this follows the list.
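A minimal sketch of the shrinkage in option 3, assuming SIGMAHAT is a SYMMETRIC array holding an unrestricted residual covariance estimate saved from a preliminary estimation (the names SIGMAHAT and SIGMASHRUNK are illustrative):
* Start from a copy, then scale the off-diagonal elements by .8
compute sigmashrunk = sigmahat
ewise sigmashrunk(i,j) = %if(i==j,1.0,0.8)*sigmahat(i,j)
* SIGMASHRUNK can then be supplied to SUR or NLSYSTEM with CV=SIGMASHRUNK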
At one point, near collinearity was a major issue, particularly in time series analysis, and you can often determine when the early editions of a textbook were written based upon how much space is given to a discussion of it. Computer arithmetic at the old single precision (largely made obsolete by floating point processors that were standard by the early 1990s) could not cope with the high degree of correlation among lags of typical time series data for things like distributed lags and vector autoregressions. (The correlation between \(x_t\) and \(x_{t-1}\) approaches 1 as the data set gets large if \(x\) is, for instance, a random walk.) A famous test case for linear regressions was the Longley data set (from 1967), which would generally require special algorithms for the regression to be computed in single precision. With the almost universal change to double precision, these issues were largely eliminated.
While you can now safely run time series regressions with blocks of lags of persistent data among your regressors without having to worry about the calculations being wrong, it's important to understand that the use of highly correlated regressors does affect the interpretation of the output. The following is an example of an eight-lag autoregression for (log) U.S. M1:
linreg fm1
# constant fm1{1 to 8}
Linear Regression - Estimation by Least Squares
Dependent Variable FM1
Monthly Data From 1959:09 To 2006:04
Usable Observations 560
Degrees of Freedom 551
Centered R^2 0.9999585
R-Bar^2 0.9999579
Uncentered R^2 0.9999993
Mean of Dependent Variable 6.1448743408
Std Error of Dependent Variable 0.7782903192
Standard Error of Estimate 0.0050493344
Sum of Squared Residuals 0.0140481739
Regression F(8,551) 1660040.8853
Significance Level of F 0.0000000
Log Likelihood 2171.4903
Durbin-Watson Statistic 1.9983
Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. Constant 0.002536927 0.001730916 1.46566 0.14331249
2. FM1{1} 1.176161443 0.042617338 27.59819 0.00000000
3. FM1{2} -0.089009016 0.065642350 -1.35597 0.17566435
4. FM1{3} 0.057594948 0.065793757 0.87539 0.38174523
5. FM1{4} -0.099986781 0.065901004 -1.51723 0.12978289
6. FM1{5} 0.020901289 0.065888273 0.31722 0.75119441
7. FM1{6} 0.086948615 0.065847199 1.32046 0.18722959
8. FM1{7} -0.136515526 0.065843403 -2.07334 0.03860510
9. FM1{8} -0.016303804 0.042740776 -0.38146 0.70301061
There are several things to note about the output. First, there is a tendency for the coefficients to switch signs from one lag to the next. Second, the standard errors will (almost always) be fairly flat in the middle lags (from lag 2 to lag 7), with considerably lower values for the end lags (1 and 8 in this case). Both of these come about because of the high correlation between adjacent lags of a persistent data series such as this. Almost any single lag can be removed from the regression with relatively little effect on the fit (only two of the eight are individually significant at the 5% level) because its neighboring lags can do a reasonable job of proxying for it. Because the middle lags have two neighbors, while the end lags have just one, the middle lags aren't as well-determined, hence the higher standard errors. The sign changes are due to the fact that if the regressors themselves are positively correlated, their coefficients are negatively correlated: move one up and the next one down by the same amount and almost nothing happens to the fit of the regression.
The main takeaway is that the individual coefficients in a model like this aren't structural, that is, they do not have any meaning outside of their place in the overall model. Some combinations of the coefficients can be structural (for instance, the sum would form the basis for a unit root test), but not the single coefficients.
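As an aside on that last point, the sum of the lag coefficients and its standard error could be reported with the SUMMARIZE instruction, assuming the LINREG above is the most recent estimation, using something like:
summarize
# fm1{1 to 8}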
One other thing to note (along the same lines): the individual t-statistics don't tell you much about how blocks of lags behave. For instance, of the t's for lags 5, 6, 7 and 8, three are quite insignificant and one is marginally significant (at .05). However, if we test an exclusion of all four together:
exclude
# fm1{5 to 8}
Null Hypothesis : The Following Coefficients Are Zero
FM1 Lag(s) 5 to 8
F(4,551)= 5.77397 with Significance Level 0.00014765
The result is very highly significant, which would come as a surprise if you looked only at the individual coefficient information.