Thursday, October 27, 2011

Correlated regressors

In ordinary least-squares linear regression, correlated regressors lead to unstable model parameter estimates.  The intuition is that given two correlated regressors, it is difficult to determine how much of the data is accounted for by one regressor and how much by the other.  Let's look at this geometrically.


% Generate two regressors (in the columns of the matrix).
% These two regressors are relatively uncorrelated (nearly orthogonal).
X = [10  1;
      1 10];

% Generate some data (no noise has been added yet).
data = [24 25]';

% Simulate 100 measurements of the data (with noise added).
% For each measurement, estimate the weights on the regressors.
y = zeros(2,100);
h = zeros(2,100);
for rep=1:100
  y(:,rep) = data + 2*randn(2,1);
  h(:,rep) = inv(X'*X)*X'*y(:,rep);
end

% Estimate weights on the regressors for the case of no noise.
htrue = inv(X'*X)*X'*data;

% Now visualize the results
figure(999); clf; hold on;
h1 = scatter(y(1,:),y(2,:),'g.');
h2 = scatter(data(1),data(2),'k.');
axis square; axis([0 50 0 50]);
h3 = drawarrow(repmat([0 0],[2 1]),X','r-',[],10);
for p=1:size(X,2)
  h4 = scatter(X(1,p)*h(p,:),X(2,p)*h(p,:),25,'gx');
  h5 = scatter(X(1,p)*htrue(p),X(2,p)*htrue(p),'k.');
end
h6 = drawarrow(X(:,1)',X(:,2)','b-',[],0);
xlabel('dimension 1');
ylabel('dimension 2');
legend([h1 h2 h3(1) h4 h5 h6], ...
       {'measured data' 'noiseless data' 'regressors' ...
        'estimated weights' 'true weights' 'difference between regressors'});

The green X's represent each regressor scaled by the weight estimated for that regressor in each of the 100 simulations.  The X's are indicative of how reliably we can estimate the weights.  In this example, the weights are estimated quite reliably (the spread of the X's is relatively small).
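For readers without MATLAB, here is a rough NumPy translation of the simulation above (a sketch, omitting the plotting and the custom drawarrow function; variable names mirror the MATLAB code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two relatively uncorrelated (nearly orthogonal) regressors in the columns.
X = np.array([[10.0,  1.0],
              [ 1.0, 10.0]])

# Noiseless data.
data = np.array([24.0, 25.0])

# Simulate 100 noisy measurements; estimate weights via OLS each time.
h = np.empty((2, 100))
for rep in range(100):
    y = data + 2 * rng.standard_normal(2)
    h[:, rep] = np.linalg.lstsq(X, y, rcond=None)[0]

# The spread of the estimates indicates how reliably the weights are recovered.
print(h.std(axis=1))
```

With these near-orthogonal regressors the standard deviations of the estimated weights come out small, matching the tight spread of the green X's in the figure.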

% Let's repeat the simulation but now with
% two regressors that are highly correlated.
X = [6 5;
     5 6];

In this example, the two regressors are highly correlated and weight estimation is unreliable.  To understand why this happens, examine the difference between the regressors.  Notice that the difference between the regressors is quite small.  Noise in the data shifts the data along this difference, giving rise to substantially different parameter estimates. For example, if the measured data shifts towards the upper-left, then this tends to produce high weights for the upper-left regressor (and low weights for the bottom-right regressor); if the measured data shifts towards the bottom-right, then this tends to produce high weights for the bottom-right regressor (and low weights for the upper-left regressor).


The stability of model parameter estimates is determined (in part) by the amount of noise in the direction of the regressor difference.  If the projection of the noise onto the regressor difference has small variance (as in the first example), then parameter estimates will tend to be stable; if the projection has large variance (as in the second example), then parameter estimates will tend to be unstable.
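This instability can also be quantified analytically: under OLS with noise variance sigma^2, the covariance of the estimated weights is sigma^2 * inv(X'X), so correlated columns inflate the variances.  A quick NumPy check (a sketch, not code from the original post) comparing the two design matrices used above:

```python
import numpy as np

sigma2 = 4.0  # noise variance used in the simulations (2*randn)

designs = [np.array([[10.0,  1.0], [ 1.0, 10.0]]),  # nearly orthogonal
           np.array([[ 6.0,  5.0], [ 5.0,  6.0]])]  # highly correlated

for X in designs:
    # Covariance of the OLS weight estimates: sigma^2 * inv(X'X).
    cov = sigma2 * np.linalg.inv(X.T @ X)
    print(np.sqrt(np.diag(cov)))  # standard deviation of each weight estimate
```

The correlated design yields weight standard deviations roughly seven times larger than the near-orthogonal design, consistent with the much wider spread of estimates in the second simulation.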

So how can we obtain better parameter estimates in the case of correlated regressors? One solution is to use regularization strategies (which will be described in a later post).
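As a preview, one such strategy is ridge regression, which adds a penalty lambda*I to X'X before inverting.  The sketch below (a hypothetical illustration, not the post's code; the penalty lambda = 1 is an arbitrary choice) repeats the correlated-regressor simulation and compares the variability of OLS and ridge estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

# The highly correlated design from the second example.
X = np.array([[6.0, 5.0],
              [5.0, 6.0]])
data = np.array([24.0, 25.0])
lam = 1.0  # ridge penalty (hypothetical choice for illustration)

h_ols = np.empty((2, 100))
h_ridge = np.empty((2, 100))
I = np.eye(2)
for rep in range(100):
    y = data + 2 * rng.standard_normal(2)
    # OLS: solve (X'X) h = X'y.
    h_ols[:, rep] = np.linalg.solve(X.T @ X, X.T @ y)
    # Ridge: solve (X'X + lam*I) h = X'y.
    h_ridge[:, rep] = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

print(h_ols.std(axis=1))    # large spread
print(h_ridge.std(axis=1))  # ridge estimates vary less
```

The ridge estimates are more stable because the penalty damps exactly the poorly constrained direction (the regressor difference), at the cost of some bias toward zero.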
