Wednesday, December 14, 2011

Noise, model complexity, and overfitting

When building models, ideally we would (1) have plenty of data available and (2) know what model to apply to the data.  If these conditions hold, then we would simply need to estimate the parameters of the model and verify that the model is indeed accurate.  The problem is that in reality, these conditions often do not hold.  That is, we often have limited or noisy data and we often do not know what model is appropriate for the data at hand.  In these cases, we are faced with the task of trying different models and determining which model is best.

It is useful to think of model complexity as a dimension along which models vary.  Complex, flexible models have the potential to describe many different types of functions.  The advantage of such models is that the true model (i.e. the model that most accurately characterizes the data) may in fact be contained in the set of models that can be described.  The disadvantage of such models is that they have many free parameters, and it may be difficult to obtain good parameter estimates with limited or noisy data.  (Or, stated another way: when fitting such models to limited or noisy data, there is a strong possibility of overfitting the data --- that is, a strong possibility that random variability in the data will have too much influence on parameter estimates and give rise to an inaccurate model.)  On the other hand, simple, less flexible models describe fewer types of functions compared to complex models.  The advantage of simple models is that they have fewer free parameters, and so it becomes feasible to obtain good parameter estimates with limited or noisy data.  (Or, stated another way: there is low likelihood that simple models will overfit the data.)  The disadvantage of simple models is that the types of functions that can be described may be poor approximations to the true underlying function.

A priori, it is impossible to say whether a complex or simple model will yield the most accurate model estimate for a given dataset (since it depends on the amount of data available, the nature of the effect, etc.).  We must empirically try different models and evaluate their performance, e.g. using cross-validation.  In this post, we provide a simple example illustrating these concepts.


% Construct training dataset.
x = linspace(-2,2,30);
modelfun = @(x) x.^4 - 20*x.^3 + 5*x.^2 + 4*x + 1;
y = feval(modelfun,x) + 50*randn(1,length(x));

% Construct testing dataset.
xval = repmat(linspace(-2,2,30),[1 10]);
yval = feval(modelfun,xval) + 50*randn(1,300);

% Define max polynomial degrees to try.
degstodo = 1:10;
degs = [1 3 10];  % display these

% Make a figure.
results = []; ax1 = []; ax2 = [];
figure(999); setfigurepos([100 100 800 600]); clf;
for dd=1:length(degstodo)
  makemodel = @(x) catcell(2,arrayfun(@(n) x'.^n,0:degstodo(dd),'UniformOutput',0));
  X = feval(makemodel,x);
  recboot = fitprfstatic(X,y',0,0,[],100,[],[],[],@calccod);  % bootstrap
  rec     = fitprfstatic(X,y',0,0,[],0,[],[],[],@calccod);    % full fit
  pred = feval(makemodel,xval)*rec.params';
  results(dd,1) = rec.r;  % training R^2
  results(dd,2) = calccod(pred,yval');  % testing R^2
  modelfits = [];
  for p=1:size(recboot.params,1)
    modelfits(p,:) = X*recboot.params(p,:)';
  iii = find(ismember(degs,degstodo(dd)));
  if ~isempty(iii)
    subplot(length(degs),2,(iii-1)*2+1); hold on;
    h1 = scatter(x,y,'k.');
    mn = median(modelfits,1);
    se = stdquartile(modelfits,1,1);
    [d,ix] = sort(x);
    h2 = errorbar3(x(ix),mn(ix),se(ix),'v',[1 .8 .8]);
    h3 = plot(x(ix),mn(ix),'r-');
    h4 = plot(x(ix),feval(modelfun,x(ix)),'b-');
    xlabel('x'); ylabel('y');
    legend([h1 h3 h2 h4],{'Data' 'Model fit' 'Error bars' 'True model'},'Location','NorthEastOutside');
    ax1(iii,:) = axis;
    title(sprintf('Max polynomial degree = %d',degstodo(dd)));

    subplot(length(degs),2,(iii-1)*2+2); hold on;
    xlabel('Polynomial degree');
    ylabel('Estimated weight');
    ax = axis;
    axis([-1 max(degstodo)+1 ax(3:4)]);
    ax2(iii,:) = axis;

% We have created a dataset in which there is a nonlinear relationship
% between x and y.  The observed data points are indicated by black dots,
% and the true relationship is indicated by a blue line.  We have fit three
% different models to the data: a linear model (max polynomial degree of 1),
% a cubic model (max polynomial degree of 3), and a model that includes
% polynomials up to degree 10.  For each model, we have bootstrapped the
% fits, allowing us to see the variability of the model fit (pink error bars
% around the red line), as well as the variability of the parameter
% estimates (red error bars on the vertical black bars).
% There are several important observations:
% 1. The cubic model gives the most accurate model estimate.  The linear model
%    underfits the data, while the 10th degree model overfits the data.
% 2. In the cubic and 10th degree models, we estimate weights on higher
%    polynomials, and this changes the estimated weights on the lower
%    polynomials.  The change in weights indicates that the lower and higher
%    polynomials have correlated effects in the data.  In general, correlated regressors
%    are not necessarily problematic.  However, in the given dataset, we do not
%    have much data available and the influence of the higher polynomials is
%    weak (in fact, there are no polynomials beyond degree 4 in the true model).
%    Thus, we are better off ignoring the higher polynomials than attempting to
%    include them in the model.

% Because the data were generated from a known model, we can compare
% the various models to the known model in order to see which model is
% most accurate.  However, in general, we do not have access to the true,
% underlying model.  Instead, what we can do is to use cross-validation to
% quantify model accuracy.
figure(998); clf; hold on;h = plot(results,'o-');
h2 = straightline(calccod(feval(modelfun,xval),yval),'h','k-');
legend([h' h2],{'Training' 'Testing' 'Testing (true model)'});
ax = axis;
axis([0 length(degstodo)+1 0 ax(4)]);
xlabel('Max polynomial degree');
ylabel('Variance explained (R^2)');

% In this figure, we see that increasing the maximum polynomial degree invariably
% increases performance on the training data.  However, we see that increasing
% the maximum polynomial degree only increases performance on the testing data
% up to a degree of 3.  Beyond degree 3, the additional higher-order polynomials
% causes the model to overfit the training data and produces model estimates
% that perform more poorly on the testing data.  Thus, by using cross-validation we
% have found that the cubic model is the most accurate model.
% Notice that the performance of the cubic model is nevertheless still lower than
% that of the true model.  In order to improve the performance of our model
% estimate, we either have to change our modeling strategy (e.g. change the
% model, use regularization in parameter estimation) or collect more data.
% But in the end, there is no magic bullet --- achieving imperfect models is
% an inevitable feature of model building!


  1. Hi Kendrick
    I just read your post and thought that I would ask you something. I know it´s been a while since you wrote this post, but you might still be willing to help? :)
    I am trying to characterize and model the noise in a dataset, and I was wondering if it was possible to do this by deliberatly overfitting the data, i.e. deliberatly fitting the noise?

    Kind regards
    Christian L. Jensen

  2. Without knowing more about your data situation, it seems that your suggestion is sensible. In a sense, a model that fits every single data point perfectly is a model that is maximally overfit (i.e. it includes not only the signal but all of the noise too).