Sunday, December 25, 2011

Importance of error bars on data points when evaluating model quality

When measuring some quantity, the same value might not be obtained across repeated measurements --- in other words, noise may exist in the measurements.  Some portion of the noise might be due to factors that can be controlled for, and if these factors are controlled for, then the noise level could potentially be reduced.  However, for sake of the present discussion, let's assume that the noise present in a given situation can neither be controlled nor predicted.

When evaluating how well a model characterizes a given dataset, it is important to take into account the intrinsic noisiness of the data.  This is because a model cannot be expected to predict all of the data, as doing so would imply that the model predicts even the portion of the data that is due to noise (which, by our definition, is unpredictable).

Let's illustrate these ideas using a simple set of examples:



The top-left scatter plot shows some data (black dots) and a linear model (red line).  The model does fairly well capturing the dependency of the y-coordinate on the x-coordinate.  However, there are no error bars on the data points, so we lack critical information.  Our interpretation of the model and the data will be quite different, depending on the size of the error bars.

The upper-right scatter plot shows one possible case: the error on the data points is relatively small.  In this case there is a large amount of variance in the data that is both real (i.e. not simply attributable to noise) and not accounted for by the model.

The lower-left scatter plot shows another possible case: the error on the data points is relatively large.  In this case there appears to be very little variance in the data that is both real and not accounted for by the model.

The lower-right scatter plot shows one last case: the error on the data points is extremely large.  This case is in fact nearly impossible, and suggests that the error bar estimates are inaccurate.  We will examine this case in more detail later.

CODE

% CASE A:
% We have a model and it is the optimal characterization of the data.
% The only failure of the model is its inability to predict the
% noise in the data.  (Note that this case corresponds to case 2
% in the initial set of examples.)
noiselevels = [10 2];
figure(999); setfigurepos([100 100 500 250]); clf;
for p=1:length(noiselevels)
  x = linspace(1,9,50);
  y = polyval([1 4],x);
  subplot(1,2,p); hold on;
  measurements = bsxfun(@plus,y,noiselevels(p)*randn(10,50));  % 10 measurements per data point
  mn = mean(measurements,1);
  se = std(measurements,[],1)/sqrt(10);
  h1 = scatter(x,mn,'k.');
  h2 = errorbar2(x,mn,se,'v','k-');
  h3 = plot(ax(1:2),polyval([1 4],ax(1:2)),'r-');
  xlabel('x'); ylabel('y');
  legend([h1 h2(1) h3],{'Data' 'Error bars' 'Model'},'Location','SouthEast');
  r2 = calccod(y,mn);
  dist = calccod(repmat(y,[10000 1]),bsxfun(@plus,y,sqrt(mean(se.^2))*randn(10000,50)),2);
  title(sprintf('R^2 = %.0f%%; maximum R^2 = [%.0f%% %.0f%%]',r2,prctile(dist,2.5),prctile(dist,97.5)));
  if p==1
    ax = axis;
  end
  axis(ax);
end



% We have conducted two simulations.  In both simulations, there is a simple linear
% relationship between x and y.  Assuming this linear relationship, we have generated
% 10 measurements of y for various fixed values of x, and we have plotted the mean and
% standard error of these measurements.  In the simulation on the left, the noise level
% is relatively high, whereas in the simulation on the right, the noise level is relatively
% low.  The R^2 between the model and the data is 35% and 92% for the first and
% second simulations, respectively.  (For now, ignore the reported "maximum R^2"
% values, as these will be explained later.)
%
% What these simulations demonstrate is that (1) an optimal model can produce very
% different R^2 values, depending on the level of noise in the data and (2) if a model
% passes through or close to the error bars around most data points, the model may be
% nearly, if not fully, optimal.

% CASE B:
% We have a model and it characterizes the data only to a limited extent.
% The model fails to characterize aspects of the data that do not
% appear to be merely due to measurement noise.  (Note that this case
% corresponds to case 1 in the initial set of examples.)
noiselevel = 2;
figure(998); setfigurepos([100 100 250 250]); clf; hold on;
x = linspace(1,9,50);
y = polyval([1 4],x);
measurements = bsxfun(@plus,y + 4*randn(1,50),noiselevel*randn(10,50));
mn = mean(measurements,1);
se = std(measurements,[],1)/sqrt(10);
h1 = scatter(x,mn,'k.');
h2 = errorbar2(x,mn,se,'v','k-');
h3 = plot(ax(1:2),polyval([1 4],ax(1:2)),'r-');
xlabel('x'); ylabel('y');
legend([h1 h2(1) h3],{'Data' 'Error bars' 'Model'},'Location','SouthEast');
r2 = calccod(y,mn);
dist = calccod(repmat(y,[10000 1]),bsxfun(@plus,y,sqrt(mean(se.^2))*randn(10000,50)),2);
title(sprintf('R^2 = %.0f%%; maximum R^2 = [%.0f%% %.0f%%]',r2,prctile(dist,2.5),prctile(dist,97.5)));



% In this simulation, there is an overall linear relationship between x and y
% (as indicated by the red line).  However, this relationship does not fully
% characterize the y-values --- there are many data points that deviate
% substantially (many standard errors away) from the model.  The R^2 between
% the model and the data is just 7%.
%
% Given the relatively low noise level in the data, it intuitively seems that
% a much higher R^2 value should be possible.  For example, perhaps there is
% another regressor (besides the x-coordinate) that, if added to the model,
% would substantially improve the model's accuracy.  Without getting into the
% issue of how to improve the current model, we can use a simple set of
% simulations to confirm that the current R^2 value is indeed not optimal.
% First, we assume that the current model's prediction of the data points
% are the true (noiseless) y-values.  Then, we generate new sets of
% data by simulating measurements of the true y-values; for these
% measurements, we match the noise level to that observed in the actual
% dataset.  Finally, we calculate the R^2 between the model and the
% simulated datasets.  The 95% confidence interval on the resulting R^2
% values is indicated as the "maximum R^2" values in the title of the figure.
%
% The results of the simulations show that for our example,
% the R^2 value that we can expect to attain is at least 88%.  This
% confirms our suspicion that the current R^2 value of 7% is suboptimal.

% CASE C:
% We have a model and it appears to characterize the data quite well.
% In fact, the regularity and predictability of the data appear to be
% higher than what would be expected based on the large error bars
% on the data points.  (Note that this case corresponds to case 3 in the
% initial set of examples.)
noiselevel = 10;
figure(997); setfigurepos([100 100 250 250]); clf; hold on;
x = linspace(1,9,30);
y = polyval([1 4],x);
measurements = bsxfun(@plus,y,noiselevel*randn(10,30));
mn = mean(measurements,1);
se = std(measurements,[],1)/sqrt(10);
mn = .3*mn + .7*y;  % a simple hack to make the data better than it should be
h1 = scatter(x,mn,'k.');
h2 = errorbar2(x,mn,se,'v','k-');
h3 = plot(ax(1:2),polyval([1 4],ax(1:2)),'r-');
xlabel('x'); ylabel('y');
legend([h1 h2(1) h3],{'Data' 'Error bars' 'Model'},'Location','SouthEast');
r2 = calccod(y,mn);
dist = calccod(repmat(y,[10000 1]),bsxfun(@plus,y,sqrt(mean(se.^2))*randn(10000,30)),2);
title(sprintf('R^2 = %.0f%%; maximum R^2 = [%.0f%% %.0f%%]',r2,prctile(dist,2.5),prctile(dist,97.5)));



% In this simulation, the model characterizes the data quite well,
% with an R^2 value of 80%.  However, performing the same
% "maximum R^2" calculations described earlier, we find
% that the 95% confidence interval on the R^2 values that we
% can expect given the noise level in the data is [2%, 58%].
% This indicates that the model is, in a sense, doing too well.
% The problem might be due to inaccuracy of the error bar
% estimates: perhaps the error bar estimates are larger than
% they should be, or perhaps the errors on different data points
% are not independent of one another (the assumption of
% independence is implicit in the calculation of the
% maximum R^2 values).  Resolving tricky problems like
% these requires detailed inspection of the data on a case-
% by-case basis.

No comments:

Post a Comment