Saturday, November 12, 2011

Error bars on standard deviations

We are often concerned with estimating the mean of a population. Given that we can obtain only a limited number of samples from the population, what we normally do is to compute the mean of the samples and then put an error bar on the mean.

Now, sometimes we might be interested in the standard deviation of the population. Notice that the exact same problem arises --- with a limited number of samples from the population, we can obtain an estimate of the standard deviation of the population (by simply computing the standard deviation of the samples), but this is just an estimate. Thus, an error bar can be put on the standard deviation, too.

CODE

% Let's perform a simulation. For several different sample sizes,
% we draw random samples from a normal distribution with mean 0
% and standard deviation 1. For each random sample, we compute
% the standard deviation of the sample. We then look at the
% distribution of the standard deviations across repeated simulations.
nn = 8;
ns = 2.^(1:nn);
cmap = cmapdistinct(nn);
figure(999); setfigurepos([100 100 700 250]); clf;
subplot(1,2,1); hold on;
h = []; sd = [];
for p=1:length(ns)
x = randn(100000,ns(p));
dist = std(x,[],2);
[n,x] = hist(dist,100);
h(p) = plot(x,n,'-','Color',cmap(p,:));
sd(p) = std(dist);
end
ax = axis;
legend(h,cellfun(@(x) ['n = ' x],mat2cellstr(ns),'UniformOutput',0));
xlabel('Standard deviation');
ylabel('Frequency');
title('Standard deviations obtained using different sample sizes');
subplot(1,2,2); hold on;
h2 = plot(1:nn,sd,'r-');
ax = axis; axis([0 nn+1 ax(3:4)]);
set(gca,'XTick',1:nn,'XTickLabel',mat2cellstr(ns));
xlabel('n');
ylabel('Standard deviation of distribution');
title('Spread of standard deviation estimates');



% Notice that with few data points, the standard deviations are highly variable.
% With increasing numbers of data points, the standard deviations are more tightly
% clustered around the true value of 1.
%
% To put some concrete numbers on this: at n = 32, the spread in the distribution
% (which we quantify using standard deviation) is about 0.12. This indicates that
% if we draw 32 data points from a normal distribution and compute the standard
% deviation of the data points, our standard deviation estimate is accurate only
% to about +/- 12%.

FINAL OBSERVATIONS

Recall that the standard error of the mean for a random sample drawn from a Gaussian distribution is simply the standard deviation of the sample divided by the square root of the number of data points. Notice that in computing the standard error estimate, we are implicitly estimating the standard deviation of the population. But we have just seen that this standard deviation estimate may be somewhat inaccurate, depending on the number of data points. This implies that standard errors are themselves subject to noise.

To put it simply: we can put error bars on error bars. The error on error bars will tend to be high in the case of few data points.

No comments:

Post a Comment