Saturday, November 19, 2011

Nonparametric probability distributions

Parametric probability distributions (e.g. the Gaussian distribution) make certain assumptions about the distribution of a set of data. Instead of making these assumptions, we can use nonparametric methods to estimate probability distributions (such methods include bootstrapping, histograms, and kernel density estimation).


% Generate some data points.
x = 4 + randn(1,100).^2 + 0.2*randn(1,100);

% Let's visualize the data.
figure(999); clf; hold on;
axis([0 12 0 0.8]); ax = axis;
h1 = straightline(x,'v','r-');

% We have drawn a vertical line at each data point.
% Imagine that the vertical lines collectively
% define a probability distribution that has a value
% of 0 everywhere except for the data points
% where there are infinitely high spikes.  This
% probability distribution is what the bootstrapping
% method is in fact implicitly using --- drawing
% random samples with replacement from the
% original data is equivalent to treating the data
% points themselves as defining the probability
% distribution estimate.

% Now let's construct a histogram.
[nn,cc] = hist(x,20);
binwidth = cc(2)-cc(1);
h2 = bar(cc,nn/sum(nn) * (1/binwidth),1);

% The histogram is a simple nonparametric method for
% estimating the probability distribution associated with
% a set of data.  It is nonparametric because it makes
% no assumption about the shape of the probability
% distribution.  Note that we have scaled the histogram
% in such a way as to ensure that the area occupied by
% the histogram equals 1, which is a requirement for
% true probability distributions.

% The simplest and most common method for summarizing a
% set of data is to compute the mean and standard deviation
% of the data.  This can be seen as using a parametric
% probability distribution, the Gaussian distribution,
% to interpret the data.  Let's make this connection
% explicit by plotting a Gaussian probability distribution
% whose mean and standard deviation are matched to that
% of the data.
mn = mean(x);
sd = std(x);
xx = linspace(ax(1),ax(2),100);
h3 = plot(xx,evalgaussian1d([mn sd 1/(sd*sqrt(2*pi)) 0],xx),'g-','LineWidth',2);

% Notice that the Gaussian distribution is reasonably
% matched to the data points and the histogram but tends
% to overestimate the data density at low values.

% The histogram is a crude method since the probability
% distributions that it creates have discontinuities (the
% corners of the black rectangles).  To avoid discontinuities,
% we can instead use kernel density estimation.  In this method,
% we place a kernel at each data point and sum across the
% kernels to obtain the final probability distribution estimate.
% Let's apply kernel density estimation to our example using
% a Gaussian kernel with standard deviation 0.5.  (To ensure
% visibility, we have made the height of each kernel higher than
% it actually should be.)
kernelwidth = 0.5;
vals = [];
h4 = [];
for p=1:length(x)
  vals(p,:) = evalgaussian1d([x(p) kernelwidth 1/(kernelwidth*sqrt(2*pi)) 0],xx);
  h4(p) = plot(xx,1/10 * vals(p,:),'c-');
h5 = plot(xx,mean(vals,1),'b-','LineWidth',2);
  % Note that we could have used MATLAB's ksdensity function to achieve
  % identical results:
  % vals = ksdensity(x,xx,'width',0.5);
  % h5 = plot(xx,vals,'b-');

% Finally, add a legend.
legend([h1(1) h2(1) h3(1) h4(1) h5(1)],{'Data' 'Histogram' 'Gaussian' 'Kernels' 'Kernel Density Estimate'});

% Notice that the kernel density estimate is similar to the histogram, albeit smoother.


When constructing histograms, a free parameter is the number of bins to use.  Typically, histograms are used as an exploratory method, so it would be quite acceptable to try out a variety of different values (so as to get a sense of what the data are like).

Notice that the number of bins used in a histogram is analogous to the width of the kernel used in kernel density estimation.

How do we choose the correct kernel width in kernel density estimation?  If the kernel is too small, then we risk gaps in the resulting probability distribution, which is probably not representative of the true distribution.  If the kernel is too large, then the resulting probability distribution will tend towards being uniform, which is also probably not representative of the true distribution.  We will address selection of kernel width in a later post, but the short answer is that we can use cross-validation to choose the optimal kernel width.

No comments:

Post a Comment