Sunday, November 27, 2011

Fitting probability distributions to data

Here we cover the very basics of probability distributions and how to fit them to data.  We will see that one way to fit a probability distribution is to determine the parameters of the distribution that maximize the likelihood of the data.

CODE

% Let's look at a basic probability distribution, the Gaussian
% distribution.  In the one-dimensional case, there are two parameters,
% the mean and the standard deviation.  Let's see an example.
mn = 3;
sd = 1;
fun = @(x) 1/(sd*sqrt(2*pi)) * exp(-(x-mn).^2/(2*sd^2));

% We have defined a function that takes a value and returns
% the corresponding likelihood for a Gaussian distribution with
% mean 3 and standard deviation 1.  This function is called
% a probability density function.  Let's visualize it.
figure(999); clf; hold on;
xx = -1:.01:7;
h1 = plot(xx,feval(fun,xx),'r-','LineWidth',2);
xlabel('Value');
ylabel('Likelihood');
legend(h1,{'Probability density function'});



% For a Gaussian distribution, about 95% of the time, values drawn
% from the distribution will lie within two standard deviations
% of the mean.  In our example, this range is between 1 and 5.
xx2 = linspace(mn-2*sd,mn+2*sd,100);
h2 = bar(xx2,feval(fun,xx2),1);
set(h2,'FaceColor',[.8 .8 .8]);
set(h2,'EdgeColor','none');
uistack(h1,'top');



% We have shaded in the area underneath the probability density function
% that lies between 1 and 5.  If we were to calculate the actual area
% of this region, we would find that it is (approximately) 0.95.
% The total area underneath the probability density function is 1.
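% As a quick sanity check, we can approximate both of these areas using
% numerical (trapezoidal) integration via trapz.  (This check is a small
% addition; xxw below is just a wide grid covering essentially all of the
% distribution's mass.)
trapz(xx2,feval(fun,xx2))              % area between 1 and 5; approximately 0.9545
xxw = linspace(mn-8*sd,mn+8*sd,1000);
trapz(xxw,feval(fun,xxw))              % total area; approximately 1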
delete(h2);

% Now let's consider the concept of calculating the likelihood of a
% set of data given a particular probability distribution.  Using
% our example Gaussian distribution (mean 3, standard deviation 1),
% let's calculate the likelihood of a sample set of data.
data = [2.5 3 3.5];
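% (straightline is a helper function used in these posts; here it draws
% vertical lines at the given x-values.)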
h3 = straightline(data,'v','b-');
likelihood = prod(feval(fun,data));
legend([h1 h3(1)],{'Probability density function' 'Data points'});
title(sprintf('Likelihood of data points = %.6f',likelihood));
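% As a quick check, the likelihood is just the density evaluated at each
% data point, multiplied together:
feval(fun,2.5) * feval(fun,3) * feval(fun,3.5)   % should match the likelihood above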



% The likelihood of observing the data points 2.5, 3, and 3.5 is
% obtained by multiplying the likelihoods of each individual data
% point.  (We are assuming that the data points are independent.)
% Now, for comparison, let's calculate the likelihood of a different
% set of data.
delete(h3);
data = [4 4.5 5];
h3 = straightline(data,'v','b-');
likelihood = prod(feval(fun,data));
legend([h1 h3(1)],{'Probability density function' 'Data points'});
title(sprintf('Likelihood of data points = %.6f',likelihood));



% Notice that the likelihood of this new set of data is much smaller
% than that of the original set of data.

% Now that we know how to calculate the likelihood of a set of data
% given a particular probability distribution, we can think about
% how to actually fit a probability distribution to a set of data.
% All that we need to do is to set the parameters of the probability
% distribution such that the likelihood of the set of data is maximized.

% Let's do an example.  We have several data points and we want to fit
% a univariate (one-dimensional) Gaussian distribution to the data.
% To determine the optimal mean and standard deviation parameters,
% let's use generic numerical optimization (rather than an analytic
% solution).
data = randn(1,100)*2.5 + 4;
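% Define the negative log likelihood of the data as a function of a
% parameter vector pp, where pp(1) is the mean and pp(2) is the standard
% deviation.  (The abs() ensures a valid standard deviation even if the
% optimizer tries negative values for pp(2).)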
fun = @(pp) sum(-log(1/(abs(pp(2))*sqrt(2*pi)) * exp(-(data-pp(1)).^2/(2*pp(2)^2))));
options = optimset('Display','iter','FunValCheck','on', ...
                   'MaxFunEvals',Inf,'MaxIter',Inf,'TolFun',1e-6,'TolX',1e-6);
params = fminsearch(fun,[0 1],options);

% What we have done is to use fminsearch to determine the mean and standard
% deviation parameters that minimize the sum of the negative log likelihoods
% of the data points.  (Maximizing the product of the likelihoods of the data
% points is equivalent to maximizing the log of the product of the likelihoods
% of the data points, which is equivalent to maximizing the sum of the log of
% the likelihood of each data point, which is equivalent to minimizing the
% sum of the negative log likelihood of the data points.  Or, semi-formally:
%   Let <ps> be a vector of likelihoods.  Then,
%     max prod(ps) <=> max log(prod(ps))
%                  <=> max sum(log(ps))
%                  <=> min sum(-log(ps))
% Note that taking the log of the likelihoods helps avoid numerical
% precision issues; a quick demonstration of this is given after the
% visualization below.)  Let's visualize the results.
mn = params(1);
sd = abs(params(2));
fun = @(x) 1/(sd*sqrt(2*pi)) * exp(-(x-mn).^2/(2*sd^2));
figure(998); clf; hold on;
h2 = straightline(data,'v','k-');
ax = axis;
xx = linspace(ax(1),ax(2),100);
h1 = plot(xx,feval(fun,xx),'r-','LineWidth',2);
axis([ax(1:2) 0 max(feval(fun,xx))*1.1]);
xlabel('Value');
ylabel('Likelihood');
legend([h2(1) h1],{'Data' 'Model'});
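

% Here is the promised demonstration of why we work with log likelihoods.
% With many data points, the raw product of the likelihoods underflows to
% zero, whereas the sum of the log likelihoods remains finite and usable.
% (This is a sketch using a standard Gaussian; the specific numbers will
% vary from run to run.)
dens = @(x) 1/sqrt(2*pi) * exp(-x.^2/2);
manydata = randn(1,10000);
prod(feval(dens,manydata))        % underflows to 0
sum(log(feval(dens,manydata)))    % large and negative but finite (roughly -14000)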



% Let's take the mean and standard deviation parameters that we found
% using optimization and compare them to the mean and standard deviation
% of the data points.
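% (Note that std(data,1) normalizes by N rather than the default N-1;
% the N-normalized standard deviation is what maximum likelihood produces.)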
params
[mean(data) std(data,1)]



% Notice that the two sets of results are nearly identical (the small
% difference can be attributed to the finite tolerance of the optimization).  Thus,
% we see that computing the mean and standard deviation of a set of
% data can be viewed as implicitly fitting a one-dimensional
% Gaussian distribution to the data.
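

% To make the connection explicit, here is a sketch of the closed-form
% maximum-likelihood estimates for a univariate Gaussian: the sample mean
% and the N-normalized standard deviation.  (The variable names below are
% our own.)
n = length(data);
mn_mle = sum(data)/n                          % same as mean(data)
sd_mle = sqrt(sum((data-mn_mle).^2)/n)        % same as std(data,1)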
