Tuesday, November 8, 2011

Testing null hypotheses using randomization, the bootstrap, and Monte Carlo methods

Earlier we saw how randomization can be used to test the null hypothesis that there is no meaningful order in a set of data. This is actually just a special case of the general statistical strategy of using randomness to see what happens under various null hypotheses. This general statistical strategy covers not only randomization methods but also Monte Carlo methods and the bootstrap.  Let's look at some examples.

Example 1: Using randomization to test whether two sets of data are different (e.g. in their means)

In this example, we use randomization to check whether two sets of data come from different probability distributions. Let's pose the null hypothesis that the two sets of data actually come from the same probability distribution. Under this hypothesis, the two sets of data are interchangeable, so if we aggregate the data points and randomly divide the data points into two sets, then the results should be comparable to results obtained from the original data.

% Generate two sets of data.
x = randn(1,100) + 1.4;
y = randn(1,100) + 1;

% Let's aggregate the data, randomly divide the data, and compute the resulting
% differences in means.  And for comparison, let's compute the actual difference in means.
[pval,val,dist] = randomization([x y],2,@(a) mean(a(1:100))-mean(a(101:200)),10000,0);

% Visualize the results.
figure(999); clf; hold on;
hist(dist,100);
h1 = straightline(val,'v','r-');
ax = axis; axis([-.8 .8 ax(3:4)]);
xlabel('Difference in the means');
ylabel('Frequency');
legend(h1,{'Actual observed value'});
title(sprintf('Results obtained using random assignments into two groups; p (two-tailed) = %.4f',pval));



% We see that the actual difference in means, 0.525, is quite unlikely with respect
% to the differences in means obtained via randomization.  The p-value is 0.0002,
% referring to the proportion of the distribution that is more extreme than the
% actual difference in means.  (Since we don't have an a priori reason to expect
% a positive or negative difference, we use a two-tailed p-value --- that is,
% not only do we count the number of values greater than 0.525, but we also count
% the number of values less than -0.525.)  Thus, the null hypothesis is probably
% incorrect.

% Note that strictly speaking, we have not shown that the two sets of data come
% from probability distributions with different means --- all we have done is to
% reject the hypothesis that the sets of data come from the same probability
% distribution.
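The helper functions used above (randomization, straightline) are not defined in this post. As a rough self-contained sketch of the same permutation test in Python/NumPy (the random seed and variable names here are my own choices, not from the original code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two sets of data, generated as in the example above.
x = rng.standard_normal(100) + 1.4
y = rng.standard_normal(100) + 1.0

# Actual observed difference in means.
val = x.mean() - y.mean()

# Aggregate the data, then repeatedly shuffle and re-split into two
# groups of the original sizes, recording each difference in means.
agg = np.concatenate([x, y])
dist = np.empty(10000)
for i in range(10000):
    perm = rng.permutation(agg)
    dist[i] = perm[:100].mean() - perm[100:].mean()

# Two-tailed p-value: proportion of shuffled differences at least as
# extreme (in absolute value) as the observed difference.
pval = np.mean(np.abs(dist) >= np.abs(val))
```

Because the labels are exchangeable under the null hypothesis, the shuffled differences should scatter around zero, and the observed difference should fall far in the tail if the null is false.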

Example 2: Using the bootstrap to test whether two sets of data are different (e.g. in their means)

In the previous example, we posed the null hypothesis that the two sets of data come from the same distribution and then used randomization to generate new datasets.  Notice that instead of using randomization to generate new datasets, we can use the bootstrap.  The difference is that with randomization, we enforce that no data point is repeated within a set of data or across the two sets of data, whereas with the bootstrap, we do not enforce these constraints --- we generate new datasets by simply drawing data points with replacement from the aggregated data.  (The bootstrap strategy seems preferable, as it probably generates more representative random samples, but this remains to be proven...)

% Let's go through an example using the same data from Example 1.
% Aggregate the data, draw bootstrap samples, and compute the resulting
% differences in means.
dist = bootstrap([x y],@(a) mean(a(1:100)) - mean(a(101:200)),10000);

% Compute the actual observed difference in means as well as the corresponding p-value.
val = mean(x) - mean(y);
pval = sum(abs(dist) > abs(val)) / length(dist);

% Visualize the results.
figure(998); clf; hold on;
hist(dist,100);
h1 = straightline(val,'v','r-');
ax = axis; axis([-.8 .8 ax(3:4)]);
xlabel('Difference in the means');
ylabel('Frequency');
legend(h1,{'Actual observed value'});
title(sprintf('Results obtained using bootstraps of aggregated data; p (two-tailed) = %.4f',pval));



% We find that the actual difference in means is quite unlikely with respect
% to the differences in means obtained via bootstrapping.  Moreover, the p-value is
% quite similar to that obtained with the randomization method.
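The bootstrap helper is likewise not defined in this post. A self-contained sketch of the same idea in Python/NumPy (seed and names are my own): the only change from the permutation version is that each new dataset is drawn with replacement from the aggregated data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two sets of data, generated as in the example above.
x = rng.standard_normal(100) + 1.4
y = rng.standard_normal(100) + 1.0

# Actual observed difference in means.
val = x.mean() - y.mean()

# Aggregate the data, then draw bootstrap samples (with replacement)
# and split each sample into two groups of the original sizes.
agg = np.concatenate([x, y])
n = agg.size
dist = np.empty(10000)
for i in range(10000):
    boot = rng.choice(agg, size=n, replace=True)
    dist[i] = boot[:100].mean() - boot[100:].mean()

# Two-tailed p-value, computed as in the randomization example.
pval = np.mean(np.abs(dist) >= np.abs(val))
```

Unlike the permutation version, a given data point may appear several times (or not at all) in a bootstrap sample; the resulting null distribution is typically very similar.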

Example 3: Using Monte Carlo simulations to test whether a sequence of numbers is different from noise

The randomization (Example 1) and bootstrap (Example 2) methods that we have seen depend on having an adequate number of data points.  With sufficient data points, the empirical data distribution can serve as a reasonable proxy for the true data distribution.  However, when the number of data points is small, the empirical data distribution may be a poor proxy for the true data distribution and so statistical procedures based on the empirical data distribution may suffer.

One way to reduce dependence on the empirical data distribution is to bring in assumptions (e.g. priors) about what the true data distribution is like.  For example, one simple assumption is that the data follow a Gaussian distribution.  Let's look at an example.

% Suppose we observe the following data, where the x-axis is an
% independent variable (ranging from 1 to 5) and the y-axis
% is a dependent (measured) variable.
x = [1.5 3.2 2 2.5 4];
figure(997); clf; hold on;
scatter(1:5,x,'ro');
axis([0 6 0 6]);
xlabel('Independent variable');
ylabel('Dependent variable');
val = calccorrelation(1:5,x);
title(sprintf('r = %.4g',val));



% Although the observed correlation is quite high, how do we know
% the data aren't just random noise?  Let's be more precise:
% Let's pose the null hypothesis that the data are simply random
% draws from a fixed Gaussian distribution.  Under this hypothesis,
% what are correlation values that we would expect to obtain?
% To be principled, let's take random draws from a
% Gaussian distribution matched to the observed data with respect
% to mean and standard deviation, although because correlation is
% insensitive to gain and offset, matching the observed data
% in terms of mean and standard deviation is not actually necessary.
fake = mean(x) + std(x)*randn(10000,5);
dist = calccorrelation(repmat(1:5,[10000 1]),fake,2);
pval = sum(dist > val) / length(dist);
figure(996); clf; hold on;
hist(dist,100);
h1 = straightline(val,'v','r-');
xlabel('Correlation');
ylabel('Frequency');
legend(h1,{'Actual observed correlation'});
title(sprintf('Results obtained using Monte Carlo simulations; p (one-tailed) = %.4f',pval));



% Notice that the p-value is 0.10, which means that about 10% of the time,
% purely random data would produce a correlation value as extreme as the
% one actually observed.  Thus, we cannot reject the null hypothesis that
% the data are purely random Gaussian noise.
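The correlation helper (calccorrelation) is not defined in this post either. A self-contained sketch of this Monte Carlo test in Python/NumPy, using the same five data points (the seed and function names are my own):

```python
import numpy as np

# Observed data from the example above: independent variable 1..5
# and the measured dependent variable.
xvals = np.arange(1, 6)
data = np.array([1.5, 3.2, 2.0, 2.5, 4.0])

def corr(a, b):
    # Pearson correlation between two 1-D arrays.
    return np.corrcoef(a, b)[0, 1]

# Actual observed correlation.
val = corr(xvals, data)

# Under the null hypothesis, draw fake datasets from a Gaussian matched
# to the observed data in mean and standard deviation, and compute the
# correlation of each fake dataset with the independent variable.
rng = np.random.default_rng(0)
fake = data.mean() + data.std() * rng.standard_normal((10000, 5))
dist = np.array([corr(xvals, row) for row in fake])

# One-tailed p-value: proportion of null correlations exceeding
# the observed correlation.
pval = np.mean(dist > val)
```

With only five data points, the null distribution of correlations is very wide, which is why even a fairly high observed correlation yields a p-value of only about 0.10.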
