Monday, November 21, 2011

Peeking at P-values

Guest Post by Jon
A common experimental practice is to collect data and then run a statistical test to evaluate whether the result differs significantly from what the null hypothesis predicts. Sometimes researchers peek at the data before data collection is finished and run a preliminary analysis. If the test comes back negative, more data are collected; if it comes back positive, data collection stops. This strategy invalidates the statistical test by inflating the probability of observing a false positive.

In this post we demonstrate how much inflation this strategy produces. We sample from a distribution with a zero mean and test whether the sample mean differs from 0. As we will see, if we continually peek at the data and decide whether to continue data collection based on the partial results, we wind up with an elevated chance of rejecting the null hypothesis.

% Simulate 1000 experiments with 200 data points each
x = randn(200,1000);

% We expect about 5% false positives, given an alpha of 0.05
disp(mean(ttest(x, 0, 0.05)));

% Now let's calculate the rate of false positives for different sample
% sizes. We assume a minimum of 10 samples and a maximum of 200.
h = zeros(size(x));
for ii = 10:200
    h(ii,:) = ttest(x(1:ii, :), 0, 0.05);
end

% The chance of a false positive is about 0.05, no matter how many data points
figure(99); hold on
plot(10:200, mean(h(10:200, :), 2), 'r-', 'LineWidth', 2)
ylim([0 1])
ylabel('Probability of a false positive')
xlabel('Number of samples')

% How would the false positive rate change if we peeked at the data?
% To simulate peeking, we take the cumulative sum of h values for each
% simulation. The result of this is that if at any point we reject the null
% (h=1), the remaining points for that simulation also assume we rejected
% null.
peekingH = logical(cumsum(h));

figure(99); hold on
plot(10:200, mean(peekingH(10:200, :), 2), 'k-', 'LineWidth', 2)
legend('No peeking', 'Peeking')

% The plot demonstrates the problem with peeking: we set the nominal
% false positive rate to our alpha value (here, 0.05), but peeking has
% produced an actual false positive rate that is much higher.
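For readers without MATLAB, here is a sketch of the same simulation in Python using NumPy and SciPy's `scipy.stats.ttest_1samp`. The seed and the exact rates are illustrative; the qualitative result matches the plot above: the fixed-sample rate sits near the nominal 0.05, while the peeking rate is far higher.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_max, n_sims, alpha = 200, 1000, 0.05

# 1000 simulated experiments of 200 samples each; the null is true (mean 0).
x = rng.standard_normal((n_max, n_sims))

# Fixed-sample test at n = 200: roughly 5% false positives.
_, p = stats.ttest_1samp(x, 0.0, axis=0)
fixed_rate = np.mean(p < alpha)

# Peeking: test after every sample size from 10 to 200, and count a
# simulation as a false positive if it rejects at ANY of those looks.
h = np.zeros((n_max, n_sims), dtype=bool)
for n in range(10, n_max + 1):
    _, p = stats.ttest_1samp(x[:n], 0.0, axis=0)
    h[n - 1] = p < alpha
peek_rate = np.mean(h.any(axis=0))

print(f"fixed-sample rate: {fixed_rate:.3f}, peeking rate: {peek_rate:.3f}")
```

The gap grows with the number of looks: each additional interim test is another opportunity to reject the null by chance, and a rejection at any look terminates the experiment.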


  1. Very interesting... This is a bit like the issue of multiple comparisons, except that instead of having several different samples that you are testing for an effect, you are testing the same sample multiple times (although the sample grows in size).

    I have never seen a paper describe how an experiment was actually conducted (e.g. we tried a pilot, then tweaked this or that, and then collected a few more data points, etc.); rather, papers tend to summarize and streamline the messy experiment into a nice clean one. There is, of course, a good reason for this: it makes the paper simpler. But as this post shows, this process can in some cases be highly misleading and statistically invalid.

  2. I should have noted in the original post: I got this idea from a discussion with Nick Davidenko many years ago. -Jon