Random analyses in MATLAB

Learning MATLAB

2014-01-26T16:56:00.003-08:00

The new version of the Statistics and data analysis in MATLAB class is available at:

http://artsci.wustl.edu/~kkay/psych5007/

The basics of MATLAB programming are now covered, with accompanying lecture videos.

New statistics class

2012-04-05T10:19:00.002-07:00

Psych216A: Statistics and data analysis in MATLAB (Spring 2012)
http://white.stanford.edu/~knk/Psych216A/
Lecture videos and other materials are available online.

Principal components analysis

2012-01-14T00:34:00.000-08:00

Principal components analysis (PCA) is useful for data exploration and dimensionality reduction. In this post, we will see that (1) PCA is just an application of SVD, (2) PCs define an orthogonal coordinate system such that in this system the data are uncorrelated, (3) PCs maximize the variance explained in the data, and (4) we can often use a small number of PCs to reconstruct (or approximate) a given set of data.

What are principal components (PCs)?

Suppose we have a data matrix X with dimensions m x n, where each row corresponds to a different data point and each column corresponds to a different attribute. (For example, if we measured the height and weight of 1000 different people, then we could construct a data matrix with dimensions 1000 x 2.) Furthermore, let's presume that the mean of each column has been subtracted off (why this is important will become clear later). If we take the SVD of X, we obtain matrices U, S, and V such that X = U*S*V'. The columns of V are mutually orthogonal unit-length vectors; these are the principal components (PCs) of X.
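
As a minimal sketch of the definition above, the following uses only built-in MATLAB functions. The synthetic data, the covariance matrix C, and all variable names are assumptions made for illustration. It checks numerically that the columns of V are orthonormal and that U*S*V' reproduces X.

```matlab
% A sketch under stated assumptions: synthetic 1000 x 2 data, built-in functions only.
rng(0);                                   % fix the random seed (arbitrary choice)
C = [1 .6; .6 1];                         % assumed covariance of the two attributes
X = randn(1000,2) * chol(C);              % 1000 data points x 2 attributes
X = bsxfun(@minus, X, mean(X,1));         % subtract the mean of each column
[U,S,V] = svd(X,0);                       % economy-size SVD, so that X = U*S*V'
orthoerr = max(max(abs(V'*V - eye(2))));  % deviation of V'*V from the identity
reconerr = max(max(abs(U*S*V' - X)));     % deviation of U*S*V' from X
fprintf('orthonormality error = %.2g, reconstruction error = %.2g\n',orthoerr,reconerr);
```

Both errors should be at the level of floating-point round-off.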

SVD on the data matrix or the covariance matrix

Instead of taking the SVD of X (the data matrix), we can take the SVD of X'*X (the covariance matrix). This is because X'*X = (V*S'*U')*(U*S*V') = V*S.^2*V' (where S.^2 is a square matrix with the square of the diagonal entries of S along the diagonal). So, if we take the SVD of X'*X, the resulting V matrix should be identical to that obtained when taking the SVD of X (except for potential sign flips of the columns). A potential benefit of taking the SVD of the covariance matrix is reduced computational time. For example, if m >> n, then X is a large matrix of size m x n whereas X'*X is a small matrix of size n x n.
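
The equivalence can be checked numerically. The sketch below assumes synthetic data and a made-up mixing matrix. It compares the V obtained from the data matrix with the V obtained from X'*X after aligning column signs, and compares the squared singular values of X with the singular values of X'*X.

```matlab
% A sketch under stated assumptions: synthetic data, hypothetical mixing matrix.
rng(0);
mix = [4 1 0 0; 0 2 1 0; 0 0 1 .5; 0 0 0 .5];   % assumed mixing matrix (correlates the columns)
X = randn(500,4) * mix;                   % 500 data points x 4 attributes
X = bsxfun(@minus, X, mean(X,1));         % center each column
[U1,S1,V1] = svd(X,0);                    % SVD of the m x n data matrix
[U2,S2,V2] = svd(X'*X);                   % SVD of the n x n matrix X'*X
signs = sign(sum(V1.*V2,1));              % +1 or -1 for each pair of corresponding columns
Verr = max(max(abs(V1 - bsxfun(@times,V2,signs))));    % mismatch after sign alignment
Serr = max(abs(diag(S1).^2 - diag(S2))) / S2(1,1);     % relative mismatch of S.^2
fprintf('V mismatch = %.2g, S.^2 mismatch = %.2g\n',Verr,Serr);
```

Note that the agreement of the columns of V relies on the singular values being distinct, which is the typical situation for real data.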

PCA decorrelates data

One way to think about PCA is that it decorrelates the data matrix. Prior to PCA, the columns of the data matrix may have some correlation with each other, i.e. the dot-product of any given pair of columns of X may be non-zero. What PCA does is to provide a linear transformation of the data such that after the transformation, the columns of the data matrix are uncorrelated with one another. The transformation is specifically given by multiplication with the matrix V.

What exactly does multiplication with V do? V has dimensions n x n and contains the principal components (PCs) in the columns. Given a point in n-dimensional space (i.e. a vector of dimensions 1 x n), if we multiply that point with V, what we are doing is projecting the point onto each of the PCs. This yields the coordinates of the point with respect to the space defined by the PCs. Since the PCs form an orthogonal basis, all we are really doing is rotating the space.

Now let's see what happens when we take the data matrix X and multiply it with V. Since X = U*S*V', X*V = U*S*V'*V = U*S. The result, U*S, has the property that the columns are uncorrelated with one another. The reason is that the columns of U are already mutually orthogonal by way of the SVD; and since S is a diagonal matrix, multiplication with S simply rescales the columns of U, which does not change the condition of mutual orthogonality.

% Let's see an example. Here we create 1000 points
% in two-dimensional space.
X = zeromean(randnmulti(1000,[],[1 .6; .6 1],[1 .5]),1);
[U,S,V] = svd(X,0);
figure; setfigurepos([100 100 500 250]);
subplot(1,2,1); hold on;
scatter(X(:,1),X(:,2),'r.');
axis equal;
h1 = drawarrow([0 0],V(:,1)','k-',[],10,'LineWidth',2);
h2 = drawarrow([0 0],V(:,2)','b-',[],10,'LineWidth',2);
legend([h1 h2],{'PC 1' 'PC 2'});
xlabel('Dimension 1');
ylabel('Dimension 2');
title('Data');
subplot(1,2,2); hold on;
X2 = X*V;
V2 = (V'*V)';
scatter(X2(:,1),X2(:,2),'r.');
axis equal;
h1 = drawarrow([0 0],V2(:,1)','k-',[],10,'LineWidth',2);
h2 = drawarrow([0 0],V2(:,2)','b-',[],10,'LineWidth',2);
legend([h1 h2],{'PC 1' 'PC 2'});
xlabel('Projection onto PC 1');
ylabel('Projection onto PC 2');
title('Data projected onto PCs');

% In the first panel, we plot the data as red dots. Notice
% that the two dimensions are moderately correlated with each other.
% By taking the SVD of the data, we obtain the PCs, and we plot
% the PCs as black and blue arrows. Notice that the PCs are
% orthogonal to each other. Also, notice that the first PC points
% in the direction along which the data tends to lie.
%
% In the second panel, we project the data onto the PCs and
% re-plot the data. Notice that the only thing that has
% happened is that the space has been rotated. In this new
% space, the data points are uncorrelated and the PCs are
% now aligned with the coordinate axes.

% Let's look at another example. Whereas in the previous example
% we ensured that the columns of the data matrix were zero-mean,
% in this example, we will intentionally make the columns have
% non-zero means.
X = randnmulti(100,[1 -2],[1 .7; .7 1],[1 .5]);
% (now repeat the code above)

% In this example, the first PC again points in the direction along
% which the data tend to lie. (Actually, strictly speaking, the
% first PC points in the opposite direction. But there is a sign
% ambiguity in SVD --- the signs of corresponding columns of the U
% and V matrices can be flipped with no change to the overall math.
% So if we wanted to, we could simply flip the sign of the first PC (which
% corresponds to the first column of V) and also flip the sign of
% the first column of U.) Notice that the first PC does not
% point in the direction of the elongation of the cloud of points;
% rather, the first PC points towards the middle of the cloud.
% The reason this happens is that the columns of the data matrix were not
% mean-subtracted (i.e. centered), and as it turns out, the primary effect
% in the data is displacement from the origin.
%
% (One consequence of neglecting to center each column is that the columns
% of the data matrix after projection onto the PCs may have some correlation
% with one another. After projection onto the PCs, it is guaranteed only
% that the dot-product of the columns is zero. Correlation (r) involves
% more than just a dot-product; it involves both mean-subtraction and
% unit-length-normalization before computing the dot product. Thus, there
% is no guarantee that the columns are uncorrelated. Indeed, in this
% example, the correlation after projection onto the PCs is r=-0.23.)
%
% Whether or not to subtract off the mean of each data column before computing
% the SVD depends on the nature of the data --- it is up to you to decide.
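
The distinction between a zero dot product and a zero correlation can be illustrated with built-in functions only. In the sketch below, the data and variable names are assumptions. The projections of uncentered data have a zero dot product but, in general, a non-zero correlation, whereas the projections of centered data have both.

```matlab
% A sketch under stated assumptions: synthetic data whose column means are roughly [1 -2].
rng(0);
X = bsxfun(@plus, randn(100,2)*chol([1 .7; .7 1]), [1 -2]);
[U,S,V] = svd(X,0);                       % SVD without centering
P = X*V;                                  % project the data onto the PCs
dotprod = P(:,1)'*P(:,2);                 % zero up to round-off
R = corrcoef(P(:,1),P(:,2)); r = R(1,2);  % in general not zero
Xc = bsxfun(@minus, X, mean(X,1));        % now center each column first
[Uc,Sc,Vc] = svd(Xc,0);
Pc = Xc*Vc;
Rc = corrcoef(Pc(:,1),Pc(:,2)); rc = Rc(1,2);   % zero up to round-off
fprintf('uncentered: dot product = %.2g, r = %.2f; centered: r = %.2g\n',dotprod,r,rc);
```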

PCs point towards maximal variance in the data

% Principal components have a particular ordering --- each principal
% component points in the direction of maximal variance that is
% orthogonal to each of the previous principal components. In this
% way, each principal component accounts for the maximal possible
% amount of variance, ignoring the variance already accounted for
% by the previous principal components. (With respect to explaining
% variance, it would be pointless for a given vector to be
% non-orthogonal to all previous ones; the extra descriptive
% power afforded by a vector lies only in the component of the
% vector that is orthogonal to the existing subspace.)

% Let's see an example.
X = zeromean(randnmulti(50,[],[1 .8; .8 1],[1 .5]),1);
figure; setfigurepos([100 100 500 250]);
subplot(1,2,1); hold on;
scatter(X(:,1),X(:,2),'r.');
axis equal;
xlabel('Dimension 1');
ylabel('Dimension 2');
title('Data');
subplot(1,2,2); hold on;
h = [];
for p=1:size(X,1)
  h(p) = plot([0 X(p,1)],[0 X(p,2)],'k-');
end
scatter(X(:,1),X(:,2),'r.');
scatter(0,0,'g.');
axis equal; ax = axis;
xlabel('Dimension 1');
ylabel('Dimension 2');
title('Variance of the data');

% In the first plot, we simply plot the data. Before
% proceeding, we need to understand what it means to explain
% variance in data. Variance (without worrying about
% mean-subtraction or the normalization term) is simply the
% sum of the squares of the values in a given set of data.
% Now, since the sum of the squares of the coordinates of a
% data point is the same as the square of the distance of the
% data point from the origin, we can think of variance as
% equivalent to squared distance. To illustrate this, in the
% second plot we have drawn a black line between each data
% point and the origin. The aggregate of all of the black lines
% can be thought of as representing the variance of the data.
% If we have a model that is attempting to fit the data, we
% can ask how close the model comes to the data points.
% The closer that the model is to the data, the more variance
% the model explains in the data. Currently, without a model,
% our model fit is simply the origin, and we have 100% of the
% variance left to explain.
%
% What we would like to determine is the direction that accounts
% for maximal variance in the data. That is, we are looking for
% a vector such that if we were to use that vector to fit the
% data points, the fitted points would be as close to the data
% as possible.
figure; setfigurepos([100 100 500 250]);
subplot(1,2,1); hold on;
direction = unitlength([.2 1]');
Xproj = X*direction*direction';
h = [];
for p=1:size(X,1)
  h(p) = plot([Xproj(p,1) X(p,1)],[Xproj(p,2) X(p,2)],'k-');
end
h0 = scatter(X(:,1),X(:,2),'r.');
h1 = scatter(Xproj(:,1),Xproj(:,2),'g.');
axis(ax);
xlabel('Dimension 1');
ylabel('Dimension 2');
title('Variance remaining for sub-optimal direction');
subplot(1,2,2); hold on;
[U,S,V] = svd(X,0);
direction = V(:,1);
Xproj = X*direction*direction';
h = [];
for p=1:size(X,1)
  h(p) = plot([Xproj(p,1) X(p,1)],[Xproj(p,2) X(p,2)],'k-');
end
h0 = scatter(X(:,1),X(:,2),'r.');
h1 = scatter(Xproj(:,1),Xproj(:,2),'g.');
axis(ax);
xlabel('Dimension 1');
ylabel('Dimension 2');
title('Variance remaining for optimal direction');

% In the first plot, we have deliberately chosen a sub-optimal
% direction (the direction points slightly to the right of vertical).
% Using the given direction, we have determined the best possible fit
% to each data point; the fitted points are shown in green. The
% distance from the fitted points to the actual data points is
% indicated by black lines. In the second plot, we have chosen the
% optimal direction, namely, the first principal component of the data.
% Notice that the total distance from the fitted points to the actual data
% points is much smaller in the second case than in the first. This
% reflects the fact that the first principal component explains much
% more variance than the direction we chose in the first plot.
%
% The idea, then, is to repeat this process iteratively --- first,
% we determine the vector that approximates the data as best as
% possible, then we add in a second vector that improves the
% approximation as much as possible, and so on.
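
The claim that the first PC is the optimal direction can be checked by brute force. The sketch below assumes synthetic two-dimensional data and a 1-degree grid of candidate directions. It computes the variance captured by each direction and compares the best one to the first column of V.

```matlab
% A sketch under stated assumptions: synthetic data, directions sampled 1 degree apart.
rng(0);
X = randn(50,2) * chol([1 .8; .8 1]);
X = bsxfun(@minus, X, mean(X,1));         % center each column
[U,S,V] = svd(X,0);
angs = linspace(0,pi,181);                % candidate directions
explained = zeros(size(angs));
for k = 1:numel(angs)
  d = [cos(angs(k)); sin(angs(k))];       % unit-length direction vector
  explained(k) = sum((X*d).^2);           % sum of squares of the projections onto d
end
[best,ix] = max(explained);
bestdir = [cos(angs(ix)); sin(angs(ix))];
alignment = abs(bestdir'*V(:,1));         % close to 1 if the directions agree (up to sign)
fprintf('best sampled direction captures %.1f%% of what PC 1 captures\n',100*best/S(1,1)^2);
```

No sampled direction should capture more than S(1,1)^2, and the best sampled direction should nearly coincide with the first PC.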

Singular values indicate variance explained

% A nice characteristic of PCA is that the PCs define
% an orthogonal coordinate system. Because of this property,
% the incremental improvements with which the PCs approximate
% the data are exactly additive.
%
% (To see why, imagine you have a point that is located at (x,y,z).
% The squared distance to the origin is x^2+y^2+z^2.
% If we use the x-axis to approximate the point,
% the model fit is (x,0,0) and the remaining squared distance is
% y^2+z^2. If we then use the y-axis to approximate the point,
% the model fit is (x,y,0) and the remaining squared distance is
% z^2. Finally, if we use the z-axis to approximate the point,
% the model fit is (x,y,z) and there is zero remaining distance.
% Thus, due to the geometric properties of Euclidean space, all
% of the variance components add up exactly.)
%
% A little math can show that the variance accounted
% for by individual principal components is given by the squares of
% the diagonal elements of the matrix S (which are also known as
% the singular values).
%
% The proportion of the total variance in a dataset that
% is accounted for by the first N PCs, where N ranges
% from 1 to the number of dimensions in the data, can
% be calculated simply as
%   cumsum(diag(S).^2) / sum(diag(S).^2) * 100.
% This sequence of percentages is useful when choosing
% a small number of PCs to summarize a dataset.
% We will see an example of this below.
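
The relationship between singular values and variance can be verified directly. The sketch below assumes synthetic data. It projects the data onto the PCs, sums the squares along each PC, and compares the result to diag(S).^2. It also checks that the squared singular values add up to the total sum of squares in the data.

```matlab
% A sketch under stated assumptions: synthetic 200 x 5 data with correlated columns.
rng(0);
X = randn(200,5) * randn(5,5);
X = bsxfun(@minus, X, mean(X,1));         % center each column
[U,S,V] = svd(X,0);
P = X*V;                                  % coordinates of the data with respect to the PCs
perpc = sum(P.^2,1)';                     % sum of squares along each PC
total = sum(X(:).^2);                     % total sum of squares in the data
err1 = max(abs(perpc - diag(S).^2)) / total;    % per-PC variance vs. squared singular values
err2 = abs(total - sum(diag(S).^2)) / total;    % additivity of the variance components
varex = cumsum(diag(S).^2) / sum(diag(S).^2) * 100;
fprintf('cumulative variance explained (%%): %s\n',mat2str(varex',4));
```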

Matrix reconstruction

% In data exploration, it is often useful to look at the big
% effects in the data. A quick and dirty technique is to identify
% a small number of PCs that define a subspace within which
% most of the data resides.
temp = unitlength(rand(100,9),1);
X = randnmulti(11,rand(1,9),temp'*temp);
[U,S,V] = svd(X,0);
varex = cumsum(diag(S).^2) / sum(diag(S).^2) * 100;
Xapproximate = [];
for p=1:size(X,2)
  % this is a nice trick for using the first p principal
  % components to approximate the data matrix. we leave it
  % to the reader to verify why this works.
  Xapproximate(:,:,p) = U(:,1:p)*S(1:p,1:p)*V(:,1:p)';
end
mn = min(X(:));
mx = max(X(:));
figure; setfigurepos([100 100 500 500]);
for p=1:size(X,2)
  subplot(3,3,p); hold on;
  imagesc(Xapproximate(:,:,p),[mn mx]);
  axis image; axis off;
  title(sprintf('PC %d, %.1f%% Variance',p,varex(p)));
end

% What we have done here is to create a dataset (dimensions 11 x 9)
% and then use an increasing number of PCs to approximate the data.
% The full dataset corresponds to "PC 9" where we use all 9
% PCs to approximate the data. Notice that using just 3 PCs
% allows us to account for 92.9% of the variance in the original data.
%
% A useful next step might be to try and interpret the first 3 PCs
% and/or to visualize the data projected onto the first 3 PCs.
% If we gained an understanding of what is happening in the
% first 3 PCs of the data, it would probably be safe to deem that we
% have a good understanding of the data.
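
One way to check the reconstruction trick numerically is sketched below, assuming a random 11 x 9 matrix and using built-in functions only. The squared error that remains after approximating X with the first p PCs should equal the sum of the squared singular values that were left out.

```matlab
% A sketch under stated assumptions: random 11 x 9 data matrix.
rng(0);
X = randn(11,9);
[U,S,V] = svd(X,0);
s2 = diag(S).^2;                          % squared singular values
resid = zeros(9,1);
expected = zeros(9,1);
for p = 1:9
  Xp = U(:,1:p)*S(1:p,1:p)*V(:,1:p)';     % approximation using the first p PCs
  resid(p) = sum(sum((X - Xp).^2));       % squared error that remains
  expected(p) = sum(s2(p+1:end));         % squared singular values not yet used
end
mismatch = max(abs(resid - expected)) / sum(s2);
fprintf('largest mismatch (relative to total sum of squares) = %.2g\n',mismatch);
```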

SVD and covariance matrices

2011-12-28T21:10:00.001-08:00

This post describes singular value decomposition (SVD) and how it applies to covariance matrices. As we will see in later posts, SVD and covariance matrices are central to understanding principal components analysis (PCA), linear regression, and multivariate Gaussian probability distributions.

SVD

Singular value decomposition (SVD) decomposes a matrix X into three matrices U, S, and V such that:
X = U*S*V'
The columns of U are mutually orthogonal unit-length vectors (these are called the left singular vectors); the columns of V are mutually orthogonal unit-length vectors (these are called the right singular vectors); and S is a matrix with zeros everywhere except for non-negative values along the diagonal (these values are called the singular values).

% Let's see an example.
X = randn(10,3);
[U,S,V] = svd(X);
figure; setfigurepos([100 100 500 150]);
subplot(1,3,1); imagesc(U'*U); axis image square; colorbar; title('U^TU');
subplot(1,3,2); imagesc(V'*V); axis image square; colorbar; title('V^TV');
subplot(1,3,3); imagesc(S); axis image square; colorbar; title('S');

Notice that U'*U and V'*V are identity matrices. This makes sense because the sum of the squares of the coordinates of a unit-length vector equals one and because the dot product of orthogonal vectors equals zero.
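
As a side note on matrix sizes, the following sketch assumes the same kind of random 10 x 3 matrix as above. It contrasts the full SVD with the economy-size SVD obtained by calling svd(X,0). Both reproduce X; they differ only in the sizes of U and S.

```matlab
% A sketch under stated assumptions: random 10 x 3 matrix.
rng(0);
X = randn(10,3);
[U,S,V] = svd(X);                         % full SVD: U is 10 x 10, S is 10 x 3, V is 3 x 3
[Ue,Se,Ve] = svd(X,0);                    % economy-size SVD: Ue is 10 x 3, Se is 3 x 3
fullerr = max(max(abs(U*S*V' - X)));      % reconstruction error, full SVD
econerr = max(max(abs(Ue*Se*Ve' - X)));   % reconstruction error, economy-size SVD
fprintf('sizes of U: %s vs %s; errors: %.2g vs %.2g\n', ...
        mat2str(size(U)),mat2str(size(Ue)),fullerr,econerr);
```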

Covariance matrices

Suppose X represents a set of linear regressors. That is, suppose the dimensions of X are m x n, representing n regressors defined in an m-dimensional space. Then, X'*X is an n x n covariance matrix, representing the covariance of each regressor with each other regressor. (Technically, this is not quite correct: true covariance involves subtracting off the mean of each regressor before calculating the pairwise dot-products and dividing the results by the number of elements in each regressor (m). Nevertheless, it is still useful to think of X'*X as a covariance matrix.)
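
The caveat in parentheses can be made concrete. In the sketch below, the data and the column offsets are assumptions. We subtract the mean of each regressor, compute the pairwise dot products, and divide by m, and then compare the result against the built-in cov function. (By default cov divides by m-1; passing 1 as the second argument makes it divide by m.)

```matlab
% A sketch under stated assumptions: synthetic regressors with non-zero means.
rng(0);
X = bsxfun(@plus, randn(200,3)*randn(3,3), [5 -1 2]);
m = size(X,1);
raw = X'*X;                               % the "covariance matrix" in the loose sense
Xc = bsxfun(@minus, X, mean(X,1));        % subtract the mean of each regressor
C = (Xc'*Xc) / m;                         % covariance in the strict sense (dividing by m)
err = max(max(abs(C - cov(X,1))));        % compare against the built-in function
fprintf('mismatch with cov(X,1) = %.2g\n',err);
```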

SVD applied to covariance matrices

SVD provides a useful decomposition of covariance matrices:
X'*X = (U*S*V')'*(U*S*V') = V*S'*U' * U*S*V' = V*S.^2*V'
where S.^2 is a diagonal matrix consisting of zeros everywhere except for non-negative values along the diagonal (these values are the square of the values in the original S matrix).

X'*X is an n x n matrix that can be interpreted as applying a linear operation to n-dimensional space. For instance, suppose we have a vector v with dimensions n x 1. This vector corresponds to a specific point in n-dimensional space. If we calculate (X'*X)*v, we obtain a new vector corresponding to a new point in n-dimensional space.

The SVD of X'*X gives us insight into the nature of the linear operation. Specifically, it tells us that the linear operation consists of three successive operations: a rotation, a scaling, and then an undoing of the original rotation:

1. The initial rotation comes from multiplication with V'. This is because the columns of V form a complete orthonormal basis (each column is a unit-length vector and all columns are orthogonal to one another). Multiplication of V' and v projects the vector v onto the basis. Change of basis simply rotates the coordinate system.
2. The scaling comes from multiplication with S.^2. Since S.^2 is a diagonal matrix, multiplication with S.^2 simply stretches the coordinate system along the axes of the coordinate system.
3. The final rotation comes from multiplication with V. This multiplication undoes the rotation that was performed by the initial multiplication with V'.
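
The three operations can be applied one at a time to an arbitrary vector and compared with direct multiplication by X'*X. This is a minimal sketch; the data and the vector v are assumptions chosen for illustration.

```matlab
% A sketch under stated assumptions: synthetic data and an arbitrary vector v.
rng(0);
X = randn(1000,2) * chol([1 .6; .6 1]);
[U,S,V] = svd(X,0);
v = [1; 2];                               % arbitrary point in 2-dimensional space
step1 = V'*v;                             % (1) rotate: express v in the basis given by V
step2 = (S'*S)*step1;                     % (2) scale along the coordinate axes
step3 = V*step2;                          % (3) un-rotate
direct = (X'*X)*v;                        % the linear operation applied in one step
err = max(abs(step3 - direct)) / max(abs(direct));
lengthchange = abs(norm(step1) - norm(v));      % step (1) should not change the length
fprintf('relative mismatch = %.2g, length change in step 1 = %.2g\n',err,lengthchange);
```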

% Let's see an example.
X = unitlength(zeromean(randnmulti(1000,[],[1 .6; .6 1]),1),1);
figure; setfigurepos([100 100 300 300]);
imagesc(X'*X,[0 1]);
axis image square;
colorbar;
title('Covariance matrix: X^TX');

% In this example there are 2 regressors defined in a 1000-dimensional space.
% The regressors are moderately correlated with one another, as can be seen
% in the relatively high off-diagonal term in the covariance matrix.

% Let's use SVD to interpret the operation performed by the covariance matrix.
[U,S,V] = svd(X);
angs = linspace(0,2*pi,100);
vectors = {};
circlevectors = {};
vectors{1} = eye(2);
circlevectors{1} = [cos(angs); sin(angs)];
vectors{2} = V'*vectors{1};
circlevectors{2} = V'*circlevectors{1};
vectors{3} = S'*S*vectors{2};
circlevectors{3} = S'*S*circlevectors{2};
vectors{4} = V*vectors{3};
circlevectors{4} = V*circlevectors{3};
% now perform plotting
figure; setfigurepos([100 100 600 200]);
colors = 'rg';
titles = {'Original space' '(1) Rotate' '(2) Scale' '(3) Un-rotate'};
for p=1:4
  subplot(1,4,p);
  axis([-1.8 1.8 -1.8 1.8]); axis square; axis off;
  for q=1:2
    drawarrow(zeros(1,2),vectors{p}(:,q)',[colors(q) '-'],[],5);
  end
  plot(circlevectors{p}(1,:),circlevectors{p}(2,:),'k-');
  title(titles{p});
end

% The first plot shows a circle in black and equal-length vectors oriented
% along the coordinate axes in red and green. The second plot shows what
% happens after the initial rotation operation. The third plot shows what
% happens after the scaling operation. The fourth plot shows what happens
% after the final rotation operation. Notice that in the end result, the
% red and green vectors are no longer orthogonal and the circle is now an ellipse.

% Let's visualize the linear operation once more, but this time let's use
% the red and green vectors to represent the columns of the V matrix.
vectors = {};
vectors{1} = V;
vectors{2} = V'*vectors{1};
vectors{3} = S'*S*vectors{2};
vectors{4} = V*vectors{3};
% run the "now perform plotting" section above

% Compare the first plot to the last plot. Notice that the red and green vectors
% have not changed their angle but only their magnitude --- this reflects the
% fact that the columns of the V matrix are eigenvectors of the covariance matrix.
% Multiplication by the covariance matrix scales the space along the directions
% of the eigenvectors, which has the consequence that only the magnitudes of
% the eigenvectors, and not their angles, are modified.
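
The eigenvector property can also be checked numerically with built-in functions. The sketch below assumes synthetic data. It confirms that multiplying each column of V by X'*X only rescales that column by the corresponding squared singular value, and that these squared singular values match the eigenvalues returned by eig.

```matlab
% A sketch under stated assumptions: synthetic data, built-in functions only.
rng(0);
X = randn(1000,2) * chol([1 .6; .6 1]);
[U,S,V] = svd(X,0);
A = X'*X;
A = (A + A')/2;                           % enforce exact symmetry so that eig returns real values
lhs = A*V;                                % the matrix applied to each column of V
rhs = V*(S'*S);                           % each column of V scaled by its squared singular value
err = max(max(abs(lhs - rhs))) / max(abs(A(:)));
evals = sort(eig(A),'descend');           % eigenvalues, largest first
evalerr = max(abs(evals - diag(S).^2)) / max(evals);
fprintf('eigenvector mismatch = %.2g, eigenvalue mismatch = %.2g\n',err,evalerr);
```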

Importance of error bars on data points when evaluating model quality

2011-12-25T23:41:00.000-08:00

When measuring some quantity, the same value might not be obtained across repeated measurements --- in other words, noise may exist in the measurements. Some portion of the noise might be due to factors that can be controlled for, and if these factors are controlled for, then the noise level could potentially be reduced. However, for the sake of the present discussion, let's assume that the noise present in a given situation can neither be controlled nor predicted.

When evaluating how well a model characterizes a given dataset, it is important to take into account the intrinsic noisiness of the data. This is because a model cannot be expected to predict all of the data, as doing so would imply that the model predicts even the portion of the data that is due to noise (which, by our definition, is unpredictable).

Let's illustrate these ideas using a simple set of examples:

The top-left scatter plot shows some data (black dots) and a linear model (red line). The model does fairly well capturing the dependency of the y-coordinate on the x-coordinate. However, there are no error bars on the data points, so we lack critical information. Our interpretation of the model and the data will be quite different, depending on the size of the error bars.

The upper-right scatter plot shows one possible case (case 1): the error on the data points is relatively small. In this case there is a large amount of variance in the data that is both real (i.e. not simply attributable to noise) and not accounted for by the model.

The lower-left scatter plot shows another possible case (case 2): the error on the data points is relatively large. In this case there appears to be very little variance in the data that is both real and not accounted for by the model.

The lower-right scatter plot shows one last case (case 3): the error on the data points is extremely large. This case is in fact nearly impossible, and suggests that the error bar estimates are inaccurate. We will examine this case in more detail later.

CODE

% CASE A:
% We have a model and it is the optimal characterization of the data.
% The only failure of the model is its inability to predict the
% noise in the data. (Note that this case corresponds to case 2
% in the initial set of examples.)
noiselevels = [10 2];
figure(999); setfigurepos([100 100 500 250]); clf;
for p=1:length(noiselevels)
  x = linspace(1,9,50);
  y = polyval([1 4],x);
  subplot(1,2,p); hold on;
  measurements = bsxfun(@plus,y,noiselevels(p)*randn(10,50)); % 10 measurements per data point
  mn = mean(measurements,1);
  se = std(measurements,[],1)/sqrt(10);
  h1 = scatter(x,mn,'k.');
  h2 = errorbar2(x,mn,se,'v','k-');
  if p==1
    ax = axis; % record the axis range of the first panel before it is used below
  end
  h3 = plot(ax(1:2),polyval([1 4],ax(1:2)),'r-');
  xlabel('x'); ylabel('y');
  legend([h1 h2(1) h3],{'Data' 'Error bars' 'Model'},'Location','SouthEast');
  r2 = calccod(y,mn);
  dist = calccod(repmat(y,[10000 1]),bsxfun(@plus,y,sqrt(mean(se.^2))*randn(10000,50)),2);
  title(sprintf('R^2 = %.0f%%; maximum R^2 = [%.0f%% %.0f%%]',r2,prctile(dist,2.5),prctile(dist,97.5)));
  axis(ax); % use the same axis range in both panels
end

% We have conducted two simulations. In both simulations, there is a simple linear
% relationship between x and y. Assuming this linear relationship, we have generated
% 10 measurements of y for various fixed values of x, and we have plotted the mean and
% standard error of these measurements. In the simulation on the left, the noise level
% is relatively high, whereas in the simulation on the right, the noise level is relatively
% low. The R^2 between the model and the data is 35% and 92% for the first and
% second simulations, respectively. (For now, ignore the reported "maximum R^2"
% values, as these will be explained later.)
%
% What these simulations demonstrate is that (1) an optimal model can produce very
% different R^2 values, depending on the level of noise in the data and (2) if a model
% passes through or close to the error bars around most data points, the model may be
% nearly, if not fully, optimal.

% CASE B:
% We have a model and it characterizes the data only to a limited extent.
% The model fails to characterize aspects of the data that do not
% appear to be merely due to measurement noise. (Note that this case
% corresponds to case 1 in the initial set of examples.)
noiselevel = 2;
figure(998); setfigurepos([100 100 250 250]); clf; hold on;
x = linspace(1,9,50);
y = polyval([1 4],x);
measurements = bsxfun(@plus,y + 4*randn(1,50),noiselevel*randn(10,50));
mn = mean(measurements,1);
se = std(measurements,[],1)/sqrt(10);
h1 = scatter(x,mn,'k.');
h2 = errorbar2(x,mn,se,'v','k-');
h3 = plot(ax(1:2),polyval([1 4],ax(1:2)),'r-'); % ax is the axis range saved in CASE A
xlabel('x'); ylabel('y');
legend([h1 h2(1) h3],{'Data' 'Error bars' 'Model'},'Location','SouthEast');
r2 = calccod(y,mn);
dist = calccod(repmat(y,[10000 1]),bsxfun(@plus,y,sqrt(mean(se.^2))*randn(10000,50)),2);
title(sprintf('R^2 = %.0f%%; maximum R^2 = [%.0f%% %.0f%%]',r2,prctile(dist,2.5),prctile(dist,97.5)));

% In this simulation, there is an overall linear relationship between x and y
% (as indicated by the red line). However, this relationship does not fully
% characterize the y-values --- there are many data points that deviate
% substantially (many standard errors away) from the model. The R^2 between
% the model and the data is just 7%.
%
% Given the relatively low noise level in the data, it intuitively seems that
% a much higher R^2 value should be possible. For example, perhaps there is
% another regressor (besides the x-coordinate) that, if added to the model,
% would substantially improve the model's accuracy. Without getting into the
% issue of how to improve the current model, we can use a simple set of
% simulations to confirm that the current R^2 value is indeed not optimal.
% First, we assume that the current model's predictions of the data points
% are the true (noiseless) y-values. Then, we generate new sets of
% data by simulating measurements of the true y-values; for these
% measurements, we match the noise level to that observed in the actual
% dataset. Finally, we calculate the R^2 between the model and the
% simulated datasets. The 95% confidence interval on the resulting R^2
% values is indicated as the "maximum R^2" values in the title of the figure.
%
% The results of the simulations show that for our example,
% the R^2 value that we can expect to attain is at least 88%. This
% confirms our suspicion that the current R^2 value of 7% is suboptimal.
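
The simulation procedure described above can be sketched with built-in functions only. Note that r2fun is a hypothetical stand-in defined here for illustration (percent variance explained after subtracting the mean of the data); it is not claimed to match calccod exactly. The standard error is likewise an assumed value rather than one estimated from data, and percentiles are obtained by sorting instead of calling prctile.

```matlab
% A sketch under stated assumptions: r2fun and the standard error are hypothetical.
rng(0);
r2fun = @(model,data) 100*(1 - sum((data-model).^2,2) ./ ...
                           sum(bsxfun(@minus,data,mean(data,2)).^2,2));
x = linspace(1,9,50);
y = polyval([1 4],x);                     % model predictions, treated as the true y-values
se = 2/sqrt(10);                          % assumed standard error on each data point
nsim = 10000;                             % number of simulated datasets
simdata = bsxfun(@plus, y, se*randn(nsim,50));  % each row is one simulated dataset
dist = sort(r2fun(repmat(y,[nsim 1]),simdata)); % R^2 between model and each dataset
lo = dist(round(0.025*nsim));             % lower end of the 95% interval
hi = dist(round(0.975*nsim));             % upper end of the 95% interval
fprintf('maximum R^2 = [%.0f%% %.0f%%]\n',lo,hi);
```

If the R^2 actually observed falls well below this interval, the model is probably missing some real structure in the data.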

% CASE C:
% We have a model and it appears to characterize the data quite well.
% In fact, the regularity and predictability of the data appear to be
% higher than what would be expected based on the large error bars
% on the data points. (Note that this case corresponds to case 3 in the
% initial set of examples.)
noiselevel = 10;
figure(997); setfigurepos([100 100 250 250]); clf; hold on;
x = linspace(1,9,30);
y = polyval([1 4],x);
measurements = bsxfun(@plus,y,noiselevel*randn(10,30));
mn = mean(measurements,1);
se = std(measurements,[],1)/sqrt(10);
mn = .3*mn + .7*y; % a simple hack to make the data better than it should be
h1 = scatter(x,mn,'k.');
h2 = errorbar2(x,mn,se,'v','k-');
h3 = plot(ax(1:2),polyval([1 4],ax(1:2)),'r-'); % ax is the axis range saved in CASE A
xlabel('x'); ylabel('y');
legend([h1 h2(1) h3],{'Data' 'Error bars' 'Model'},'Location','SouthEast');
r2 = calccod(y,mn);
dist = calccod(repmat(y,[10000 1]),bsxfun(@plus,y,sqrt(mean(se.^2))*randn(10000,30)),2);
title(sprintf('R^2 = %.0f%%; maximum R^2 = [%.0f%% %.0f%%]',r2,prctile(dist,2.5),prctile(dist,97.5)));

% In this simulation, the model characterizes the data quite well,
% with an R^2 value of 80%. However, performing the same
% "maximum R^2" calculations described earlier, we find
% that the 95% confidence interval on the R^2 values that we
% can expect given the noise level in the data is [2%, 58%].
% This indicates that the model is, in a sense, doing too well.
% The problem might be due to inaccuracy of the error bar
% estimates: perhaps the error bar estimates are larger than
% they should be, or perhaps the errors on different data points
% are not independent of one another (the assumption of
% independence is implicit in the calculation of the
% maximum R^2 values). Resolving tricky problems like
% these requires detailed inspection of the data on a case-
% by-case basis.

Noise, model complexity, and overfitting

2011-12-14T22:11:00.000-08:00

When building models, ideally we would (1) have plenty of data available and (2) know what model to apply to the data. If these conditions hold, then we would simply need to estimate the parameters of the model and verify that the model is indeed accurate. The problem is that in reality, these conditions often do not hold. That is, we often have limited or noisy data and we often do not know what model is appropriate for the data at hand. In these cases, we are faced with the task of trying different models and determining which model is best.

It is useful to think of model complexity as a dimension along which models vary. Complex, flexible models have the potential to describe many different types of functions. The advantage of such models is that the true model (i.e. the model that most accurately characterizes the data) may in fact be contained in the set of models that can be described. The disadvantage of such models is that they have many free parameters, and it may be difficult to obtain good parameter estimates with limited or noisy data. (Or, stated another way: when fitting such models to limited or noisy data, there is a strong possibility of overfitting the data --- that is, a strong possibility that random variability in the data will have too much influence on parameter estimates and give rise to an inaccurate model.) On the other hand, simple, less flexible models describe fewer types of functions compared to complex models. The advantage of simple models is that they have fewer free parameters, and so it becomes feasible to obtain good parameter estimates with limited or noisy data. (Or, stated another way: there is low likelihood that simple models will overfit the data.) The disadvantage of simple models is that the types of functions that can be described may be poor approximations to the true underlying function.

A priori, it is impossible to say whether a complex or simple model will yield the most accurate model estimate for a given dataset (since it depends on the amount of data available, the nature of the effect, etc.). We must empirically try different models and evaluate their performance, e.g. using cross-validation. In this post, we provide a simple example illustrating these concepts.
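
To make the idea of empirical evaluation concrete, here is a minimal sketch that uses only the built-in polyfit and polyval functions. It fits polynomials of increasing degree to a training set and evaluates them on a separate testing set. The helper r2fun and the variable names are assumptions for illustration, and the underlying function is the quartic used in this post's simulation. The degree-10 fit may trigger a conditioning warning from polyfit.

```matlab
% A sketch under stated assumptions: r2fun is a hypothetical helper (percent variance explained).
rng(0);
truefun = @(x) x.^4 - 20*x.^3 + 5*x.^2 + 4*x + 1;
r2fun = @(pred,data) 100*(1 - sum((data-pred).^2) / sum((data-mean(data)).^2));
x = linspace(-2,2,30);                    % training dataset
y = truefun(x) + 50*randn(size(x));
xval = linspace(-2,2,300);                % testing dataset (independent noise)
yval = truefun(xval) + 50*randn(size(xval));
degs = [1 3 10];                          % maximum polynomial degrees to try
trainR2 = zeros(size(degs));
testR2 = zeros(size(degs));
for k = 1:numel(degs)
  p = polyfit(x,y,degs(k));               % least-squares fit on the training data
  trainR2(k) = r2fun(polyval(p,x),y);     % how well the fit describes the training data
  testR2(k) = r2fun(polyval(p,xval),yval);      % how well the fit generalizes
  fprintf('degree %2d: training R^2 = %.1f%%, testing R^2 = %.1f%%\n',degs(k),trainR2(k),testR2(k));
end
```

Training R^2 can only increase as the degree increases, since the models are nested; it is the testing R^2 that indicates which model is most accurate.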

CODE

% Construct training dataset.
x = linspace(-2,2,30);
modelfun = @(x) x.^4 - 20*x.^3 + 5*x.^2 + 4*x + 1;
y = feval(modelfun,x) + 50*randn(1,length(x));

% Construct testing dataset.
xval = repmat(linspace(-2,2,30),[1 10]);
yval = feval(modelfun,xval) + 50*randn(1,300);

% Define max polynomial degrees to try.
degstodo = 1:10;
degs = [1 3 10]; % display these

% Make a figure.
results = []; ax1 = []; ax2 = [];
figure(999); setfigurepos([100 100 800 600]); clf;
for dd=1:length(degstodo)
makemodel = @(x) catcell(2,arrayfun(@(n) x'.^n,0:degstodo(dd),'UniformOutput',0));
X = feval(makemodel,x);
recboot = fitprfstatic(X,y',0,0,[],100,[],[],[],@calccod); % bootstrap
rec     = fitprfstatic(X,y',0,0,[],0,[],[],[],@calccod);    % full fit
pred = feval(makemodel,xval)*rec.params';
results(dd,1) = rec.r; % training R^2
results(dd,2) = calccod(pred,yval'); % testing R^2
modelfits = [];
for p=1:size(recboot.params,1)
    modelfits(p,:) = X*recboot.params(p,:)';
end
iii = find(ismember(degs,degstodo(dd)));
if ~isempty(iii)
    subplot(length(degs),2,(iii-1)*2+1); hold on;
    h1 = scatter(x,y,'k.');
    mn = median(modelfits,1);
    se = stdquartile(modelfits,1,1);
    [d,ix] = sort(x);
    h2 = errorbar3(x(ix),mn(ix),se(ix),'v',[1 .8 .8]);
    h3 = plot(x(ix),mn(ix),'r-');
    h4 = plot(x(ix),feval(modelfun,x(ix)),'b-');
    uistack(h1,'top');
    xlabel('x'); ylabel('y');
    legend([h1 h3 h2 h4],{'Data' 'Model fit' 'Error bars' 'True model'},'Location','NorthEastOutside');
    ax1(iii,:) = axis;
    axis(ax1(1,:));
    title(sprintf('Max polynomial degree = %d',degstodo(dd)));

    subplot(length(degs),2,(iii-1)*2+2); hold on;
    bar(0:degstodo(dd),median(recboot.params,1));
    errorbar2(0:degstodo(dd),median(recboot.params,1),stdquartile(recboot.params,1,1),'v','r-');
    set(gca,'XTick',0:degstodo(dd));
    xlabel('Polynomial degree');
    ylabel('Estimated weight');
    ax = axis;
    axis([-1 max(degstodo)+1 ax(3:4)]);
    ax2(iii,:) = axis;
end
end

% We have created a dataset in which there is a nonlinear relationship
% between x and y. The observed data points are indicated by black dots,
% and the true relationship is indicated by a blue line. We have fit three
% different models to the data: a linear model (max polynomial degree of 1),
% a cubic model (max polynomial degree of 3), and a model that includes
% polynomials up to degree 10. For each model, we have bootstrapped the
% fits, allowing us to see the variability of the model fit (pink error bars
% around the red line), as well as the variability of the parameter
% estimates (red error bars on the vertical black bars).
%
% There are several important observations:
% 1. The cubic model gives the most accurate model estimate. The linear model
%    underfits the data, while the 10th degree model overfits the data.
% 2. In the cubic and 10th degree models, we estimate weights on higher
%    polynomials, and this changes the estimated weights on the lower
%    polynomials. The change in weights indicates that the lower and higher
%    polynomials have correlated effects in the data. In general, correlated regressors
%    are not necessarily problematic. However, in the given dataset, we do not
%    have much data available and the influence of the higher polynomials is
%    weak (in fact, there are no polynomials beyond degree 4 in the true model).
%    Thus, we are better off ignoring the higher polynomials than attempting to
%    include them in the model.

% Because the data were generated from a known model, we can compare
% the various models to the known model in order to see which model is
% most accurate. However, in general, we do not have access to the true,
% underlying model. Instead, what we can do is to use cross-validation to
% quantify model accuracy.
figure(998); clf; hold on;
h = plot(results,'o-');
h2 = straightline(calccod(feval(modelfun,xval),yval),'h','k-');
legend([h' h2],{'Training' 'Testing' 'Testing (true model)'});
set(gca,'XTick',1:length(degstodo),'XTickLabel',mat2cellstr(degstodo));
ax = axis;
axis([0 length(degstodo)+1 0 ax(4)]);
xlabel('Max polynomial degree');
ylabel('Variance explained (R^2)');

% In this figure, we see that increasing the maximum polynomial degree invariably
% increases performance on the training data. However, we see that increasing
% the maximum polynomial degree only increases performance on the testing data
% up to a degree of 3. Beyond degree 3, the additional higher-order polynomials
% cause the model to overfit the training data and produce model estimates
% that perform more poorly on the testing data. Thus, by using cross-validation we
% have found that the cubic model is the most accurate model.
%
% Notice that the performance of the cubic model is nevertheless still lower than
% that of the true model. In order to improve the performance of our model
% estimate, we either have to change our modeling strategy (e.g. change the
% model, use regularization in parameter estimation) or collect more data.
% But in the end, there is no magic bullet --- achieving imperfect models is
% an inevitable feature of model building!

Model accuracy vs. model reliability

2011-12-10T13:30:00.001-08:00

In previous posts, we looked at cross-validation and bootstrapping in the context of regression. Cross-validation and bootstrapping are similar in that both involve resampling the data and then fitting a model of interest. In cross-validation, we fit a model of interest to a subset of the data and then evaluate how well the model predicts the remaining data. In bootstrapping, we create bootstrap samples by drawing with replacement from the original data and then fit a model of interest to each bootstrap sample. However, despite this superficial similarity, the two methods have fundamentally different purposes: cross-validation quantifies the accuracy of a model whereas bootstrapping quantifies the reliability of a model.
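
To make the distinction concrete, here is a minimal sketch in base MATLAB. It is an illustration under stated assumptions, not the method used in the rest of this post: a straight line is fitted with polyfit, accuracy is measured by a single split-half cross-validation, and reliability is measured by the spread of the slope across 200 bootstrap samples. All variable names are hypothetical.

```matlab
% Minimal sketch (assumptions: polyfit as the fitting routine, one
% split-half cross-validation, 200 bootstrap samples).
n = 100;
x = rand(1,n)*10 - 5;
y = x.^2 - 3*x + 4 + randn(1,n);     % quadratic data; the model below is linear

% Cross-validation (accuracy): fit to one half, predict the other half.
ix = randperm(n);
trainix = ix(1:n/2);
testix  = ix(n/2+1:end);
p = polyfit(x(trainix),y(trainix),1);
pred = polyval(p,x(testix));
ytest = y(testix);
cvr2 = 100 * (1 - sum((ytest-pred).^2) / sum((ytest-mean(ytest)).^2));

% Bootstrapping (reliability): refit to samples drawn with replacement.
nboot = 200;
slopes = zeros(1,nboot);
for b = 1:nboot
    bx = randi(n,1,n);               % indices of one bootstrap sample
    pb = polyfit(x(bx),y(bx),1);
    slopes(b) = pb(1);
end
slopesd = std(slopes);

fprintf('cross-validated R^2 = %.1f, bootstrap SD of slope = %.3f\n',cvr2,slopesd);
```

The two numbers answer different questions: cvr2 tells us how well the fitted line predicts data it has not seen, whereas slopesd tells us how much the fitted line would change if we repeated the experiment.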

CODE

% Let's distill the distinction between accuracy and reliability
% down to its core and look at a very simple example.
figure(999); clf; hold on;
h1 = scatter(1,7,'ro');
h2 = scatter(1,4,'bo');
h3 = errorbar2(1,4,1,'v','b-');
axis([0 2 0 12]);
legend([h1 h2 h3],{'True model' 'Estimated model' 'Error bars'});
ylabel('Value');
set(gca,'XTick',[]);

% In this example, we have a single number indicated by the red dot,
% and we are trying to match this number with a model. Through some
% means we have estimated a specific model, and the prediction of the
% model is indicated by the blue dot. Moreover, through some means we
% have estimated error bars on the model's prediction, and this is
% indicated by the blue line.

% Now let's consider the accuracy and reliability of the estimated
% model. The accuracy of the model corresponds to how far the
% estimated model is away from the true model. The reliability
% of the model corresponds to how variable the estimated model is.
h4 = drawarrow([1.3 4.5],[1.03 4.52],'k-',[],10);
h5 = text(1.33,4.5,'Reliability');
h6 = plot([.95 .9 .9 .95],[7 7 4 4],'k-');
h7 = text(.88,5.5,'Accuracy','HorizontalAlignment','Right');

% Accuracy and reliability are not the same thing, although they do bear
% certain relationships to one another. For example, if reliability is
% low, then it is likely that accuracy is low. (Imagine that the error bar
% on a given model is very large. Then, we would expect that any given
% estimate of the model would not be well matched to the true model.)
% Conversely, if accuracy is high, then it is likely that reliability
% is also high. (If a model estimate predicts responses extremely
% well, then it is likely that the parameters of the model are well
% estimated.)
%
% However, an important case to keep in mind is that it is possible for a
% model to have high reliability but low accuracy. To see how this can
% occur, let's examine each possible configuration of accuracy and
% reliability.

% CASE 1: MODEL IS RELIABLE AND ACCURATE.
%   In this case, there are enough data to obtain good estimates of
%   the parameters of the model, and the model is a good description
%   of the data. Let's see an example (quadratic model fitted to
%   quadratic data).
x = rand(1,100)*14 - 8;
y = -x.^2 + 2*x + 4 + 6*randn(1,100);
rec = fitprfstatic([x.^2; x; ones(1,length(x))]',y',0,0,[],100,[],[],[],@calccod);
figure(998); clf; hold on;
h1 = scatter(x,y,'k.');
ax = axis;
xx = linspace(ax(1),ax(2),100);
X = [xx.^2; xx; ones(1,length(xx))]';
modelfits = [];
for p=1:size(rec.params,1)
modelfits(p,:) = X*rec.params(p,:)';
end
mn = median(modelfits,1);
se = stdquartile(modelfits,1,1);
h2 = errorbar3(xx,mn,se,'v',[.8 .8 1]);
h3 = plot(xx,mn,'b-');
h4 = plot(xx,-xx.^2 + 2*xx + 4,'r-');
uistack(h1,'top');
xlabel('x'); ylabel('y');
legend([h1 h4 h3 h2],{'Data' 'True model' 'Estimated model' 'Error bars'});
title('Model is reliable and accurate');

% CASE 2: MODEL IS RELIABLE BUT INACCURATE.
%   In this case, there are enough data to obtain good estimates of
%   the parameters of the model, but the model is a bad description
%   of the data. Let's see an example (linear model fitted to
%   quadratic data).
x = rand(1,100)*10 - 5;
y = x.^2 - 3*x + 4 + 1*randn(1,100);
rec = fitprfstatic([x; ones(1,length(x))]',y',0,0,[],100,[],[],[],@calccod);
figure(997); clf; hold on;
h1 = scatter(x,y,'k.');
ax = axis;
xx = linspace(ax(1),ax(2),100);
X = [xx; ones(1,length(xx))]';
modelfits = [];
for p=1:size(rec.params,1)
modelfits(p,:) = X*rec.params(p,:)';
end
mn = median(modelfits,1);
se = stdquartile(modelfits,1,1);
h2 = errorbar3(xx,mn,se,'v',[.8 .8 1]);
h3 = plot(xx,mn,'b-');
h4 = plot(xx,xx.^2 - 3*xx + 4,'r-');
uistack(h1,'top');
xlabel('x'); ylabel('y');
legend([h1 h4 h3 h2],{'Data' 'True model' 'Estimated model' 'Error bars'});
title('Model is reliable but inaccurate');

% CASE 3: MODEL IS UNRELIABLE BUT ACCURATE.
%   This is not a likely situation. Suppose there are insufficient data to
%   obtain good estimates of the parameters of a model. This implies that
%   the parameters would fluctuate widely from dataset to dataset, which in
%   turn implies that the predictions of the model would also fluctuate widely
%   from dataset to dataset. Thus, for any given dataset, it would be unlikely
%   that the predictions of the estimated model would be well matched to the data.

% CASE 4: MODEL IS UNRELIABLE AND INACCURATE.
%   In this case, there are insufficient data to obtain good estimates of
%   the parameters of the model, and this supplies a plausible explanation
%   for why the model does not describe the data well. (Of course, it could
%   be the case that even with sufficient data, the estimated model would
%   still be a poor description of the data; see case 2 above.) Let's see
%   an example of an unreliable and inaccurate model (Gaussian model
%   fitted to Gaussian data, but only a few noisy data points are available).
x = linspace(1,100,20);
y = evalgaussian1d([40 10 10 2],x) + 10*randn(1,20);
model = {[30 20 5 0] [-Inf 0 -Inf -Inf; Inf Inf Inf Inf] @(pp,xx) evalgaussian1d(pp,xx)};
rec = fitprfstatic(x',y',model,[],[],100,[],[],[],@calccod);
figure(996); clf; hold on;
h1 = scatter(x,y,'k.');
ax = axis;
xx = linspace(ax(1),ax(2),100);
modelfits = [];
for p=1:size(rec.params,1)
modelfits(p,:) = evalgaussian1d(rec.params(p,:),xx);
end
mn = median(modelfits,1);
se = stdquartile(modelfits,1,1);
h2 = errorbar3(xx,mn,se,'v',[.8 .8 1]);
h3 = plot(xx,mn,'b-');
h4 = plot(xx,evalgaussian1d([40 10 10 2],xx),'r-');
uistack(h1,'top');
xlabel('x'); ylabel('y');
legend([h1 h4 h3 h2],{'Data' 'True model' 'Estimated model' 'Error bars'});
title('Model is unreliable and inaccurate');

Using bootstrapping to quantify model reliability

2011-12-03T21:55:00.001-08:00

When fitting a model to data, the accuracy with which we can estimate the parameters of the model depends on the amount of data available and the level of noise in the data. To quantify the reliability of model parameter estimates, a simple yet effective approach is to use bootstrapping.
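
The core of the procedure fits in a few lines of base MATLAB. The sketch below is illustrative only: it assumes a straight-line model solved with the backslash operator (not the Gaussian model used in the example that follows), 1000 bootstrap samples, and a 68% interval read off the sorted estimates. The variable names are hypothetical.

```matlab
% Minimal sketch (assumptions: straight-line model, 1000 bootstraps,
% 68% interval taken from the 16th and 84th percentiles).
n = 50;
x = linspace(0,10,n)';
y = 2*x + 1 + 3*randn(n,1);
X = [x ones(n,1)];                   % design matrix: slope and intercept

nboot = 1000;
params = zeros(nboot,2);
for b = 1:nboot
    ix = randi(n,n,1);               % draw n data points with replacement
    params(b,:) = (X(ix,:) \ y(ix))';
end

% Sort each parameter across bootstraps and read off the percentiles.
sorted = sort(params,1);
lo  = sorted(round(0.16*nboot),:);
hi  = sorted(round(0.84*nboot),:);
med = median(params,1);

fprintf('slope:     %.2f (68%% interval %.2f to %.2f)\n',med(1),lo(1),hi(1));
fprintf('intercept: %.2f (68%% interval %.2f to %.2f)\n',med(2),lo(2),hi(2));
```

Using percentiles rather than the standard deviation means the interval does not assume that the bootstrap distribution is symmetric.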

CODE

% Here's an example dataset.
x = 1:100;
y = evalgaussian1d([40 10 10 2],x) + 4*randn(1,100);
figure(999); clf; hold on;
h1 = scatter(x,y,'k.');
xlabel('x'); ylabel('y');

% Let's presume that the model that describes these data is
% a one-dimensional Gaussian function. (Indeed, that's what
% we used to generate the data; see above.) Let's fit this
% model to the data using bootstrapping. Specifically, let's
% draw 100 bootstraps from the data points and fit the
% Gaussian model to each bootstrap.
model = {[30 20 5 0] [-Inf 0 -Inf -Inf; Inf Inf Inf Inf] @(pp,xx) evalgaussian1d(pp,xx)};
rec = fitprfstatic(x',y',model,[],[],100,[],[],[],@calccod);
modelfits = [];
for p=1:size(rec.params,1)
modelfits(p,:) = evalgaussian1d(rec.params(p,:),x);
end
h2 = plot(modelfits','r-');
uistack(h1,'top');

% Each red line is the model fit obtained from one of
% the bootstraps. By inspecting the variability of the
% red lines across bootstraps, we get a sense of the
% reliability of the model.

% For better visualization, let's compute the mean and standard
% deviation across bootstraps. This gives us the mean model
% fit and its standard error.
delete(h2);
mn = mean(modelfits,1);
se = std(modelfits,[],1);
h3 = errorbar3(x,mn,se,'v',[1 .7 .7]);
h4 = plot(x,mn,'r-');
uistack(h1,'top');

% So far we have been examining the variability of the model fit.
% But we may be interested in the variability of the model parameters
% (which underlie the model fit). Let's look at how model parameters
% vary from bootstrap to bootstrap.
figure(998); clf; hold on;
np = size(rec.params,2);
bar(1:np,median(rec.params,1));
for p=1:size(rec.params,2)
set(scatter(p*ones(1,100) - 0.1,rec.params(:,p),'r.'),'CData',[1 .7 .7]);
end
errorbar2(1:np,median(rec.params,1),stdquartile(rec.params,1,1),'v','r-','LineWidth',2);
set(gca,'XTick',1:np,'XTickLabel',{'Mean' 'Std Dev' 'Gain' 'Offset'});
xlabel('Parameter'); ylabel('Value');

% In this figure, the light dots indicate the parameters obtained in
% different bootstraps, the black bars indicate the median across
% bootstraps, and the red error bars indicate the 68% confidence
% intervals calculated using percentiles (these confidence intervals
% are analogous to plus and minus one standard error in the case
% of Gaussian distributions).

% There are two basic factors that determine the reliability of
% an estimated model: how many data points there are to fit the
% model and how noisy each individual data point is. First,
% let's look at an example that illustrates the impact of the
% number of data points.
nns = [20 80 320 1280];
noiselevels = [4 4 4 4];
figure(997); setfigurepos([100 100 500 400]); clf;
for p=1:length(nns)
x = 1 + rand(1,nns(p))*99;
y = evalgaussian1d([40 10 10 2],x) + noiselevels(p)*randn(1,nns(p));
subplot(2,2,p); hold on;
h1 = scatter(x,y,'k.');
xlabel('x'); ylabel('y');
if p == 1
    ax = axis;
end
model = {[30 20 5 0] [-Inf 0 -Inf -Inf; Inf Inf Inf Inf] @(pp,xx) evalgaussian1d(pp,xx)};
rec = fitprfstatic(x',y',model,[],[],50,[],[],[],@calccod);
modelfits = [];
xx = 1:100;
for q=1:size(rec.params,1)
    modelfits(q,:) = evalgaussian1d(rec.params(q,:),xx);
end
mn = mean(modelfits,1);
se = std(modelfits,[],1);
h3 = errorbar3(xx,mn,se,'v',[1 .7 .7]);
h4 = plot(xx,mn,'r-');
if p <= 2
    uistack(h1,'top');
end
axis(ax);
title(sprintf('Number of data points = %d, noise level = %.1f',nns(p),noiselevels(p)));
end

% Notice that as the number of data points increases, the reliability
% of the estimated model increases. Next, let's look at an example
% that illustrates the impact of the noise level.
nns = [80 80 80 80];
noiselevels = [4 2 1 0.5];
% (Re-run the loop above with these settings.)

% Notice that as the noise level decreases, the reliability of the
% estimated model increases.

Basics of regression and model fitting

2011-12-03T01:00:00.001-08:00

In regression, we build a model that uses one or more variables to predict some other variable. To understand regression, it is useful to play with simple two-dimensional data (where one variable is used to predict a second variable). An important aspect of regression is the use of cross-validation to evaluate the quality of different models.
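
Here is a minimal sketch of why fit quality alone cannot be used to compare models. It is an illustration under assumptions, not the code used below: the data are generated from a purely linear relationship, polyfit is the fitting routine, and r2 is a hypothetical helper returning percent variance explained.

```matlab
% Minimal sketch (assumptions: polyfit as the fitting routine, a true
% relationship that is purely linear; r2 is a hypothetical helper).
r2 = @(pred,y) 100 * (1 - sum((y-pred).^2) / sum((y-mean(y)).^2));
n = 50;
x = randn(1,n);
y = 3*x + 2 + 3*randn(1,n);          % no x^2 term in the true model

degs = [1 2];
fitr2 = zeros(1,2);
cvr2  = zeros(1,2);
for k = 1:2
    % Fit quality: evaluate on the same points used for fitting.
    fitr2(k) = r2(polyval(polyfit(x,y,degs(k)),x),y);
    % Prediction quality: leave-one-out cross-validation.
    pred = zeros(1,n);
    for i = 1:n
        keep = [1:i-1 i+1:n];
        pred(i) = polyval(polyfit(x(keep),y(keep),degs(k)),x(i));
    end
    cvr2(k) = r2(pred,y);
end
fprintf('fit R^2:             linear %.1f, quadratic %.1f\n',fitr2);
fprintf('cross-validated R^2: linear %.1f, quadratic %.1f\n',cvr2);
```

The quadratic fit R^2 is never lower than the linear fit R^2, even though the extra term is fitting noise. The cross-validated values carry no such built-in advantage for the more complex model.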

CODE

% Let's generate some data in two dimensions.
x = randn(1,200);
y = x.^2 + 3*x + 2 + 3*randn(1,200);
figure(999); clf; hold on;
h1 = scatter(x,y,'k.');
xlabel('x'); ylabel('y');

% Through inspection of the scatterplot, we see that there
% appears to be a nonlinear relationship between x and y.
% We would like to build a model that quantitatively characterizes
% this relationship.

% Let's consider two different models. One model is a purely linear
% model, y = a*x + b, where a and b are free parameters. The
% second model is a quadratic model, y = a*x^2 + b*x + c where
% a, b, and c are free parameters. We will assume that we want
% to minimize the squared error between the model and the data.
model1 = polyfit(x,y,1);
model2 = polyfit(x,y,2);
ax = axis;
xx = linspace(ax(1),ax(2),100);
h2 = plot(xx,polyval(model1,xx),'r-','LineWidth',2);
h3 = plot(xx,polyval(model2,xx),'g-','LineWidth',2);
axis(ax);
legend([h2 h3],{'Linear model' 'Quadratic model'});
title('Direct fit (no cross-validation)');

% Although the linear model captures the basic trend in the data, the
% quadratic model seems to characterize the data better. In particular, the
% linear model seems to overestimate the data at middle values of x and
% underestimate the data at low and high values of x.

% How can we formally establish which model is best? A simple approach
% is to quantify the fit quality of each model using a metric like R^2
% (see the blog post on R^2) and then determine which model has
% the higher fit quality. The problem with this approach is that the
% quadratic model will always outperform the linear model in
% terms of fit quality, even when the true underlying relationship
% is purely linear. The reason is that the quadratic model subsumes
% the linear model and includes one additional parameter (the
% weight on the x^2 term). Thus, the quadratic model will always do at
% least as well as the linear model and will do even better given the
% extra parameter (unless the weight on the x^2 term is estimated to be
% exactly zero).

% A better approach is to quantify the prediction quality of each model
% using cross-validation. This approach is exactly the same as the
% first approach, except that the quality of fit is evaluated on new
% data points that are not used to fit the parameters of the model. The
% intuition is that the fit quality on the data points used for training
% will, on average, be an overestimate of the true fit quality (i.e. the
% expected fit quality of the estimated model for data points drawn from
% the underlying data distribution).

% With cross-validation, there is no guarantee that the more complex
% quadratic model will outperform the linear model. The best performing
% model will be the one that, after parameter estimation, most closely
% characterizes the underlying relationship between x and y.

% Let's use cross-validation in fitting the linear and
% quadratic models to our example dataset. Specifically,
% let's use leave-one-out cross-validation, a method in which we
% leave a single data point out, fit the model on the remaining data
% points, predict the left-out data point, and then repeat this whole
% process for every single data point. We will use the metric of R^2
% to quantify how closely the model predictions match the data.
model1 = fitprfstatic([x; ones(1,length(x))]',y',0,0,[],-1,[],[],[],@calccod);
model2 = fitprfstatic([x.^2; x; ones(1,length(x))]',y',0,0,[],-1,[],[],[],@calccod);
figure(998); setfigurepos([100 100 600 300]); clf;
subplot(1,2,1); hold on;
h1 = scatter(x,y,'k.');
ax = axis;
[d,ii] = sort(x);
h2 = plot(x(ii),model1.modelfit(ii),'r-','LineWidth',2);
axis(ax);
xlabel('x'); ylabel('y');
title(sprintf('Linear model; cross-validated R^2 = %.1f',model1.r));
subplot(1,2,2); hold on;
h3 = scatter(x,y,'k.');
ax = axis;
[d,ii] = sort(x);
h4 = plot(x(ii),model2.modelfit(ii),'g-','LineWidth',2);
axis(ax);
xlabel('x'); ylabel('y');
title(sprintf('Quadratic model; cross-validated R^2 = %.1f',model2.r));

% In these plots, the red and green lines indicate the predictions
% of the linear and quadratic models, respectively. Notice that
% the lines are slightly wiggly; this reflects the fact that each
% predicted data point comes from a different model estimate.
% We find that the quadratic model achieves a higher
% cross-validated R^2 value than the linear model.

% Let's repeat these simulations for another dataset.
x = randn(1,50);
y = .1*x.^2 + x + 2 + randn(1,50);

% For this dataset, even though the underlying relationship between
% x and y is quadratic, we find that the quadratic model produces
% a lower cross-validated R^2 than the linear model. This indicates
% that we do not have enough data to reliably estimate the parameters
% of the quadratic model. Thus, we are better off estimating the linear
% model and using the linear model estimate as a description of the data.
% (For comparison purposes, the R^2 between the true model and the
% observed y-values, calculated as calccod(.1*x.^2 + x + 2,y), is 47.4.
% The linear model's R^2 is not as good as this, but is substantially
% better than the quadratic model's R^2.)

Fitting probability distributions to data

2011-11-27T20:17:00.001-08:00

Here we cover the very basics of probability distributions and how to fit them to data. We will see that one way to fit a probability distribution is to determine the parameters of the distribution that maximize the likelihood of the data.
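
As a preview, the following minimal sketch shows two facts that the example below relies on: the product of many likelihoods underflows, so we work with summed log likelihoods instead, and for a Gaussian the sample mean and the standard deviation normalized by n are the maximum likelihood estimates. The function handles gausspdf and negloglik are hypothetical names introduced for illustration.

```matlab
% Gaussian probability density function with parameters pp = [mean sd].
gausspdf  = @(x,pp) 1/(pp(2)*sqrt(2*pi)) * exp(-(x-pp(1)).^2/(2*pp(2)^2));
negloglik = @(x,pp) sum(-log(gausspdf(x,pp)));

data = randn(1,1000)*2.5 + 4;

% The raw likelihood is a product of 1000 numbers well below 1,
% which is too small to represent in double precision.
rawlik = prod(gausspdf(data,[4 2.5]));

% The summed negative log likelihood stays finite and well scaled.
nll = negloglik(data,[4 2.5]);

% Maximum likelihood estimates for a Gaussian: the sample mean and the
% standard deviation normalized by n (second argument of std set to 1).
mle = [mean(data) std(data,1)];
nllmle = negloglik(data,mle);

fprintf('raw likelihood = %g\n',rawlik);
fprintf('negative log likelihood at true parameters = %.2f\n',nll);
fprintf('negative log likelihood at MLE             = %.2f\n',nllmle);
```

The negative log likelihood at the estimated parameters is never larger than at the true parameters, because the estimates are chosen to fit this particular sample.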

CODE

% Let's look at a basic probability distribution, the Gaussian
% distribution. In the one-dimensional case, there are two parameters,
% the mean and the standard deviation. Let's see an example.
mn = 3;
sd = 1;
fun = @(x) 1/(sd*sqrt(2*pi)) * exp(-(x-mn).^2/(2*sd^2));

% We have defined a function that takes a value and returns
% the corresponding likelihood for a Gaussian distribution with
% mean 3 and standard deviation 1. This function is called
% a probability density function. Let's visualize it.
figure(999); clf; hold on;
xx = -1:.01:7;
h1 = plot(xx,feval(fun,xx),'r-','LineWidth',2);
xlabel('Value');
ylabel('Likelihood');
legend(h1,{'Probability density function'});

% For a Gaussian distribution, about 95% of the time, values drawn
% from the distribution will lie within two standard deviations
% of the mean. In our example, this range is between 1 and 5.
xx2 = linspace(mn-2*sd,mn+2*sd,100);
h2 = bar(xx2,feval(fun,xx2),1);
set(h2,'FaceColor',[.8 .8 .8]);
set(h2,'EdgeColor','none');
uistack(h1,'top');

% We have shaded in the area underneath the probability density function
% that lies between 1 and 5. If we were to calculate the actual area
% of this region, we would find that it is (approximately) 0.95.
% The total area underneath the probability density function is 1.
delete(h2);

% Now let's consider the concept of calculating the likelihood of a
% set of data given a particular probability distribution. Using
% our example Gaussian distribution (mean 3, standard deviation 1),
% let's calculate the likelihood of a sample set of data.
data = [2.5 3 3.5];
h3 = straightline(data,'v','b-');
likelihood = prod(feval(fun,data));
legend([h1 h3(1)],{'Probability density function' 'Data points'});
title(sprintf('Likelihood of data points = %.6f',likelihood));

% The likelihood of observing the data points 2.5, 3, and 3.5 is
% obtained by multiplying the likelihoods of each individual data
% point. (We are assuming that the data points are independent.)
% Now, for comparison, let's calculate the likelihood of a different
% set of data.
delete(h3);
data = [4 4.5 5];
h3 = straightline(data,'v','b-');
likelihood = prod(feval(fun,data));
legend([h1 h3(1)],{'Probability density function' 'Data points'});
title(sprintf('Likelihood of data points = %.6f',likelihood));

% Notice that the likelihood of this new set of data is much smaller
% than that of the original set of data.

% Now that we know how to calculate the likelihood of a set of data
% given a particular probability distribution, we can now think about
% how to actually fit a probability distribution to a set of data.
% All that we need to do is to set the parameters of the probability
% distribution such that the likelihood of the set of data is maximized.

% Let's do an example. We have several data points and we want to fit
% a univariate (one-dimensional) Gaussian distribution to the data.
% To determine the optimal mean and standard deviation parameters,
% let's use numerical optimization (fminsearch).
data = randn(1,100)*2.5 + 4;
fun = @(pp) sum(-log(1/(abs(pp(2))*sqrt(2*pi)) * exp(-(data-pp(1)).^2/(2*pp(2)^2))));
options = optimset('Display','iter','FunValCheck','on', ...
                   'MaxFunEvals',Inf,'MaxIter',Inf,'TolFun',1e-6,'TolX',1e-6);
params = fminsearch(fun,[0 1],options);

% What we have done is to use fminsearch to determine the mean and standard
% deviation parameters that minimize the sum of the negative log likelihoods
% of the data points. (Maximizing the product of the likelihoods of the data
% points is equivalent to maximizing the log of the product of the likelihoods
% of the data points, which is equivalent to maximizing the sum of the log of
% the likelihood of each data point, which is equivalent to minimizing the
% sum of the negative log likelihood of the data points. Or, semi-formally:
%   Let <ps> be a vector of likelihoods. Then,
%     max prod(ps) <=> max log(prod(ps))
%                  <=> max sum(log(ps))
%                  <=> min sum(-log(ps))
% Note that taking the log of the likelihoods helps avoid numerical
% precision issues.) Let's visualize the results.
mn = params(1);
sd = abs(params(2));
fun = @(x) 1/(sd*sqrt(2*pi)) * exp(-(x-mn).^2/(2*sd^2));
figure(998); clf; hold on;
h2 = straightline(data,'v','k-');
ax = axis;
xx = linspace(ax(1),ax(2),100);
h1 = plot(xx,feval(fun,xx),'r-','LineWidth',2);
axis([ax(1:2) 0 max(feval(fun,xx))*1.1]);
xlabel('Value');
ylabel('Likelihood');
legend([h2(1) h1],{'Data' 'Model'});

% Let's take the mean and standard deviation parameters that we found
% using optimization and compare them to the mean and standard deviation
% of the data points.
params
[mean(data) std(data,1)]

% Notice that the two sets of results are more or less identical (the
% difference can be attributed to the finite tolerance of the optimizer). Thus,
% we see that computing the mean and standard deviation of a set of
% data can be viewed as implicitly fitting a one-dimensional
% Gaussian distribution to the data.

Peeking at P-values

2011-11-21T20:46:00.001-08:00

Guest Post by Jon

A common experimental practice is to collect data and then do a statistical test to evaluate whether the data differs significantly from a null hypothesis. Sometimes researchers peek at the data before data collection is finished and do a preliminary analysis of the data. If a statistical test indicates a negative result, more data is collected; if there is a positive result, data collection is stopped. This strategy invalidates the statistical test by inflating the likelihood of observing a false positive.

In this post we demonstrate the amount of inflation obtained by this strategy. We sample from a distribution with a zero mean and test whether the sample mean differs from 0. As we will see, if we continually peek at the data, and then decide whether to continue data collection contingent on the partial results, we wind up with an elevated chance of rejecting the null hypothesis.
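
The logic of the simulation can be sketched without any toolbox functions. The version below is a simplification and not the code used for the figure in this post: it assumes that the population standard deviation is known to be 1, so a two-sided z-test with critical value 1.96 stands in for the t-test. The variable names are hypothetical.

```matlab
% Minimal sketch (assumption: known population SD of 1, so a two-sided
% z-test with critical value 1.96 replaces the t-test).
nexp = 2000;                                   % number of simulated experiments
nmax = 200;                                    % maximum number of samples
nmin = 10;                                     % first look at the data
x = randn(nmax,nexp);                          % the null hypothesis is true

ns = (1:nmax)';
cummean = cumsum(x,1) ./ repmat(ns,1,nexp);    % running mean of each experiment
z = cummean .* repmat(sqrt(ns),1,nexp);        % z-statistic after each sample
reject = abs(z) > 1.96;
reject(1:nmin-1,:) = false;                    % no testing before nmin samples

nopeek = mean(reject(end,:));                  % test once, at n = 200
peek   = mean(any(reject,1));                  % stop at the first rejection
fprintf('false positive rate: %.3f without peeking, %.3f with peeking\n',nopeek,peek);
```

Stopping at the first rejection means that an experiment counts as a false positive if the test is significant at any look, which is why the rate with peeking can only be equal to or higher than the rate without peeking.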

CODE

% Simulate 1000 experiments with 200 data points each
x = randn(200,1000);

% We expect about 5% false positives, given an alpha of 0.05
disp(mean(ttest(x, 0, 0.05)));

% Now let's calculate the rate of false positives for different sample
% sizes. We assume a minimum of 10 samples and a maximum of 200.
h = zeros(size(x));
for ii = 10:200
h(ii,:) = ttest(x(1:ii, :), 0, 0.05);
end

% The chance of a false positive is about 0.05, no matter how many data points are collected.
figure(99);
plot(1:200, mean(h, 2), 'r-', 'LineWidth', 2)
ylim([0 1])
ylabel('Probability of a false positive')
xlabel('Number of samples')

% How would the false positive rate change if we peeked at the data?
% To simulate peeking, we take the cumulative sum of h values for each
% simulation. The result of this is that if at any point we reject the null
% (h=1), the remaining points for that simulation also assume we rejected
% the null.
peekingH = logical(cumsum(h));

figure(99); hold on
plot(1:200, mean(peekingH, 2), 'k-', 'LineWidth', 2)
legend('No peeking', 'Peeking')

% The plot demonstrates the problem with peeking: we defined the likelihood
% of a false positive as our alpha value (here, 0.05), but we have created
% a false positive rate that is much higher.

Nonparametric probability distributions

2011-11-19T22:21:00.001-08:00

Parametric probability distributions (e.g. the Gaussian distribution) make certain assumptions about the distribution of a set of data. Instead of making these assumptions, we can use nonparametric methods to estimate probability distributions (such methods include bootstrapping, histograms, and kernel density estimation).
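
Both the histogram and the kernel density estimate must enclose an area of 1 to count as probability distributions. The minimal sketch below checks this numerically using base MATLAB only. It is an illustration under assumptions: the Gaussian kernel is written out explicitly instead of using a helper function, and the variable names are hypothetical.

```matlab
x = 4 + randn(1,100).^2 + 0.2*randn(1,100);

% Histogram scaled to a density: bar heights times bin width sum to 1.
[nn,cc] = hist(x,20);
binwidth = cc(2) - cc(1);
histdens = nn/sum(nn) / binwidth;
histarea = sum(histdens) * binwidth;

% Kernel density estimate: the average of Gaussian kernels, one per data point.
kernelwidth = 0.5;
xx = linspace(min(x)-5,max(x)+5,2000);
kde = zeros(size(xx));
for p = 1:length(x)
    kde = kde + 1/(kernelwidth*sqrt(2*pi)) * exp(-(xx-x(p)).^2/(2*kernelwidth^2));
end
kde = kde / length(x);
kdearea = trapz(xx,kde);                 % numerical integration

fprintf('histogram area = %.4f, kernel density area = %.4f\n',histarea,kdearea);
```

Averaging (rather than summing) the kernels is what keeps the area at 1, since each individual kernel already has unit area.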

CODE

% Generate some data points.
x = 4 + randn(1,100).^2 + 0.2*randn(1,100);

% Let's visualize the data.
figure(999); clf; hold on;
axis([0 12 0 0.8]); ax = axis;
h1 = straightline(x,'v','r-');
xlabel('Value');
ylabel('Probability');

% We have drawn a vertical line at each data point.
% Imagine that the vertical lines collectively
% define a probability distribution that has a value
% of 0 everywhere except for the data points
% where there are infinitely high spikes. This
% probability distribution is what the bootstrapping
% method is in fact implicitly using --- drawing
% random samples with replacement from the
% original data is equivalent to treating the data
% points themselves as defining the probability
% distribution estimate.

% Now let's construct a histogram.
[nn,cc] = hist(x,20);
binwidth = cc(2)-cc(1);
h2 = bar(cc,nn/sum(nn) * (1/binwidth),1);

% The histogram is a simple nonparametric method for
% estimating the probability distribution associated with
% a set of data. It is nonparametric because it makes
% no assumption about the shape of the probability
% distribution. Note that we have scaled the histogram
% in such a way as to ensure that the area occupied by
% the histogram equals 1, which is a requirement for
% true probability distributions.

% The simplest and most common method for summarizing a
% set of data is to compute the mean and standard deviation
% of the data. This can be seen as using a parametric
% probability distribution, the Gaussian distribution,
% to interpret the data. Let's make this connection
% explicit by plotting a Gaussian probability distribution
% whose mean and standard deviation are matched to that
% of the data.
mn = mean(x);
sd = std(x);
xx = linspace(ax(1),ax(2),100);
h3 = plot(xx,evalgaussian1d([mn sd 1/(sd*sqrt(2*pi)) 0],xx),'g-','LineWidth',2);

% Notice that the Gaussian distribution is reasonably
% matched to the data points and the histogram but tends
% to overestimate the data density at low values.

% The histogram is a crude method since the probability
% distributions that it creates have discontinuities (the
% corners of the black rectangles). To avoid discontinuities,
% we can instead use kernel density estimation. In this method,
% we place a kernel at each data point and sum across the
% kernels to obtain the final probability distribution estimate.
% Let's apply kernel density estimation to our example using
% a Gaussian kernel with standard deviation 0.5. (To ensure
% visibility, we have made the height of each kernel higher than
% it actually should be.)
kernelwidth = 0.5;
vals = [];
h4 = [];
for p=1:length(x)
vals(p,:) = evalgaussian1d([x(p) kernelwidth 1/(kernelwidth*sqrt(2*pi)) 0],xx);
h4(p) = plot(xx,1/10 * vals(p,:),'c-');
end
h5 = plot(xx,mean(vals,1),'b-','LineWidth',2);
% Note that we could have used MATLAB's ksdensity function to achieve
% identical results:
% vals = ksdensity(x,xx,'Bandwidth',0.5);  % named 'width' in older releases
% h5 = plot(xx,vals,'b-');

% Finally, add a legend.
legend([h1(1) h2(1) h3(1) h4(1) h5(1)],{'Data' 'Histogram' 'Gaussian' 'Kernels' 'Kernel Density Estimate'});

% Notice that the kernel density estimate is similar to the histogram, albeit smoother.

ADDITIONAL ISSUES

When constructing histograms, a free parameter is the number of bins to use. Typically, histograms are used as an exploratory method, so it would be quite acceptable to try out a variety of different values (so as to get a sense of what the data are like).

Notice that the number of bins used in a histogram is analogous to the width of the kernel used in kernel density estimation.

How do we choose the correct kernel width in kernel density estimation? If the kernel is too small, then we risk gaps in the resulting probability distribution, which is probably not representative of the true distribution. If the kernel is too large, then the resulting probability distribution will tend towards being uniform, which is also probably not representative of the true distribution. We will address selection of kernel width in a later post, but the short answer is that we can use cross-validation to choose the optimal kernel width.
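
One way this could look is sketched below. The choice of criterion is an assumption made for illustration: we score each candidate width by the leave-one-out log likelihood, i.e. the log of the density that each data point receives from kernels placed on all of the other data points. The candidate widths and variable names are hypothetical.

```matlab
% Minimal sketch (assumption: leave-one-out log likelihood as the
% cross-validation criterion; candidate widths are arbitrary).
x = 4 + randn(1,100).^2 + 0.2*randn(1,100);
n = length(x);
widths = [0.05 0.1 0.2 0.5 1 2 4];

d = repmat(x',1,n) - repmat(x,n,1);      % d(i,j) = x(i) - x(j)
loglik = zeros(size(widths));
for w = 1:length(widths)
    h = widths(w);
    K = 1/(h*sqrt(2*pi)) * exp(-d.^2/(2*h^2));
    K(1:n+1:end) = 0;                    % exclude each point's own kernel
    dens = sum(K,2) / (n-1);             % leave-one-out density at each point
    loglik(w) = sum(log(dens));          % -Inf if any point falls in a gap
end
[mx,best] = max(loglik);
bestwidth = widths(best);

disp([widths' loglik']);
fprintf('selected kernel width = %.2f\n',bestwidth);
```

Very small widths are penalized because isolated data points receive almost no density from their neighbors, and very large widths are penalized because the density is spread too thinly everywhere.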

Error bars on standard deviations

2011-11-12T17:17:00.000-08:00

We are often concerned with estimating the mean of a population. Given that we can obtain only a limited number of samples from the population, what we normally do is to compute the mean of the samples and then put an error bar on the mean.

Now, sometimes we might be interested in the standard deviation of the population. Notice that the exact same problem arises --- with a limited number of samples from the population, we can obtain an estimate of the standard deviation of the population (by simply computing the standard deviation of the samples), but this is just an estimate. Thus, an error bar can be put on the standard deviation, too.

CODE

% Let's perform a simulation. For several different sample sizes,
% we draw random samples from a normal distribution with mean 0
% and standard deviation 1. For each random sample, we compute
% the standard deviation of the sample. We then look at the
% distribution of the standard deviations across repeated simulations.
nn = 8;
ns = 2.^(1:nn);
cmap = cmapdistinct(nn);
figure(999); setfigurepos([100 100 700 250]); clf;
subplot(1,2,1); hold on;
h = []; sd = [];
for p=1:length(ns)
x = randn(100000,ns(p));
dist = std(x,[],2);
[n,x] = hist(dist,100);
h(p) = plot(x,n,'-','Color',cmap(p,:));
sd(p) = std(dist);
end
ax = axis;
legend(h,cellfun(@(x) ['n = ' x],mat2cellstr(ns),'UniformOutput',0));
xlabel('Standard deviation');
ylabel('Frequency');
title('Standard deviations obtained using different sample sizes');
subplot(1,2,2); hold on;
h2 = plot(1:nn,sd,'r-');
ax = axis; axis([0 nn+1 ax(3:4)]);
set(gca,'XTick',1:nn,'XTickLabel',mat2cellstr(ns));
xlabel('n');
ylabel('Standard deviation of distribution');
title('Spread of standard deviation estimates');

% Notice that with few data points, the standard deviations are highly variable.
% With increasing numbers of data points, the standard deviations are more tightly
% clustered around the true value of 1.
%
% To put some concrete numbers on this: at n = 32, the spread in the distribution
% (which we quantify using standard deviation) is about 0.12. This indicates that
% if we draw 32 data points from a normal distribution and compute the standard
% deviation of the data points, our standard deviation estimate is accurate only
% to about +/- 12%.
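
% As a cross-check on the simulation, for Gaussian data the spread of the
% sample standard deviation can also be written in closed form. The sketch
% below uses the standard c4 correction factor; the variable names (nsdemo,
% c4, sdexact, sdapprox) are ours and are not part of the code above.
nsdemo = 2.^(1:8);                     % same sample sizes as in the simulation
c4 = sqrt(2./(nsdemo-1)) .* exp(gammaln(nsdemo/2) - gammaln((nsdemo-1)/2));
sdexact = sqrt(1 - c4.^2);             % std of the sample std when the true std is 1
sdapprox = 1 ./ sqrt(2*(nsdemo-1));    % common approximation, reasonable for larger n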

FINAL OBSERVATIONS

Recall that the standard error of the mean for a random sample drawn from a Gaussian distribution is simply the standard deviation of the sample divided by the square root of the number of data points. Notice that in computing the standard error estimate, we are implicitly estimating the standard deviation of the population. But we have just seen that this standard deviation estimate may be somewhat inaccurate, depending on the number of data points. This implies that standard errors are themselves subject to noise.

To put it simply: we can put error bars on error bars. The error on error bars will tend to be high in the case of few data points.
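
One way to make this concrete is to bootstrap the standard error itself. The following is a minimal base-MATLAB sketch under our own assumptions (a hypothetical sample of 20 Gaussian data points and hypothetical variable names); it is meant only to illustrate the idea.

rng(1);                                      % fix the random seed for reproducibility
samp = randn(1,20);                          % hypothetical sample of 20 data points
nsamp = length(samp);
sem = std(samp) / sqrt(nsamp);               % usual standard error of the mean
nboot = 2000;
semboot = zeros(1,nboot);
for b=1:nboot
  ix = randi(nsamp,1,nsamp);                 % draw indices with replacement
  semboot(b) = std(samp(ix)) / sqrt(nsamp);  % standard error for this bootstrap sample
end
semerr = std(semboot);                       % error bar on the error bar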

Testing null hypotheses using randomization, the bootstrap, and Monte Carlo methods

2011-11-08T00:15:00.000-08:00

Earlier we saw how randomization can be used to test the null hypothesis that there is no meaningful order in a set of data. This is actually just a special case of the general statistical strategy of using randomness to see what happens under various null hypotheses. This general statistical strategy covers not only randomization methods but also Monte Carlo methods and the bootstrap. Let's look at some examples.

Example 1: Using randomization to test whether two sets of data are different (e.g. in their means)

In this example, we use randomization to check whether two sets of data come from different probability distributions. Let's pose the null hypothesis that the two sets of data actually come from the same probability distribution. Under this hypothesis, the two sets of data are interchangeable, so if we aggregate the data points and randomly divide the data points into two sets, then the results should be comparable to results obtained from the original data.

% Generate two sets of data.
x = randn(1,100) + 1.4;
y = randn(1,100) + 1;

% Let's aggregate the data, randomly divide the data, and compute the resulting
% differences in means. And for comparison, let's compute the actual difference in means.
[pval,val,dist] = randomization([x y],2,@(a) mean(a(1:100))-mean(a(101:200)),10000,0);

% Visualize the results.
figure(999); clf; hold on;
hist(dist,100);
h1 = straightline(val,'v','r-');
ax = axis; axis([-.8 .8 ax(3:4)]);
xlabel('Difference in the means');
ylabel('Frequency');
legend(h1,{'Actual observed value'});
title(sprintf('Results obtained using random assignments into two groups; p (two-tailed) = %.4f',pval));

% We see that the actual difference in means, 0.525, is quite unlikely with respect
% to the differences in means obtained via randomization. The p-value is 0.0002,
% referring to the proportion of the distribution that is more extreme than the
% actual difference in means. (Since we don't have an a priori reason to expect
% a positive or negative difference, we use a two-tailed p-value --- that is,
% not only do we count the number of values greater than 0.525, but we also count
% the number of values less than -0.525.) Thus, the null hypothesis is probably
% incorrect.

% Note that strictly speaking, we have not shown that the two sets of data come
% from probability distributions with different means --- all we have done is to
% reject the hypothesis that the sets of data come from the same probability
% distribution.
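
% For readers without the randomization helper, the same logic can be sketched
% in base MATLAB. This is our own minimal version (we do not assume that it
% matches the internals of randomization); the data and variable names are
% hypothetical.
rng(2);
xa = randn(1,100) + 1.4;
ya = randn(1,100) + 1;
pooled = [xa ya];                         % aggregate the two sets of data
obsdiff = mean(xa) - mean(ya);            % actual difference in means
nperm = 5000;
permdist = zeros(1,nperm);
for k=1:nperm
  ix = randperm(length(pooled));          % shuffle without replacement
  permdist(k) = mean(pooled(ix(1:100))) - mean(pooled(ix(101:200)));
end
pperm = sum(abs(permdist) >= abs(obsdiff)) / nperm;   % two-tailed p-value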

Example 2: Using the bootstrap to test whether two sets of data are different (e.g. in their means)

In the previous example, we posed the null hypothesis that the two sets of data come from the same distribution and then used randomization to generate new datasets. Notice that instead of using randomization to generate new datasets, we can use the bootstrap. The difference is that in the case of randomization, we enforce the fact that none of the data points are repeated within a set of data nor across the two sets of data, whereas in the case of the bootstrap, we do not enforce these constraints --- we generate new datasets by simply drawing data points with replacement from the original set of data. (The bootstrap strategy seems preferable as it probably generates more representative random samples, but this remains to be proven...)

% Let's go through an example using the same data from Example 1.
% Aggregate the data, draw bootstrap samples, and compute the resulting
% differences in means.
dist = bootstrap([x y],@(a) mean(a(1:100)) - mean(a(101:200)),10000);

% Compute the actual observed difference in means as well as the corresponding p-value.
val = mean(x) - mean(y);
pval = sum(abs(dist) > abs(val)) / length(dist);

% Visualize the results.
figure(998); clf; hold on;
hist(dist,100);
h1 = straightline(val,'v','r-');
ax = axis; axis([-.8 .8 ax(3:4)]);
xlabel('Difference in the means');
ylabel('Frequency');
legend(h1,{'Actual observed value'});
title(sprintf('Results obtained using bootstraps of aggregated data; p (two-tailed) = %.4f',pval));

% We find that the actual difference in means is quite unlikely with respect
% to the differences in means obtained via bootstrapping. Moreover, the p-value is
% quite similar to that obtained with the randomization method.
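
% The key step of the bootstrap is that indices are drawn with replacement
% (randi), whereas a randomization test shuffles them (randperm). Here is a
% base-MATLAB sketch with hypothetical data and variable names; we do not
% assume that it matches the internals of the bootstrap helper.
rng(3);
xb = randn(1,100) + 1.4;
yb = randn(1,100) + 1;
pooledb = [xb yb];                        % aggregate the two sets of data
obsdiffb = mean(xb) - mean(yb);           % actual difference in means
nbootb = 5000;
bootdist = zeros(1,nbootb);
for k=1:nbootb
  ix = randi(length(pooledb),1,length(pooledb));   % draw with replacement
  bootdist(k) = mean(pooledb(ix(1:100))) - mean(pooledb(ix(101:200)));
end
pboot = sum(abs(bootdist) > abs(obsdiffb)) / nbootb;   % two-tailed p-value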

Example 3: Using Monte Carlo simulations to test whether a sequence of numbers is different from noise

The randomization (Example 1) and bootstrap (Example 2) methods that we have seen depend on having an adequate number of data points. With sufficient data points, the empirical data distribution can serve as a reasonable proxy for the true data distribution. However, when the number of data points is small, the empirical data distribution may be a poor proxy for the true data distribution and so statistical procedures based on the empirical data distribution may suffer.

One way to reduce dependencies on the empirical data distribution is to bring in assumptions (e.g. priors) on what the true data distribution is like. For example, one simple assumption is that the data follow a Gaussian distribution. Let's look at an example.

% Suppose we observe the following data, where the x-axis is an
% independent variable (ranging from 1 to 5) and the y-axis
% is a dependent (measured) variable.
x = [1.5 3.2 2 2.5 4];
figure(997); clf; hold on;
scatter(1:5,x,'ro');
axis([0 6 0 6]);
xlabel('Independent variable');
ylabel('Dependent variable');
val = calccorrelation(1:5,x);
title(sprintf('r = %.4g',val));

% Although the observed correlation is quite high, how do we know
% the data aren't just random noise? Let's be more precise:
% Let's pose the null hypothesis that the data are simply random
% draws from a fixed Gaussian distribution. Under this hypothesis,
% what are correlation values that we would expect to obtain?
% For the sake of principle, let's take random draws from a
% Gaussian distribution matched to the observed data with respect
% to mean and standard deviation, although because correlation is
% insensitive to gain and offset, matching the observed data
% in terms of mean and standard deviation is not actually necessary.
fake = mean(x) + std(x)*randn(10000,5);
dist = calccorrelation(repmat(1:5,[10000 1]),fake,2);
pval = sum(dist > val) / length(dist);
figure(996); clf; hold on;
hist(dist,100);
h1 = straightline(val,'v','r-');
xlabel('Correlation');
ylabel('Frequency');
legend(h1,{'Actual observed correlation'});
title(sprintf('Results obtained using Monte Carlo simulations; p (one-tailed) = %.4f',pval));

% Notice that the p-value is about 0.10, which means that about 10% of the time,
% purely random data would result in a correlation value at least as high as the
% one actually observed. Thus, we cannot confidently reject the null hypothesis
% that the data are purely random Gaussian data.
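
% As a sanity check on the Monte Carlo result, the one-tailed p-value can also
% be computed analytically: under the null hypothesis of independent Gaussian
% data, r*sqrt(n-2)/sqrt(1-r^2) follows a t-distribution with n-2 degrees of
% freedom. The sketch below uses only base MATLAB (corrcoef and betainc) and
% hypothetical variable names; the betainc expression is valid for positive t.
xobs = [1.5 3.2 2 2.5 4];                 % the observed data from above
ccobs = corrcoef(1:5,xobs);
robs = ccobs(1,2);                        % observed correlation
df = length(xobs) - 2;                    % degrees of freedom
tstat = robs*sqrt(df) / sqrt(1-robs^2);   % t-statistic for the correlation
ponetail = 0.5 * betainc(df/(df+tstat^2),df/2,0.5);   % upper-tail probability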

Coefficient of determination (R^2)

2011-11-02T10:18:00.000-07:00

For the purposes of model evaluation, it is preferable to use coefficient of determination (R^2) as opposed to correlation (r) or squared correlation (r^2). This is because correlation implicitly includes offset and gain parameters in the model. Instead of using correlation, the proper approach is to estimate the offset and gain parameters as part of the model-building process and then to use a metric (like R^2) that is sensitive to offset and gain. Note that when using R^2, getting the offset and gain estimated accurately is extremely important (otherwise, you can obtain low or even negative R^2 values).

CONCEPTS

In a previous post on correlation, we saw that the correlation between two sets of numbers is insensitive to the offset and gain of each set of numbers. Moreover, we saw that if you are using correlation as a means for measuring how well one set of numbers predicts another, then you are implicitly using a linear model with free parameters (an offset parameter and a gain parameter). This implicit usage of free parameters may be problematic --- two potential models may yield the same correlation even though one model gets the offset and gain completely wrong. What we would like is a metric that does not include these implicit parameters. A good choice for such a metric is the coefficient of determination (R^2), which is closely related to correlation.

Let <model> be a set of numbers that is a candidate match for another set of numbers, <data>. Then, R^2 is given by:
100 * (1 - sum((model-data).^2) / sum((data-mean(data)).^2))
There are two main components of this formula. The first component is the sum of the squares of the residuals (which are given by model-data). The second component is the sum of the squares of the deviations of the data points from their mean (this is the same as the variance of the data, up to a scale factor). So, intuitively, R^2 quantifies the size of the residuals relative to the size of the data and is expressed in terms of percentage. The metric is bounded at the top at 100% (i.e. zero residuals) and is unbounded at the bottom (i.e. there is no limit on the size of the residuals). This stands in contrast to the metric of r^2, which is bounded between 0% and 100%.
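
To make the formula concrete, here is a minimal sketch in base MATLAB. The helper name cod and the toy numbers are hypothetical; the code in this post uses the calccod function, whose internals we do not assume here.

cod = @(model,data) 100 * (1 - sum((model-data).^2) / sum((data-mean(data)).^2));
datademo = [4 5 6 5 7]';
modeldemo = [4.5 5 5.5 5 6.5]';
R2demo = cod(modeldemo,datademo);         % 100*(1 - 0.75/5.2), i.e. about 85.6%
ccdemo = corrcoef(modeldemo,datademo);
r2demo = 100 * ccdemo(1,2)^2;             % squared correlation, in percent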

CODE

% Let's see an example of R^2 and r^2.
% Generate a sample dataset and a sample model.
data = 5 + randn(50,1);
model = .6*(data-5) + 4.5 + 0.3*randn(50,1);

% Now, calculate R^2 and r^2 and visualize.
R2 = calccod(model,data);
r2 = 100 * calccorrelation(model,data)^2;
figure(999); setfigurepos([100 100 600 220]); clf;
subplot(1,2,1); hold on;
h1 = plot(data,'k-');
h2 = plot(model,'r-');
axis([0 51 0 8]);
xlabel('Data point');
ylabel('Value');
legend([h1 h2],{'Data' 'Model'},'Location','South');
title(sprintf('R^2 = %.4g%%, r^2 = %.4g%%',R2,r2));
subplot(1,2,2); hold on;
h3 = scatter(model,data,'b.');
axis square; axis([0 8 0 8]); axissquarify; axis([0 8 0 8]);
xlabel('Model');
ylabel('Data');

% Notice that r^2 is larger than R^2. The reason for this is that r^2
% implicitly includes offset and gain parameters. To see this clearly,
% let's first apply an explicit offset parameter to the model and
% re-run the "calculate and visualize" code above.
model = model - mean(model) + mean(data);

% Notice that by allowing the model to get the mean right, we have
% closed some of the gap between R^2 and r^2. There is one more
% step: let's apply an explicit gain parameter to the model and
% re-run the "calculate and visualize" code above. (To do this
% correctly, we have to estimate both the offset and gain
% simultaneously in a linear regression model.)
X = [model ones(size(model,1),1)];
h = inv(X'*X)*X'*data;
model = X*h;

% Aha, the R^2 value is now the same as the r^2 value.
% Thus, we see that r^2 is simply a special case of R^2
% where we allow an offset and gain to be applied
% to the model to match the data. After applying
% the offset and gain, r^2 is equivalent to R^2.

% Note that because fitting an offset and gain will
% always reduce the residuals between the model and the data,
% r^2 will always be greater than or equal to R^2.
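
% The equivalence can be checked numerically with base MATLAB alone. This is
% a sketch with its own simulated data and hypothetical names (cod, dchk, mchk);
% the backslash operator is used here in place of inv(X'*X)*X' and gives the
% same least-squares solution.
rng(4);
cod = @(model,data) 100 * (1 - sum((model-data).^2) / sum((data-mean(data)).^2));
dchk = 5 + randn(50,1);
mchk = .6*(dchk-5) + 4.5 + 0.3*randn(50,1);
Xchk = [mchk ones(50,1)];
mfit = Xchk * (Xchk \ dchk);          % model after fitting offset and gain
ccchk = corrcoef(mchk,dchk);
r2chk = 100 * ccchk(1,2)^2;
R2before = cod(mchk,dchk);            % no greater than r2chk
R2after = cod(mfit,dchk);             % matches r2chk up to rounding error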

% In some cases, you might have a model that predicts the same
% value for all data points. Such cases are not a problem for
% R^2 --- the calculation proceeds in exactly the same way as usual.
% However, note that such cases are ill-defined for r^2, because
% the variance of the model is 0 and division by 0 is not defined.

% An R^2 value of 0% has a special meaning --- you achieve 0% R^2
% if your model predicts the mean of the data correctly and does
% nothing else. If your model gets the mean wrong, then you can
% quickly fall into negative R^2 values (i.e. your model is worse
% than a model that simply predicts the mean). Here is an example.
data = 5 + randn(50,1);
model = repmat(mean(data),[50 1]);
R2 = calccod(model,data);
figure(998); setfigurepos([100 100 600 220]); clf;
subplot(1,2,1); hold on;
h1 = plot(data,'k-');
h2 = plot(model,'r-');
axis([0 51 0 8]);
xlabel('Data point');
ylabel('Value');
legend([h1 h2],{'Data' 'Model'},'Location','South');
title(sprintf('R^2 = %.4g%%',R2));
subplot(1,2,2); hold on;
model = repmat(4.5,[50 1]);
R2 = calccod(model,data);
h1 = plot(data,'k-');
h2 = plot(model,'r-');
axis([0 51 0 8]);
xlabel('Data point');
ylabel('Value');
legend([h1 h2],{'Data' 'Model'},'Location','South');
title(sprintf('R^2 = %.4g%%',R2));

% Finally, note that R^2 is NOT symmetric (whereas r^2 is symmetric).
% The reason for this has to do with the gain issue. Let's see an example.
data = 5 + randn(50,1);
regressor = data + 4*randn(50,1);
X = [regressor ones(50,1)];
h = inv(X'*X)*X'*data;
model = X*h;
r2 = 100 * calccorrelation(model,data)^2;
figure(997); setfigurepos([100 100 600 220]); clf;
subplot(1,2,1); hold on;
R2 = calccod(model,data);
h1 = scatter(model,data,'b.');
axis square; axis([0 8 0 8]); axissquarify; axis([0 8 0 8]);
straightline(mean(data),'h','k-');
xlabel('Model');
ylabel('Data');
title(sprintf('R^2 of model predicting data = %.4g%%; r^2 = %.4g%%',R2,r2));
subplot(1,2,2); hold on;
R2 = calccod(data,model);
h2 = scatter(data,model,'b.');
axis square; axis([0 8 0 8]); axissquarify; axis([0 8 0 8]);
straightline(mean(model),'h','k-');
xlabel('Data');
ylabel('Model');
title(sprintf('R^2 of data predicting model = %.4g%%; r^2 = %.4g%%',R2,r2));

% On the left is the correct ordering where the model is attempting to
% predict the data. The variance in the data is represented by the scatter
% of the blue dots along the y-dimension. The green line is the model fit,
% and it does a modest job at explaining some of the variance along the
% y-dimension. It is useful to compare against the black line, which is the
% prediction of a baseline model that only and always predicts the mean of
% the data. On the right is the incorrect ordering where the data is
% attempting to predict the model. The variance in the model is represented
% by the scatter of the blue dots along the y-dimension. The green line is
% the model fit, and notice that it does a horrible job at explaining the
% variance along the y-dimension (and hence explains the very low R^2 value).
% The black line is much closer to the data points than the green line.
%
% Basically, what is happening is that the gain is correct in the case at
% the left but is incorrect in the case at the right. Because r^2
% normalizes out the gain, the r^2 value is the same in both cases.
% But because R^2 is sensitive to gain, it gives very different results.
% To fix the case at the right, the gain on the quantity being used for
% prediction (the x-axis) needs to be greatly reduced so as to better
% match the quantity being predicted (the y-axis).

Correlation (r)

2011-10-30T22:57:00.000-07:00

Correlation is a basic and fundamental concept. By 'correlation' I refer to Pearson's product-moment correlation (r), which I find to be most useful, but there are other versions of correlation. In a nutshell, correlation is a single number ranging from -1 to 1 that summarizes how well one set of numbers predicts another set of numbers.

CODE

% Let's start with a simple example.
data = randnmulti(100,[10 6],[1 .6; .6 1],[3 1]);
figure(999); clf; hold on;
scatter(data(:,1),data(:,2),'r.');
axis square; axis([0 20 0 20]);
set(gca,'XTick',0:1:20,'YTick',0:1:20);
xlabel('X'); ylabel('Y');
title('Data');

% To compute the correlation, let's first standardize each
% variable, i.e. subtract off the mean and divide by the
% standard deviation. This converts each variable into
% z-score units.
dataz = calczscore(data,1);
figure(998); clf; hold on;
scatter(dataz(:,1),dataz(:,2),'r.');
axis square; axis([-5 5 -5 5]);
set(gca,'XTick',-5:1:5,'YTick',-5:1:5);
xlabel('X (z-scored)'); ylabel('Y (z-scored)');

% We would like a single number that represents the relationship
% between X and Y. Points that lie in the upper-right or lower-left
% quadrants support a positive relationship between X and Y (i.e.
% higher on X is associated with higher on Y), whereas points that
% lie in the upper-left or lower-right quadrants support a negative
% relationship between X and Y (i.e. higher on X is associated with
% lower on Y).
set(straightline(0,'h','k-'),'Color',[.6 .6 .6]);
set(straightline(0,'v','k-'),'Color',[.6 .6 .6]);
text(4,4,'+','FontSize',48,'HorizontalAlignment','center');
text(-4,-4,'+','FontSize',48,'HorizontalAlignment','center');
text(4,-4,'-','FontSize',60,'HorizontalAlignment','center');
text(-4,4,'-','FontSize',60,'HorizontalAlignment','center');

% Let's calculate the "average" relationship. To do this, we compute
% the average product of the X and Y variables (in the z-scored space).
% The result is the correlation value.
r = mean(dataz(:,1) .* dataz(:,2));
title(sprintf('r = %.4g',r));
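
% Here is a base-MATLAB sketch of the same calculation, with simulated data
% and hypothetical variable names. One detail worth noting: the average
% product equals Pearson's r exactly when the z-scoring divides by n (as in
% std(x,1)); with the n-1 convention the result is scaled by (n-1)/n.
rng(5);
xz = randn(100,1);
yz = .6*xz + .8*randn(100,1);
zx = (xz - mean(xz)) / std(xz,1);     % z-score using the divide-by-n std
zy = (yz - mean(yz)) / std(yz,1);
rz = mean(zx .* zy);                  % average product of z-scores
ccz = corrcoef(xz,yz);                % built-in correlation for comparison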

% Now, let's turn to a different (but in my opinion more useful)
% interpretation of correlation. Let's start with a simple example.
data = randnmulti(100,[10 6],[1 .8; .8 1],[3 1]);

% For each set of values, subtract off the mean and scale the values such
% that the values have unit length (i.e. the sum of the squares of the
% values is 1).
datanorm = unitlength(zeromean(data,1),1);

% The idea is that each set of values represents a vector in a
% 100-dimensional space. After mean-subtraction and unit-length-normalization,
% the vectors lie on the unit sphere. The correlation is simply
% the dot-product between the two vectors. If the two vectors are
% very similar to one another, the dot-product will be high (close to 1);
% if the two vectors are not very similar to one another (think: randomly
% oriented), the dot-product will be low (close to 0); if the two vectors
% are very anti-similar to one another (think: pointing in opposite
% directions), the dot product will be very negative (close to -1).
% Let's visualize this for our example.
% The idea here is to project the 100-dimensional space onto
% two dimensions so that we can actually visualize the data.
% The first dimension is simply the first vector.
% The second dimension is the component of the second vector that
% is orthogonal to the first vector.
dim1 = datanorm(:,1);
dim2 = unitlength(projectionmatrix(datanorm(:,1))*datanorm(:,2));
% compute the coordinates of the two vectors in this reduced space.
c1 = datanorm(:,1)'*[dim1 dim2];
c2 = datanorm(:,2)'*[dim1 dim2];
% make a figure
figure(997); clf; hold on;
axis square; axis([-1.2 1.2 -1.2 1.2]);
h1 = drawarrow([0 0; 0 0],[c1; c2],[],[],[],'LineWidth',4,'Color',[1 .8 .8]);
h2 = drawellipse(0,0,0,1,1,[],[],'k-');
h3 = plot([c2(1) c2(1)],[0 c2(2)],'k--');
h4 = plot([0 c2(1)],[0 0],'r-','LineWidth',4);
xlabel('Dimension 1');
ylabel('Dimension 2');
legend([h1(1) h2 h4],{'Data' 'Unit sphere' 'Projection'});
r = dot(datanorm(:,1),datanorm(:,2));
title(sprintf('r = %.4g',r));

% The vector interpretation lends itself readily to the concept
% of variance explained. Suppose that we are using one vector A
% to predict the other vector B in a linear regression sense. Then,
% the weight applied to vector A is equal to the correlation
% value. (This is because inv(A'*A)*A'*B = inv(1)*A'*B = A'*B = r.)
% Let's visualize this.
figure(996); clf; hold on;
axis square; axis([-1.2 1.2 -1.2 1.2]);
h1 = drawarrow([0 0],c2,[],[],[],'LineWidth',4,'Color',[1 0 0]);
h2 = drawarrow([0 0],r*c1,[],[],[],'LineWidth',4,'Color',[0 1 0]);
h3 = plot([r r],[0 c2(2)],'b-');
h4 = drawellipse(0,0,0,1,1,[],[],'k-');
text(.3,.38,'1');
text(.35,-.1,'r');
xlabel('Dimension 1');
ylabel('Dimension 2');
legend([h1 h2 h3 h4],{'Vector B' 'Vector A (scaled by r)' 'Residuals' 'Unit sphere'});

% So, we ask, how much variance in vector B is explained by vector A?
% The total variance in vector B is 1^2. By the Pythagorean Theorem,
% the total variance in the residuals is 1^2 - r^2. So, the fraction
% of variance that we do not explain is (1^2 - r^2) / 1^2, and
% so the fraction of variance that we do explain is 1 - (1^2 - r^2) / 1^2,
% which simplifies to r^2. (Note that here, in order to keep things simple,
% we have invoked a version of variance that omits the division by the number
% of data points. Under this version of variance, the variance associated
% with a zero-mean vector is simply the sum of the squares of the elements
% of the vector, which is equivalent to the square of the vector's length.
% Thus, there is a nice equivalence between variance and squared distance,
% which we have exploited for our interpretation.)
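
% The Pythagorean argument can be verified numerically. This sketch uses base
% MATLAB only, with simulated data and hypothetical variable names.
rng(7);
va = randn(100,1);
vb = .8*va + .6*randn(100,1);
va = va - mean(va); va = va / norm(va);   % zero-mean, unit-length vector A
vb = vb - mean(vb); vb = vb / norm(vb);   % zero-mean, unit-length vector B
rv = va'*vb;                              % correlation as a dot product
resv = vb - rv*va;                        % residuals after scaling A by r
unexplained = sum(resv.^2);               % equals 1 - r^2
explained = 1 - unexplained;              % equals r^2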

FINAL REMARKS

There is much more to be said about correlation; we have covered here only the basics.

One important caveat to keep in mind is that correlation is not sensitive to offset and gain. This is because the mean of each set of numbers is subtracted off and because the scale of each set of numbers is normalized out. This has several consequences:

Correlation indicates how well deviations relative to the mean can be predicted. This is natural, since variance (normally defined) is also relative to the mean.
When using correlation as an index of how well one set of numbers predicts another set of numbers, you are implicitly allowing scale and offset parameters in your model. If you want to be completely parameter-free, you should instead use a metric like R^2 (which will be described in a later post).

Randomization tests

2011-10-30T12:45:00.000-07:00

Randomization is a simple and intuitive method for establishing statistical significance. We provide here two examples in which a set of data is randomly shuffled multiple times, the results of which demonstrate that the original order of the data is in fact special.

CODE

% Let's generate a time-series with an upward linear trend,
% corrupted by a large amount of noise.
x = (1:50) + 30*randn(1,50);

% Visualize the time-series.
figure(999); clf;
plot(x);
xlabel('Time');
ylabel('Value');
title('Time-series data');

% Suppose we didn't know that the data came from a model
% with an upward trend. How can we test whether the observed
% trend is statistically significant?

% Let's perform a randomization test. The logic is as follows:
% The null hypothesis is that the data points have no dependence on
% time. If the null hypothesis is true, then we can randomly permute
% the data points and this should produce datasets that are equivalent
% to the original dataset. So, we will generate random datasets,
% compute a statistic from these datasets, and compare these values to
% the statistic generated from the original dataset. Our statistic will
% be the correlation (Pearson's r) between the time-series and a diagonal line.
[pval,val,dist] = randomization(x,2,@(x) calccorrelation(x,1:length(x)),10000,1);

% What we have done is to compute correlation values for 10000 randomly permuted
% datasets. Let's look at the distribution of correlation values obtained,
% and let's see where the original correlation value lies.
figure(998); clf; hold on;
hist(dist,100);
h = straightline(val,'v','r-');
xlabel('Correlation');
ylabel('Frequency');
legend(h,{'Actual correlation'});
title('Distribution of correlation values obtained via randomization');

% The random correlation values are most of the time less than the
% actual correlation value. The proportion of random correlation values
% that exceed the actual correlation value is the p-value.
ax = axis;
text(val,1.05*ax(4),sprintf('p = %.4g',pval),'HorizontalAlignment','center');
axis([ax(1:2) ax(3) 1.1*ax(4)]);

% Given the unlikeliness of obtaining the correlation value that we did,
% we can conclude that the null hypothesis is probably false and that
% the data points probably do have a dependence on time. (Note that
% strictly-speaking, we cannot claim that the specific form of the
% time-dependence is a linear trend; we can claim only that there is
% some dependence on time.)

% Let's do another example. Generate points in a two-dimensional
% space with a weak correlation between the x- and y-coordinates.
data = randnmulti(500,[],[1 .1; .1 1]);

% Visualize the data.
figure(997); clf;
scatter(data(:,1),data(:,2),'r.');
xlabel('X');
ylabel('Y');
title(sprintf('Data (r = %.4g)',calccorrelation(data(:,1),data(:,2))));

% Let's perform a randomization test to see if there is a statistically
% significant relationship between the x- and y-coordinates. If there is
% no relationship, then we can randomly permute the x-coordinates
% relative to the y-coordinates and this should produce datasets
% that are equivalent to the original dataset. Our statistic of
% interest will be the correlation (Pearson's r) between x- and
% y-coordinates of each dataset.
[pval,val,dist] = randomization(data(:,1),1,@(x) calccorrelation(x,data(:,2)),10000,1);

% What we have done is to compute correlation values for 10000 randomly permuted
% datasets. Let's look at the distribution of correlation values obtained,
% and let's see where the original correlation value lies.
figure(996); clf; hold on;
hist(dist,100);
h = straightline(val,'v','r-');
xlabel('Correlation');
ylabel('Frequency');
legend(h,{'Actual correlation'});
title('Distribution of correlation values obtained via randomization');
ax = axis;
text(val,1.05*ax(4),sprintf('p = %.4g',pval),'HorizontalAlignment','center');
axis([ax(1:2) ax(3) 1.1*ax(4)]);

% The random correlation values are most of the time less than the
% actual correlation value. The proportion of random correlation values
% that exceed the actual correlation value is the p-value.
% The null hypothesis (that there is no dependence between the x-
% and y-coordinates) is probably false.
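
% For readers without the randomization and calccorrelation helpers, here is
% a base-MATLAB sketch of the second test. The simulated data and the variable
% names are hypothetical, and we do not assume that this matches the internals
% of randomization.
rng(6);
xp = randn(500,1);
yp = 0.1*xp + sqrt(1-0.1^2)*randn(500,1);   % weakly correlated with xp
ccp = corrcoef(xp,yp);
robsp = ccp(1,2);                           % actual correlation
npermp = 2000;
rperm = zeros(1,npermp);
for k=1:npermp
  ccp = corrcoef(xp(randperm(500)),yp);     % shuffle x relative to y
  rperm(k) = ccp(1,2);
end
ponep = sum(rperm > robsp) / npermp;        % one-tailed p-value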

Correlated regressors

2011-10-27T08:13:00.000-07:00

In ordinary least-squares linear regression, correlated regressors lead to unstable model parameter estimates. The intuition is that given two correlated regressors, it is difficult to determine how much of the data is due to one regressor and how much is due to the other. Let's look at this geometrically.

CODE

% Generate two regressors (in the columns of the matrix).
% These two regressors are relatively uncorrelated (nearly orthogonal).
X = [10 1;
   1 10];

% Generate some data (no noise has been added yet).
data = [24 25]';

% Simulate 100 measurements of the data (with noise added).
% For each measurement, estimate the weights on the regressors.
y = zeros(2,100);
h = zeros(2,100);
for rep=1:100
y(:,rep) = data + 2*randn(2,1);
h(:,rep) = inv(X'*X)*X'*y(:,rep);
end

% Estimate weights on the regressors for the case of no noise.
htrue = inv(X'*X)*X'*data;

% Now visualize the results
figure(999); clf; hold on;
h1 = scatter(y(1,:),y(2,:),'g.');
h2 = scatter(data(1),data(2),'k.');
axis square; axis([0 50 0 50]);
h3 = drawarrow(repmat([0 0],[2 1]),X','r-',[],10);
for p=1:size(X,2)
h4 = scatter(X(1,p)*h(p,:),X(2,p)*h(p,:),25,'gx');
h5 = scatter(X(1,p)*htrue(p),X(2,p)*htrue(p),'k.');
end
uistack(h3,'top');
h6 = drawarrow(X(:,1)',X(:,2)','b-',[],0);
xlabel('dimension 1');
ylabel('dimension 2');
legend([h1 h2 h3(1) h4 h5 h6], ...
       {'measured data' 'noiseless data' 'regressors' ...
        'estimated weights' 'true weights' 'difference between regressors'});

The green X's represent each regressor scaled by the weight estimated for that regressor in each of the 100 simulations. The X's are indicative of how reliably we can estimate the weights. In this example, the weights are estimated quite reliably (the spread of the X's is relatively small).

% Let's repeat the simulation but now with
% two regressors that are highly correlated.
X = [6 5;
   5 6];

In this example, the two regressors are highly correlated and weight estimation is unreliable. To understand why this happens, examine the difference between the regressors. Notice that the difference between the regressors is quite small. Noise in the data shifts the data along this difference, giving rise to substantially different parameter estimates. For example, if the measured data shifts towards the upper-left, then this tends to produce high weights for the upper-left regressor (and low weights for the bottom-right regressor); if the measured data shifts towards the bottom-right, then this tends to produce high weights for the bottom-right regressor (and low weights for the upper-left regressor).

OBSERVATIONS

The stability of model parameter estimates is determined (in part) by the amount of noise in the direction of the regressor difference, relative to the size of that difference. If the projection of the noise onto the regressor difference is small relative to the difference itself (as in the first example), then parameter estimates will tend to be stable; if the projection is large relative to the difference (as in the second example), then parameter estimates will tend to be unstable.
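
The same point can be made without simulation. For ordinary least-squares with independent noise of standard deviation sigma, the covariance of the weight estimates is sigma^2 * inv(X'*X). The following sketch applies this standard result to the two design matrices used above; the variable names are hypothetical.

sigma = 2;                                % noise standard deviation used in the simulations
Xa = [10 1; 1 10];                        % nearly orthogonal regressors
Xb = [6 5; 5 6];                          % highly correlated regressors
sda = sigma * sqrt(diag(inv(Xa'*Xa)));    % std of each weight estimate, about 0.20
sdb = sigma * sqrt(diag(inv(Xb'*Xb)));    % std of each weight estimate, about 1.42
conda = cond(Xa);                         % condition number, about 1.22
condb = cond(Xb);                         % condition number, 11

The weights for the correlated regressors are roughly seven times more variable, which is consistent with the larger spread of the green X's in the second simulation.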

So how can we obtain better parameter estimates in the case of correlated regressors? One solution is to use regularization strategies (which will be described in a later post).

Geometric interpretation of linear regression

2011-10-25T12:39:00.000-07:00

Linear regression can be given a nice intuitive geometric interpretation.

CODE

% Generate random data (just two data points)
y = 4+rand(2,1);

% Generate a regressor.
x = .5+rand(2,1);

% Our model is simply that the data can be explained, to some extent,
% by a scale factor times the regressor, i.e. y = x*c + n, where c is
% a free parameter and n represents the residuals. Normally, we
% want the best fit to the data, that is, we want the residuals to be
% as small as possible (in the sense that the sum of the squares of
% the residuals is as small as possible). This can be given
% a nice geometric interpretation: the data is just a single point
% in a two-dimensional space, the regressor is a vector in this space,
% and we are trying to scale the vector to get as close to the data
% point as possible.
figure(999); clf; hold on;
h1 = scatter(y(1),y(2),'ko','filled');
axis square; axis([0 6 0 6]);
h2 = drawarrow([0 0],x','r-');
xlabel('dimension 1');
ylabel('dimension 2');

% Let's estimate the weight on the regressor that minimizes the
% squared error with respect to the data.
c = inv(x'*x)*x'*y;

% Now let's plot the model fit.
modelfit = x*c;
h3 = drawarrow([0 0],modelfit','g-');
uistack(h3,'bottom');

% Calculate the residuals and show this pictorially.
residuals = y - modelfit;
h4 = drawarrow(modelfit',y','b-',[],0);
uistack(h4,'bottom');

% Put a legend up
legend([h1 h2(1) h3 h4],{'data' 'regressor' 'model fit' 'residuals'});

% OK. Now let's do an example for the case of two regressors.
% One important difference is that the model is now a weighted sum
% of the two regressors and so there are two free parameters.
y = 4+rand(2,1);
x = .5+rand(2,2);
figure(998); clf; hold on;
h1 = scatter(y(1),y(2),'ko','filled');
axis square; axis([0 6 0 6]);
h2 = drawarrow([0 0; 0 0],x','r-');
xlabel('dimension 1');
ylabel('dimension 2');
c = inv(x'*x)*x'*y;
modelfit = x*c;
h3 = drawarrow([0 0],modelfit','g-');
uistack(h3,'bottom');
residuals = y - modelfit;
h4 = drawarrow(modelfit',y','b-',[],0);
uistack(h4,'bottom');

% Each regressor makes a contribution to the final model fit,
% so let's plot the individual contributions.
h3b = [];
for p=1:size(x,2)
  contribution = x(:,p)*c(p);
  h3b(p) = drawarrow([0 0],contribution','c-');
end
uistack(h3b,'bottom');

% Finally, put a legend up
legend([h1 h2(1) h3b(1) h3 h4],{'data' 'regressors' 'contributions' 'model fit' 'residuals'});

OBSERVATIONS

In the first example, notice that the model fit lies at a right angle (i.e. is orthogonal) to the residuals. This makes sense because the model fit is the point (on the line defined by the regressor) that is closest in a Euclidean sense to the data.
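
The orthogonality can be checked numerically. The following is a minimal sketch that uses fixed, hypothetical values for the data and the regressor (rather than the random values generated above) so that the result is reproducible.

```matlab
% Hypothetical data point and regressor (fixed values for reproducibility).
y = [4.2; 4.7];
x = [1.2; 0.8];

% Least-squares weight (the same solution that the normal equations give).
c = x\y;

% Model fit and residuals.
modelfit = x*c;
residuals = y - modelfit;

% The dot product is zero (up to rounding error), so the model fit
% and the residuals are orthogonal.
dotproduct = modelfit'*residuals;
fprintf('dot product of model fit and residuals: %g\n',dotproduct);
```

Since the model fit is just a scaled version of the regressor, the residuals are orthogonal to the regressor itself as well.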

In the second example, the model fits the data perfectly because there are as many regressors as there are data points (and the regressors are not collinear).
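
Here is a small sketch of this point, again with hypothetical fixed values. It also includes a collinear pair of regressors to show what changes in that case. The use of pinv for the collinear case is an assumption made for illustration (the code above uses inv, which would fail on a singular matrix).

```matlab
% Hypothetical data point (two dimensions).
y = [4.2; 4.7];

% Two regressors (columns) that are not collinear.
x = [1.2 0.6;
     0.8 1.4];
c = x\y;
residuals = y - x*c;
fprintf('non-collinear: rank %d, residual norm %g\n',rank(x),norm(residuals));

% Two collinear regressors (the second is twice the first).
xcol = [1.2 2.4;
        0.8 1.6];
ccol = pinv(xcol)*y;   % minimum-norm least-squares solution
residualscol = y - xcol*ccol;
fprintf('collinear:     rank %d, residual norm %g\n',rank(xcol),norm(residualscol));
```

With collinear regressors, the model can reach only points on a single line, so the fit is no longer perfect unless the data happen to lie on that line.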

Error in two dimensions

2011-10-23T22:17:00.000-07:00

Regression normally attributes error to the dependent variable, but it is possible to fit regression models that attribute errors to both dependent and independent variables.

CODE

% Generate some random data points.
x = randn(1,1000);
y = randn(1,1000);

% Fit a line that minimizes the squared error between the y-coordinate of the data
% and the y-coordinate of the fitted line. Bootstrap to see the variability of the fitted line.
paramsA = bootstrp(20,@(a,b) polyfit(a,b,1),x',y');

% Fit a line that minimizes the sum of the squared distances between the data points
% and the line. Bootstrap to see the variability of the fitted line.
paramsB = bootstrp(20,@fitline2derror,x',y');

% Visualize the results.
figure(999); hold on;
scatter(x,y,'k.');
ax = axis;
for p=1:size(paramsA,1)
  h1 = plot(ax(1:2),polyval(paramsA(p,:),ax(1:2)),'r-');
  h2 = plot(ax(1:2),polyval(paramsB(p,:),ax(1:2)),'b-');
end
axis(ax);
xlabel('x');
ylabel('y');
title('Red minimizes squared error on y; blue minimizes squared error on both x and y');

% Now, let's repeat for a different dataset.
temp = randnmulti(1000,[],[1 .5; .5 1]);
x = temp(:,1)';
y = temp(:,2)';

OBSERVATIONS

First example: When minimizing error on y, the fitted lines tend to be horizontal. This is because the best that the model can do is to basically predict the mean y-value regardless of what the x-value is. When minimizing error on both x and y, all lines that pass through the center of the (circularly symmetric) cloud are basically equally bad, giving rise to wildly different line fits across different bootstraps.

Second example: In this example, minimizing error on y produces lines that are closer to horizontal than the lines produced by minimizing error on both x and y. Notice that the lines produced by minimizing error on both x and y are aligned with the intrinsic axes of the Gaussian cloud.
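
To see why the second type of fit aligns with the axes of the cloud, it helps to write out one way of computing it. The sketch below is not the fitline2derror function used above (its implementation is not shown here). It is a minimal stand-in that assumes the line minimizing perpendicular distance passes through the centroid of the data and points along the first principal axis, which can be obtained from the SVD of the mean-subtracted data. The data values are hypothetical.

```matlab
% Hypothetical example data (not from the post): noisy points near y = x.
x = [0 1 2 3 4 5]';
y = [0.5 0.4 2.6 2.5 4.6 4.4]';

% Fit 1: minimize squared error on y only (ordinary least squares).
paramsA = polyfit(x,y,1);                 % [slope intercept]

% Fit 2: minimize squared perpendicular distance to the line.
% The line passes through the centroid of the data and points along
% the first principal axis of the mean-subtracted data.
mu = [mean(x) mean(y)];
[U,S,V] = svd([x-mu(1) y-mu(2)],0);
slope = V(2,1)/V(1,1);                    % undefined for a vertical line
paramsB = [slope mu(2)-slope*mu(1)];      % [slope intercept]

% Sum of squared perpendicular distances from the data to a line.
perpdist = @(p) sum((y-polyval(p,x)).^2)/(1+p(1)^2);
fprintf('slope: %.3f (error on y), %.3f (error on x and y)\n',paramsA(1),paramsB(1));
fprintf('perpendicular error: %.3f vs. %.3f\n',perpdist(paramsA),perpdist(paramsB));
```

For these data, the fit that minimizes error on y has the shallower slope, consistent with the observation above.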

Error bar judgment

2011-10-23T20:19:00.000-07:00

Error bars are useful because they allow us to figure out how much of the data is signal and how much of the data is noise. We want to pay attention to aspects of the data that are real (i.e. outside of the error) and discount aspects of the data that are due to chance (i.e. within the error). Error bars that reflect +/- 1 standard error are surprisingly aggressive, in the sense that chance fluctuations often extend beyond them (see below).

CODE

% We are going to measure 40 different conditions.
% For the first twenty conditions, the true signal will be 0.
% For the second twenty conditions, the true signal will be 1.
numconditions = 40;

% We will make 30 different measurements for each condition.
n = 30;

% Let's perform a simulation, visualize the results,
% and then do it again (ad nauseam). Press any key to
% advance to the next simulation; press Ctrl-C to stop.
while 1

  % these are the true signal values
  signal = [zeros(1,numconditions/2) ones(1,numconditions/2)];

  % this is the noise (random Gaussian noise)
  noise = randn(n,numconditions);

  % these are the measurements
  measurement = bsxfun(@plus,signal,noise);

  % given the measurements, let's calculate the mean
  % and standard error for each condition.
  mn = mean(measurement,1);
  se = std(measurement,[],1)/sqrt(n);

  % now, let's visualize the results
  figure(999); clf; hold on;
  bar(1:numconditions,mn);
  errorbar2(1:numconditions,mn,se,'v','r-','LineWidth',2);
  plot(1:numconditions,signal,'b-','LineWidth',2);
  axis([0 numconditions+1 -1 2]);
  title('Black is the mean; red is the standard error; blue is the true signal');
  pause;

end

OBSERVATIONS

If you use +/- 1 standard error to visualize results, it may subjectively look like there are differences, even though there aren't any. For this reason, it may be useful to instead plot error bars that reflect +/- 2 standard errors.
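
To put a rough number on this, consider two conditions that have the same true signal and the same standard error. Assuming that the means are Gaussian-distributed and that the standard error is known exactly (an approximation, since the simulation above estimates it from 30 measurements), the difference between the two means has a standard deviation of sqrt(2) standard errors. Error bars of +/- k standard errors fail to overlap when the two means differ by more than 2k standard errors. The sketch below computes how often that happens by chance.

```matlab
% Half-width of the error bars, in units of standard error.
k = [1 2];

% Probability that the error bars of two conditions with identical true
% signal fail to overlap, i.e. P(|z| > 2k/sqrt(2)) for standard normal z.
% For a standard normal, P(|z| > t) = erfc(t/sqrt(2)).
pnooverlap = erfc((2*k/sqrt(2))/sqrt(2));

fprintf('+/- %d SE: error bars fail to overlap %.1f%% of the time\n',[k; 100*pnooverlap]);
```

Under these assumptions, about 16% of such pairs show non-overlapping error bars at +/- 1 standard error, compared to about 0.5% at +/- 2 standard errors.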

It is quite common to find measurements that are several error bars away from the true value. Of course, this is completely expected given the nature of standard errors (e.g. with Gaussian noise, roughly 5% of the time you will find a data point that is more than 2 standard errors away from the true value).
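
These expectations can be computed directly. The following sketch assumes that each measured mean is Gaussian-distributed around the true value with a known standard error (again an approximation to the simulation above), and reports how many of the 40 conditions we should expect to fall more than k standard errors from the truth.

```matlab
% Number of conditions in the simulation above.
numconditions = 40;

% Distance from the true value, in units of standard error.
k = [1 2 3];

% Two-tailed probability of exceeding k standard errors.
pbeyond = erfc(k/sqrt(2));

% Expected number of conditions exceeding each distance.
expectedcount = numconditions*pbeyond;

fprintf('more than %d SE away: %.1f%% of the time (about %.1f of %d conditions)\n', ...
        [k; 100*pbeyond; expectedcount; repmat(numconditions,1,numel(k))]);
```

So in a typical run of the simulation, about a third of the conditions are expected to lie more than 1 standard error from the true value, and one or two are expected to lie more than 2 standard errors away.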