Understanding the Mathematics behind Principal Component Analysis

In this post, we’re going to learn the foundations of a very famous and interesting dimensionality reduction technique known as principal component analysis (PCA).

Specifically, we’re going to learn what principal components are, how the variance in the data is concentrated in them, and how their orthogonality makes it easier to extract the important information.

In other words, principal component analysis (PCA) is a procedure for reducing the dimensionality of the variable space by representing it with a few orthogonal (uncorrelated) variables that capture most of its variability.

What is dimensionality reduction?

Reducing the dimensions of the feature space is called dimensionality reduction. It is needed when a dataset has so many features that it becomes hard to distinguish the important ones, which are relevant to the output, from the redundant or not-so-important ones.

Disadvantages of having a large feature space

More data is required to explore the vastly-broader feature space.
The learning algorithm may pick up spurious structure from redundant features and noise, and thus the model may overfit.
The algorithm’s simplicity is lost and the principle of Occam’s razor is violated.

There are many ways to achieve dimensionality reduction, but most of these techniques fall into one of two classes:

Feature Elimination: Eliminating features to reduce the feature space. As a disadvantage, information is lost due to dropped features.
Feature Extraction: In feature extraction, we create K “new” independent variables, where each independent variable is a combination of each of the given old independent variables.

PCA Algorithm

Principal component analysis is a technique for feature extraction. It combines our input variables in a specific way, at which point we can drop the least important of the new variables while still retaining the most valuable parts of all of the original variables. PCA produces new features that are uncorrelated with one another.

Steps involved in PCA

1. Standardization: Calculate the mean and standard deviation of every dimension of the dataset, except the labels. Then scale the data so that each variable contributes equally to the analysis:

$$z = \frac{x - \mu}{\sigma}$$

Here z is the scaled value, x is the initial value, and $\mu$ and $\sigma$ are the mean and standard deviation of the variable, respectively.
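To make the standardization step concrete, here is a minimal NumPy sketch. The toy matrix `X` (two features on very different scales) is made up for illustration and is not from any real dataset.

```python
import numpy as np

# Hypothetical toy dataset: 5 samples, 2 features on very different scales.
X = np.array([
    [170.0, 65000.0],
    [160.0, 48000.0],
    [180.0, 72000.0],
    [175.0, 51000.0],
    [165.0, 59000.0],
])

mu = X.mean(axis=0)     # per-feature mean
sigma = X.std(axis=0)   # per-feature (population) standard deviation
Z = (X - mu) / sigma    # z = (x - mu) / sigma, applied column by column

print(Z.mean(axis=0))   # approximately 0 for every feature
print(Z.std(axis=0))    # approximately 1 for every feature
```

After this step both columns live on the same scale, so neither one dominates purely because of its units.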

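2. Covariance Matrix Computation: We can compute the covariance of two variables X and Y using the following formula (shown here in its sample form, dividing by n - 1):

$$\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where n is the number of observations and $\bar{x}$ and $\bar{y}$ are the means of X and Y. Applying this formula to every pair of features gives the covariance matrix of the dataset A. The result is a square, symmetric matrix of d × d dimensions, where d is the number of features.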
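As a rough illustration of the covariance step, the sketch below builds the covariance matrix by hand and compares it with NumPy's `np.cov`. The random matrix `A` is a stand-in dataset, and the sample (n - 1) normalization is an assumption that matches `np.cov`'s default.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))            # hypothetical data: 100 samples, 3 features

n = A.shape[0]
centered = A - A.mean(axis=0)            # subtract each feature's mean
cov = centered.T @ centered / (n - 1)    # entry (i, j) is cov(feature i, feature j)

print(cov.shape)                                   # (3, 3): one row/column per feature
print(np.allclose(cov, np.cov(A, rowvar=False)))   # True: matches NumPy's estimate
```

The diagonal holds each feature's variance, and the matrix is symmetric because cov(X, Y) = cov(Y, X).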

3. Compute Eigenvectors and corresponding Eigenvalues: Wikipedia defines an eigenvector, or characteristic vector, of a linear transformation as a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue is the factor by which the eigenvector is scaled.

In general, an eigenvector of a matrix A is a nonzero vector v for which the following holds:

$$A v = \lambda v$$

where $\lambda$ is a scalar value called the ‘eigenvalue’. This means that applying the linear transformation A to v only scales v by $\lambda$, and the equation can be rewritten as:

$$(A - \lambda I) v = 0$$

where I is the identity matrix.

It’s important to notice that eigenvectors are conventionally normalized to unit length, i.e. each of them has length 1. These eigenvectors give us the directions of the patterns in the data, so that we can extract the most useful ones.
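The eigenvector definition can be checked numerically. The sketch below uses a small symmetric matrix `C` invented for this example (standing in for a covariance matrix) and `np.linalg.eigh`, which is NumPy's routine for symmetric matrices.

```python
import numpy as np

# Hypothetical 2x2 covariance matrix (symmetric by construction).
C = np.array([[2.0, 0.8],
              [0.8, 0.6]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(C)

for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    v = eigenvectors[:, i]
    print(np.allclose(C @ v, lam * v))         # True: C v = lambda v
    print(np.isclose(np.linalg.norm(v), 1.0))  # True: unit length
```

Note that an eigenvector's sign is arbitrary: if v satisfies the equation, so does -v.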

4. Choose k eigenvectors with the largest eigenvalues: Sort the eigenvectors in decreasing order of their eigenvalues and keep the first k, where k is the number of dimensions you wish to have in the new dataset.

What are principal components?

Principal components (PCs) are the new variables constructed as linear combinations of the initial features, such that these new variables are uncorrelated. Because the covariance matrix is symmetric, its eigenvectors are orthogonal, so the principal component directions are perpendicular to each other in Cartesian space.

To get the principal components in the order of their significance, we need to rank them in the order of their eigenvalues.

After choosing a few principal components, we form a matrix whose columns are the chosen eigenvectors; this matrix is called the feature vector.
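Ranking the components and building the feature vector might look like the sketch below. The matrix `C` and the choice `k = 2` are assumptions made for the example; `explained` is simply each eigenvalue's share of the total variance.

```python
import numpy as np

# Hypothetical 3x3 covariance matrix.
C = np.array([[4.0, 2.0, 0.6],
              [2.0, 2.0, 0.5],
              [0.6, 0.5, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(C)   # ascending order

order = np.argsort(eigenvalues)[::-1]           # indices from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]           # reorder columns to match

k = 2                                           # assumed number of components to keep
W = eigenvectors[:, :k]                         # feature vector: top-k eigenvectors as columns

explained = eigenvalues / eigenvalues.sum()     # fraction of variance per component
print(W.shape)                                  # (3, 2)
print(explained)
```

The eigenvalues sum to the trace of `C`, i.e. the total variance, which is why their ratios can be read as shares of variance.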

5. Recasting data along Principal Components’ axes

In the last step, we need to transform our samples onto the new subspace by re-orienting data from the original axes to the ones that are now represented by the principal components.

So lastly, we have computed principal components and projected the data points in accordance with the new axes.
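The projection step, and the claim that the resulting components are uncorrelated, can be sketched as follows. The correlated 2-D data is synthetic and only serves the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
# Hypothetical correlated data: second feature depends on the first.
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                      # center the data
C = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order]                   # all components, most important first

scores = Xc @ W                              # coordinates along the principal axes

# The covariance of the projected data is diagonal:
# off-diagonal entries are (numerically) zero, the diagonal holds the eigenvalues.
print(np.round(np.cov(scores, rowvar=False), 6))
```

Keeping only the first column of `scores` would give the one-dimensional representation that retains the most variance.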

Hence, to summarize PCA:

Scale the data by subtracting the mean and dividing by std. deviation.
Compute the covariance matrix.
Compute eigenvectors and the corresponding eigenvalues.
Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues; these become the principal components.
Project the data points onto the new axes defined by the principal components.
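Putting the summarized steps together, a compact from-scratch version could look like this. It is a sketch rather than a reference implementation: the function name `pca` and the synthetic input are assumptions, and it does not handle edge cases such as constant features.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # 1. standardize
    C = np.cov(Z, rowvar=False)                     # 2. covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(C)   # 3. eigenvectors and eigenvalues
    order = np.argsort(eigenvalues)[::-1][:k]       # 4. indices of the k largest eigenvalues
    W = eigenvectors[:, order]
    return Z @ W, eigenvalues[order]                # 5. project onto the new axes

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))   # hypothetical correlated data
scores, top_eigenvalues = pca(X, k=2)
print(scores.shape)   # (50, 2)
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is a handy sanity check.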

Important points to note

PCA tries to compress as much information (variance) as possible into the first PC, as much of the remaining information as possible into the second, and so on.
PCs are harder to interpret than the original features, since each one is a linear combination of all of them.
Eigenvectors of the covariance matrix are actually directions of the axes where there is most variance.

Implementing PCA in scikit-learn

It takes only a few lines to apply the algorithm in Python with sklearn: import the PCA transformer, create an instance, fit it on the training set, and transform the test set. The parameter n_components defines the number of principal components to keep.
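A minimal scikit-learn sketch might look like the following. The random `X_train` and `X_test` arrays are placeholders for your own data, and `n_components=2` is an arbitrary choice; `StandardScaler` is added because PCA in scikit-learn centers the data but does not scale it.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 5))   # hypothetical training features
X_test = rng.normal(size=(20, 5))    # hypothetical test features

scaler = StandardScaler().fit(X_train)        # learn mean/std on the training set only
pca = PCA(n_components=2)                     # keep two principal components
pca.fit(scaler.transform(X_train))            # learn the components from training data
X_test_reduced = pca.transform(scaler.transform(X_test))

print(X_test_reduced.shape)                   # (20, 2)
print(pca.explained_variance_ratio_)          # share of variance per component
```

Fitting both the scaler and PCA on the training set alone, then reusing them on the test set, avoids leaking test-set statistics into the model.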

Why is standard scaling required before calculating a covariance matrix?

PCA calculates a new projection of your dataset, and the new axes are chosen according to the variance of your variables. So a variable with a high standard deviation will have a higher weight in the calculation of the axes than a variable with a low standard deviation. If you standardize your data, all variables have the same standard deviation, thus all variables have the same weight and PCA calculates relevant axes.

In other words, PCA is sensitive to variance, and thus if no standardization is done, large range variables will dominate, leading to biased results and non-optimal principal components.
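The effect of skipping standardization can be seen on a small synthetic example. The "height" and "income" features below are invented, and the helper `first_pc` is a hypothetical name for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
height_m = rng.normal(1.7, 0.1, size=n)                                  # small numeric range
income = 50000 + 60000 * (height_m - 1.7) + rng.normal(0, 8000, size=n)  # large numeric range
X = np.column_stack([height_m, income])

def first_pc(data):
    """Return the eigenvector of the covariance matrix with the largest eigenvalue."""
    C = np.cov(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    return eigenvectors[:, -1]     # eigh sorts ascending, so the last column is the largest

raw_pc = first_pc(X)                               # no scaling
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_pc = first_pc(Z)                            # after standardization

print(np.abs(raw_pc))      # almost all weight on income
print(np.abs(scaled_pc))   # both features weighted equally
```

Without scaling, the first component is essentially just the income axis; after scaling, both features contribute with the same magnitude.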

How is PCA different from linear regression?

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, whereas with linear regression, we’re trying to find a straight line that best fits the data.

Thus, the principal components are uncorrelated with one another and are computed without any target variable, whereas in linear regression the prediction of the output depends directly on the features.
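One way to see the difference is to compare the two lines on the same data. In the sketch below (synthetic data, assumed for illustration), least-squares regression minimizes vertical errors in y, while the first principal component minimizes perpendicular distances to the line, so the two slopes generally differ.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(scale=0.5, size=300)   # hypothetical noisy linear data

# Linear regression of y on x: minimizes squared vertical errors.
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)

# PCA: direction of largest variance in the (x, y) plane.
C = np.cov(np.column_stack([x, y]), rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)
v = eigenvectors[:, -1]          # eigenvector with the largest eigenvalue
pca_slope = v[1] / v[0]          # slope of the first principal axis

print(ols_slope, pca_slope)      # the PCA slope is steeper than the regression slope here
```

Regression treats x and y asymmetrically (y is the target), while PCA treats both variables the same way.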

Advantages of PCA

1. Eradication of correlated features: After implementing PCA on a dataset, all the principal components are uncorrelated with one another, so the model is not biased towards any set of correlated features.

2. Improves algorithm performance: If the input dimensions are too high, then PCA can be used to speed up the algorithm, since it eradicates correlated variables and reduces the dimensions of the data space.

3. Reduces overfitting: Overfitting often occurs when there are too many variables in the dataset relative to the amount of data. So PCA can help with the overfitting issue by reducing the number of features.

4. Improves visualization: It’s very hard to visualize and understand data in high dimensions. PCA transforms high-dimensional data to low-dimensional data so as to make the visualization easier.

Disadvantages of PCA

1. Less interpretable: Principal components are the linear combination of your original features. Principal components are not as readable and interpretable as original features.

2. Data standardization is necessary: You must standardize your data before implementing PCA when the features are on different scales; otherwise the features with the largest ranges will dominate the principal components.

3. Loss of Information: Although principal components try to cover the maximum variance among the features in a dataset, if we don’t select the number of principal components with care, it may miss some information as compared to the original list of features.
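A common way to choose the number of components with some care is to look at the cumulative share of variance retained. The eigenvalues and the 95% threshold below are assumptions for the sake of the example, not recommended values.

```python
import numpy as np

eigenvalues = np.array([4.2, 2.1, 0.9, 0.5, 0.3])   # hypothetical, sorted in decreasing order

ratio = eigenvalues / eigenvalues.sum()     # variance share of each component
cumulative = np.cumsum(ratio)               # variance retained by the first 1, 2, ... components

threshold = 0.95                            # assumed target: keep at least 95% of the variance
k = int(np.searchsorted(cumulative, threshold) + 1)   # smallest k that reaches the threshold

print(cumulative)
print(k)   # 4
```

Everything beyond the first k components is the information we agree to lose; here that is under 5% of the total variance.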

How to prepare data for PCA?

Always standardize your data before doing PCA, because features on different scales produce misleading components. Alternatively, if the features are on different scales, we can simply use the correlation matrix instead of the covariance matrix.
Try to remove redundant or useless features, and convert non-numeric (categorical) data into numeric variables.
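The equivalence between "standardize, then use the covariance matrix" and "use the correlation matrix" can be verified directly. The data below is random and deliberately put on very different scales.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data with three features on very different scales.
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize with the sample std

cov_of_standardized = np.cov(Z, rowvar=False)      # covariance of standardized data
corr_of_raw = np.corrcoef(X, rowvar=False)         # correlation of the raw data

print(np.allclose(cov_of_standardized, corr_of_raw))   # True
```

So running PCA on the correlation matrix of the raw data gives the same components as running it on the covariance matrix of the standardized data.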

Conclusion

In this post, we learned the fundamentals of working with principal component analysis (PCA), including the mathematics behind it. Although PCA is widely used and well supported, it has its share of disadvantages alongside its advantages.

Let me know if you liked the article and how I can improve it. All feedback is welcome. I’ll be exploring the mathematics involved in other foundational machine learning algorithms in future posts, so stay tuned.