PCA vs Linear Regression – Two statistical methods that work in very similar ways but differ in one important respect. In this article we explain what the two methods actually are and where this difference lies.
What is a PCA?
Principal Component Analysis (PCA) is a multivariate statistical method for structuring or simplifying a large data set. The main goal is to reveal relationships in the data by representing them in two or three dimensions.
This method enjoys great popularity in almost all scientific disciplines and is mostly used when variables are highly correlated.
However, PCA is only a reliable method if the data are at least interval-scaled and approximately normally distributed.
Although PCA combines the variables so that redundant information is removed, it does not model the error and residual variance of the data separately.
The following figure shows the basic principle of a PCA. Relationships in high-dimensional data should be represented in a low-dimensional way, with as little loss of information as possible.
The key point of PCA is dimensionality reduction. The aim is to extract the most important features of a data set by replacing the many measured variables with a few new ones that still account for a large proportion of the total variance.
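To make the idea of "few variables, large share of the variance" concrete, here is a minimal sketch using NumPy. The data set, the variable names and the choice of keeping exactly one component are our own assumptions for illustration, not part of any particular analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 variables driven mostly by one hidden factor.
hidden = rng.normal(size=200)
X = np.column_stack([
    hidden + 0.1 * rng.normal(size=200),
    2.0 * hidden + 0.1 * rng.normal(size=200),
    -hidden + 0.1 * rng.normal(size=200),
])

# Centre each variable, then take the SVD of the centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values are proportional to the variance along each component.
explained_ratio = S**2 / np.sum(S**2)

# Keep only the first component: 3 variables become 1 score per sample.
scores = Xc @ Vt[0]

print("share of variance per component:", explained_ratio)
```

Because the three hypothetical variables are almost copies of one another, the first component should carry nearly all of the variance, so very little information is lost by dropping the other two.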
This reduction is done mathematically using linear combinations.
What are linear combinations?
PCA works in a purely exploratory way, searching the data for a linear pattern that best describes the data set.
A linear combination is a weighted sum of the variables. Geometrically, each linear combination can best be thought of as a straight line, that is, a direction, through the variable values.
In the figure below, the linear combinations have been applied to a data set.
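As a small illustration of what a linear combination looks like in practice, the sketch below computes one by hand and as a matrix product. The numbers and the weights are made up for the example; in a real PCA the weights are not chosen by hand but derived from the data.

```python
import numpy as np

# Hypothetical measurements of two variables for four samples.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 5.9],
    [4.0, 8.0],
])

# Hypothetical weights, scaled to unit length so that they define a direction.
w = np.array([1.0, 2.0])
w = w / np.linalg.norm(w)

# Written out: each sample's score is w[0] * x1 + w[1] * x2.
scores_manual = w[0] * X[:, 0] + w[1] * X[:, 1]

# The same linear combination as a matrix-vector product.
scores = X @ w

print(scores)
```

Each sample is reduced from two values to a single score, which is its position along the direction given by the weights.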
How does the algorithm work?
In the principal component analysis procedure, a set of fully uncorrelated principal components is first generated.
These capture the main variation in the data and are also known as latent variables, factors or eigenvectors.
The number of components that can be extracted is given by the data: there are at most as many components as there are variables.
The first principal component is the straight line for which the sum of squared perpendicular distances from the data points is smallest.
This is equivalent to maximizing the variance of the data along that component.
The remaining variance is then explained step by step by the second and subsequent components until the total variance of the data is explained by the principal components.
The first factor always points in the direction of the maximum variance in the data.
The second factor must be perpendicular to it and explain the next-largest share of the variance.
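The steps described in this section can be sketched with NumPy as follows. This is one common way to compute a PCA (eigen-decomposition of the covariance matrix), shown under our own assumptions: the data are simulated and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated 2D data.
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
X = np.column_stack([x, y])

# 1. Centre the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigen-decomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort so that the first component has the largest variance.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
components = eigvecs[:, order].T  # one component per row

# 4. Project the data onto the components to get the scores.
scores = Xc @ components.T

pc1, pc2 = components
print("variance per component:", eigvals)
print("pc1 . pc2 =", pc1 @ pc2)  # close to 0: the components are perpendicular
print("score covariance:\n", np.cov(scores, rowvar=False))  # off-diagonal close to 0
```

The dot product of the two components and the off-diagonal entries of the score covariance should both be zero up to rounding error, which reflects the two properties named above: the components are perpendicular and the resulting scores are uncorrelated.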
PCA vs Linear Regression – How do they Differ?
We have looked at PCA and how it works in detail. But how does it differ from linear regression?
The following illustration contrasts the two methods and shows the main difference.
With PCA, the squared errors are minimized perpendicular to the straight line, so it is an orthogonal regression. In linear regression, the squared errors are minimized in the y-direction.
Thus, linear regression is about finding the straight line that best predicts y from x: x is treated as given and all of the error is attributed to y.
Principal component analysis uses an orthogonal transformation to form the principal components, or linear combinations of the variables.
This difference between the two techniques becomes apparent when the variables are correlated, but not perfectly so. If all points lie exactly on one straight line, both methods recover that same line.
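The difference between the two error definitions can be checked numerically. The following sketch fits both lines to the same simulated data and evaluates each line under both criteria. The data and the helper functions `vertical_sse` and `perpendicular_sse` are our own hypothetical names for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical noisy, positively correlated data.
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)

# Centre both variables so that both lines pass through the origin.
xc = x - x.mean()
yc = y - y.mean()

# Linear regression of y on x: minimizes squared errors in the y-direction.
ols_slope = (xc @ yc) / (xc @ xc)

# PCA: the eigenvector with the largest eigenvalue of the covariance matrix
# minimizes squared perpendicular distances to the line.
cov = np.cov(np.column_stack([xc, yc]), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]
pca_slope = pc1[1] / pc1[0]


def vertical_sse(slope):
    """Sum of squared errors measured in the y-direction."""
    return np.sum((yc - slope * xc) ** 2)


def perpendicular_sse(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    return np.sum((yc - slope * xc) ** 2) / (1.0 + slope ** 2)


print("regression slope:", ols_slope)
print("PCA slope:       ", pca_slope)
print("vertical SSE      (regression, PCA):", vertical_sse(ols_slope), vertical_sse(pca_slope))
print("perpendicular SSE (regression, PCA):", perpendicular_sse(ols_slope), perpendicular_sse(pca_slope))
```

On data like this, the regression line should win on the vertical criterion and the PCA line on the perpendicular criterion, and the PCA line should come out steeper than the regression line. Neither line is "wrong"; they simply answer different questions.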
If you want to know more about machine learning methods and how they work, check out our article on the t-SNE algorithm.