Principal Component Analysis | Dimensionality Reduction | Machine Learning | Applied AI Course

The Power of PCA: A Tool for Reducing Dimensionality and Explaining Variance

In many machine learning applications, it's essential to reduce the dimensionality of high-dimensional data while preserving as much information as possible. Principal Component Analysis (PCA) is a widely used technique that achieves this by transforming the original features into a new set of orthogonal features called principal components. In this article, we'll explore the concept of PCA and its application in reducing dimensionality and explaining variance.

To start with, let's consider why we need to reduce dimensionality in high-dimensional data. Imagine you have a dataset with thousands of features, each representing a specific attribute or characteristic of your data points. While these features might all be relevant, they can also invite the curse of dimensionality: as the number of features grows relative to the number of data points, the data becomes sparse, and models tend to overfit and generalize poorly.

PCA addresses this issue by finding the directions along which the data varies most and combining the original features into new ones called principal components. These principal components are orthogonal to each other, meaning they capture uncorrelated directions of variation, and the leading components together explain most of the variance in the original data.

Now, let's consider how PCA works. To apply PCA, we first calculate the covariance matrix of our (standardized) data, which captures the variance of each feature and the covariance between each pair of features. We then compute the eigenvalues and eigenvectors of this matrix: the eigenvectors define the directions of the principal components, and the eigenvalues give the amount of variance explained by each of them.
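To make these steps concrete, here is a minimal NumPy sketch of the eigendecomposition route, assuming the data has already been standardized; the array names and the randomly generated placeholder data are purely illustrative.

```python
import numpy as np

# Stand-in for a standardized (n_samples, n_features) data matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))

# Covariance matrix of the features: shape (n_features, n_features).
cov = np.cov(X, rowvar=False)

# eigh is used because the covariance matrix is symmetric;
# it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so that the largest eigenvalue (and its eigenvector) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i is the i-th principal component
```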

By selecting a subset of these principal components, we can reduce the dimensionality of our data while preserving most of the information. To decide how large that subset should be, we compute the cumulative sum of the eigenvalues and divide it by their total, which gives the percentage of variance explained by the first k components for every choice of k.

For example, if we want to find the number of principal components that explain 90% of the variance in our data, we simply look for the smallest k at which this cumulative ratio first reaches 0.9. In the provided code snippet, the author computes the percentage of variance explained for each principal component and plots the results.
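As a rough sketch of that selection step, the snippet below finds the smallest number of components whose cumulative variance ratio reaches a 90% target; the small eigenvalue array is made up purely for illustration and would in practice come from a decomposition like the one above.

```python
import numpy as np

# Made-up eigenvalues, sorted in descending order, purely for illustration.
eigenvalues = np.array([5.0, 3.0, 1.5, 0.3, 0.2])

# Fraction of the total variance carried by each component.
explained_ratio = eigenvalues / eigenvalues.sum()

# Cumulative fraction explained by the first k components, for k = 1, 2, ...
cumulative = np.cumsum(explained_ratio)

# Smallest k whose cumulative fraction reaches the 90% target.
target = 0.90
k = int(np.searchsorted(cumulative, target) + 1)
print(f"{k} components explain {cumulative[k - 1]:.1%} of the variance")
```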

The resulting plot shows that the first few principal components already explain a significant amount of variance in the data: roughly 20% of the variance is captured by just a handful of components. As we add more components, the cumulative variance explained rises steeply at first and then gradually levels off, with each additional component contributing less than the one before.

The author then uses this information to determine the number of principal components that explain 90% of the variance in their data. By computing the cumulative sum of the eigenvalues, they find that roughly 200 dimensions correspond to approximately 90% of the variance being explained. In other words, reducing the dimensionality from 784 features to 200 features with PCA sacrifices only about 10% of the variance.

The final section of the code snippet shows how to plot the results. The author uses NumPy's `cumsum` function to accumulate the per-component variance ratios (each eigenvalue divided by the sum of all eigenvalues) and plots the resulting cumulative percentage of variance explained against the number of components.
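The author's actual notebook is not reproduced here, but a sketch along these lines, using scikit-learn's PCA and Matplotlib, would produce such a plot; the `data` array below is a random stand-in for the 784-feature dataset, and the variable names are assumptions rather than the author's exact code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for an (n_samples, 784) dataset such as flattened MNIST digits.
data = np.random.default_rng(0).random((1500, 784))
standardized = StandardScaler().fit_transform(data)

# Keep all 784 components so the full variance spectrum is available.
pca = PCA(n_components=784)
pca.fit(standardized)

# explained_variance_ holds the eigenvalues of the covariance matrix.
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)
cum_var_explained = np.cumsum(percentage_var_explained)

plt.figure(figsize=(6, 4))
plt.plot(cum_var_explained, linewidth=2)
plt.axhline(y=0.9, linestyle="--", color="grey")  # the 90% level discussed above
plt.xlabel("Number of components")
plt.ylabel("Cumulative variance explained")
plt.grid(True)
plt.show()
```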

In conclusion, PCA is a powerful technique for reducing dimensionality and explaining variance in high-dimensional data. By identifying the most important directions of variation and combining the original features into new principal components, we can preserve most of the information while reducing the number of dimensions. The cumulative-variance plot shows how much each additional component contributes, letting us choose how many components to keep for a target level of explained variance.

The author's example demonstrates the practical application of PCA to a real-world dataset. By applying PCA, they reduced the dimensionality of their data from 784 features to 200 features while preserving approximately 90% of the variance. A reduction of this size can make model training faster and easier, especially when dealing with large datasets.

In summary, PCA reduces dimensionality by transforming high-dimensional data into a lower-dimensional representation built from principal components. By computing the eigenvalues and eigenvectors of the covariance matrix, we identify the directions that carry the most variance and project the data onto them. The cumulative-variance plot makes the trade-off explicit, so we can pick the smallest number of components that meets a chosen variance threshold.

"WEBVTTKind: captionsLanguage: enwhile we have used PCA for visualization where we took a 784 dimensional data set and we converted into two dimensional data set and visualized this 2d on on a 2d plane like first principal component on this and second principal component on this this is what we did a while ago but there are other applications of PCA where we want to go from 784 dimensions to let's say 10 dimensions these applications are when your training machine learning models so this happens a lot this is mostly for data visualization right well you might want to go from 784 dimensions using PCA to get to 10 dimensions when your training machine learning models we learn about machine learning models and training little later in the future chapters but there is a case where you might want to convert your data from 784 dimensions to something greater than two if it's two or three it's mostly for data visualization right but if it's like 10 not 20 or to even 200 like this dimensional TD - we are going from D to D - and all we need is that D - is less than equal to D right so we do we do have situations in machine learning where we want to go from a D dimension - D - but D - is not 2 or 3 2 or 3 is mostly for visualization as I just stated right so there are cases again I will not be able to give you all the context on where we do it I promise you I will give you the context on where we exactly do this - 10 dimensions or 20 dimensions or 200 dimensions etcetera when we learn machine learning models but since we are learning PCA rent now I thought I'll cover the topic but the applications of it we will learn when we learn machine mental models later later in this later in this course ok so imagine imagine now that I want to go from 784 dimensional data to let's say 200 dimensions how do I do it now the question is how do I do it it's very simple right I have a data matrix X right which is let's say 15,000 points cross 784 points okay Oryx instead of taking instead of taking numbers here okay now if I multiply this matrix if I multiply this with an other matrix like we call this matrix B okay this V is let's say 784 cross 200 this V is composed of the top 200 eigen vectors okay each of my eigen vectors so if I take covariance matrix of X if I take covariance matrix of X let's see let's say that this is my covariance matrix let's assume X is already standardized just for simplicity okay so this is my covariance matrix and for this covariance matrix I can compute my eigenvalues and eigenvectors where I goes from where I goes from 1 to 780 for the largest one let's say is lambda 1 similarly here I goes from 1 to 784 right the largest vector the largest the eigen vector corresponding the largest eigen value is V 1 so if I make my matrix V such that my first eigenvector my top eigenvector is here and each of my eigenvector remember each of my eigen vector belongs to a 784 dimensional space right if my second eigenvector is written like this so on so forth my 200 I can vector is written like this now this matrix is 784 cross 200 right now this this now by doing this right by by by stacking up each of my eigen vectors I created a matrix which is 784 plus 200 now if I multiply these two if I multiply X with V if I multiply X with me what do I get I get let me call it nu x dash which is 15 K cross 200 this is my representation of my data points X in a 200 dimensional space by the way this is what we exactly did even for two dimensions where I just had two instead of 200 right this is how we convert a 
Now that we understand what it is, the big question that often pops up is: what is the right number? I want to go from 784 dimensions, but should I go to 10, 20, 50, 100, 200, 500, or 700 dimensions? It is mostly 2 or 3 when you want to do visualization; when you do not want to do visualization, what is the right number? That is a big question mark we all have. Let us go back to the fundamentals of PCA. If you recall all the mathematics we worked through, we are trying to maximize the variance of the projected points; that is the fundamental mathematical detail here. PCA, at its core, is a variance-maximization technique: we want to retain as much variance as possible. Now the question is, if I go from 784 dimensions to 10 dimensions, how much of the original variance is explained by just those 10 dimensions? Can we get a sense of that? Because if I can put a number on it, I can make a choice about whether I should go to 10, 20, 30, 200, or 700 dimensions, since the whole objective was maximizing the variance of the projected points. That is where the eigenvalues will be used. Recall that when we computed PCA for a given matrix X, we computed its covariance matrix, and for the covariance matrix we computed its eigenvalues and eigenvectors. Here is the thing: we used the eigenvectors to go to lower dimensions, but we never used the eigenvalues. There is a very nice, beautiful mathematical interpretation of the eigenvalues.

So let us take the eigenvalues λ1, λ2, ..., λ784 for our 784-dimensional dataset (of course, if you have d dimensions it is d rather than 784; I am just sticking with the MNIST example we have been playing with), and assume λ1 ≥ λ2 ≥ ..., so λ1 is the largest eigenvalue, λ2 the second largest, and so on. Suppose I take my dataset from 784 dimensions down to, say, ten dimensions, and I want to understand how much variance is explained, or retained, in those ten dimensions. There is a very simple formula for this: take the sum of the top ten eigenvalues and divide it by the sum of all the eigenvalues (since the covariance matrix is 784 × 784, you have 784 eigenvalues). This ratio tells me the percentage of variance explained. Suppose this number is, say, 0.2. What this means is that 20% of the total variance in 784 dimensions is explained in ten dimensions; and because we said variance is a nice measure of information, only 20% of the information has been retained, or explained, when I project to ten dimensions. If you are okay with that, 10 is a good number. Oftentimes what people say is: when I go from 784 dimensions to d' dimensions through PCA, I want to find a d' that retains 90% of the information, or 90% of the variance. So we have to find the right d': we want (λ1 + λ2 + ... + λ_d') divided by the sum of all the λ_i to be 0.9. If we find a d' such that this ratio is 0.9, then 90% of the information, or variance, is explained using PCA.

So let us go and see some simple code for our MNIST dataset. I am continuing in the same IPython notebook; this is an extension of the earlier exercise where we used scikit-learn to compute PCA on the sample data, so instead of rewriting all the code I am just continuing from there. The only change is that instead of two components I now ask for all 784 components. Setting the number of components to 784 means I want a 784-dimensional transformation: I am going from 784 dimensions to 784 dimensions, except that these are principal components rather than the given features. After fitting and transforming the data, PCA exposes a variable called pca.explained_variance_, which literally gives you the variance explained by each component; if you think about it, these are nothing but your eigenvalues λ_i. So I divide pca.explained_variance_ by the sum of pca.explained_variance_, which gives λ_i divided by the sum of all the λ_i for each i, and I store these in percentage_var_explained: the vector holds λ1 over the total, λ2 over the total, λ3 over the total, and so on. Then I take a cumulative sum of this vector using NumPy's cumsum: the first value is the fraction explained using only the first eigenvector, the second value is (λ1 + λ2) over the total, the third is (λ1 + λ2 + λ3) over the total, and so on, because cumsum keeps adding everything up to and including the current component. After that it is simple code to plot all of this, nothing very fancy.

Now let us see what happens. This plot is very interesting. The x-axis is the number of principal components I want to use, and the y-axis is the cumulative variance explained. One thing you will quickly notice is that 20% of the variance is explained very quickly, using just a few components. Moving further along the curve, I am soon explaining somewhere between 0.75 and 0.8 of the variance. If I want to find the d' that explains roughly 90% of my variance, I go to 0.9 on the y-axis, and that corresponds to roughly 200 dimensions here, so my d' is 200: if I project this data from 784 dimensions to 200 dimensions using PCA, roughly 90 percent of my information, or 90 percent of the variance, is explained. Very simple code, and this plot is super useful in some applications, as we will see later when we learn machine learning models. You might say you want to preserve 95 percent of the information; in that case you might have to pick a number that is probably around 350, since that is where the curve is closer to 0.95. But remember, even if you pick 350 out of 784, that is less than half of 784, which means the remaining components are only adding about 5% of the variance. Even this is a significant reduction: going from 784 to 350 is more than a 50% reduction in your feature space. We will see that this is useful when we learn machine learning models, I promise you; for now, please take it at face value.