Machine Learning Interview Question - Outliers and Loss Functions

The Impact of Outliers on Regression Models
====================================================

In regression models, outliers can have a significant impact on the accuracy and performance of the model. An outlier is an observation that differs markedly from the rest of the data; in a regression setting, it shows up as a point whose residual (the gap between the observed and predicted value) is unusually large. In this article, we will discuss how outliers affect regression models and how to minimize their impact.

**L1 Loss vs L2 Loss**

When it comes to minimizing the loss function in regression models, two common approaches are used: L1 loss and L2 loss. The difference between these two lies in how they handle outliers.

In L1 loss, also known as the absolute error loss, an outlier's contribution grows only linearly with the size of its residual. This is because the L1 loss adds up the absolute values of the residuals, whereas the L2 loss squares them. As a result, an outlier contributes far less to the L1 loss than it would to the L2 loss.
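
Writing the residual of point i as e_i = y_i − f(x_i), where f is the model, the two losses from the question are defined as:

```latex
L_1 = \sum_i |y_i - f(x_i)| = \sum_i |e_i|,
\qquad
L_2 = \sum_i (y_i - f(x_i))^2 = \sum_i e_i^2
```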

To understand this better, let's consider an example. Suppose we have a regression model that predicts a continuous output variable from a single input feature, trained with the mean squared error (MSE) loss, which is the L2 loss. We generate some well-behaved data points and then add a few outliers that sit far from the rest of the data. Those few outliers dominate the MSE and drag the fitted line towards them, far more than they would under a mean absolute error (L1) objective.
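
The following is a minimal numerical sketch of this effect (hypothetical residual values, using only NumPy): it compares what share of the total L1 and L2 loss a single large residual accounts for.

```python
import numpy as np

# 100 "normal" residuals e_i = y_i - f(x_i), plus one outlier with a huge residual.
rng = np.random.default_rng(0)
residuals = rng.normal(scale=1.0, size=100)
residuals = np.append(residuals, 20.0)

l1_total = np.abs(residuals).sum()      # L1 loss: sum of |e_i|
l2_total = np.square(residuals).sum()   # L2 loss: sum of e_i^2

print("outlier share of L1 loss:", abs(residuals[-1]) / l1_total)   # roughly 0.2
print("outlier share of L2 loss:", residuals[-1] ** 2 / l2_total)   # roughly 0.8
```

The single outlier accounts for most of the squared loss but only a modest fraction of the absolute loss, which is exactly why the L2 fit moves so much more.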

**Geometry of L1 Loss and L2 Loss**

To visualize the difference between L1 loss and L2 loss, let's use geometry. Plot the residual e_i on the x-axis and the per-point loss on the y-axis. The L2 loss is a parabola that grows rapidly as the residual moves away from zero. The L1 loss is the absolute value function: a pair of straight 45-degree lines forming a V shape.

When we add an outlier data point with a large residual, its contribution to the L2 loss grows quadratically with that residual and can dwarf the contribution of every normal data point. Under the L1 loss, the same outlier's contribution grows only linearly, so it pulls the total loss, and therefore the fitted parameters, around far less.
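
The sketch below (assuming NumPy and Matplotlib are available) draws both curves so the difference is visible at a glance.

```python
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(-10, 10, 401)            # residuals e_i = y_i - f(x_i)
plt.plot(e, np.abs(e), label="L1: |e|")  # V shape: straight 45-degree lines
plt.plot(e, e ** 2, label="L2: e^2")     # parabola: grows quadratically
plt.xlabel("residual e")
plt.ylabel("per-point loss")
plt.legend()
plt.show()
```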

**Sparsity and L1 Loss**

In linear regression models, sparsity means that many of the learned coefficients (weights) are exactly zero, so the model effectively relies on only a small subset of the features. The goal of regularization techniques like L1 or L2 regularization is to reduce overfitting by adding a penalty term on the weights to the loss function.

In this context, we can ask: which of L1 and L2 will result in more sparsity? To answer this, consider a linear model with weights and a bias. We minimize either the L1 or the L2 loss on the residuals, and we may additionally add a regularization term on the weights, scaled by a hyperparameter λ.
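
Written out, the two standard regularized objectives look like this (a sketch of the usual Lasso and Ridge formulations, with weight vector w and regularization strength λ):

```latex
\text{Lasso (L1 penalty):}\quad \min_{w,b} \sum_i (y_i - f(x_i))^2 + \lambda \|w\|_1
\qquad
\text{Ridge (L2 penalty):}\quad \min_{w,b} \sum_i (y_i - f(x_i))^2 + \lambda \|w\|_2^2
```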

When we use L1 regularization on the weights (Lasso), many coefficients are driven exactly to zero: the absolute-value penalty is non-differentiable at zero, so the optimum often lands on the axes. With L2 regularization (Ridge), the coefficients shrink towards zero but rarely become exactly zero, so there is no such guarantee of sparsity. The crucial catch in the interview question is that sparsity comes from penalizing the weights, not from using an L1 loss on the residuals; neither the plain L1 nor the plain L2 loss guarantees any sparsity.
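
A minimal scikit-learn sketch of this difference (illustrative data and an untuned alpha value, not a definitive recipe):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 5 of which actually matter.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))  # most of them
print("Ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))  # typically none
```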

**Outliers and Loss Functions**

To summarize, outliers have a significant impact on regression models, especially under the L2 loss. The geometry of the two losses, a V-shaped absolute-value function versus a rapidly growing parabola, explains why. Sparsity, on the other hand, depends on regularizing the weights, not on whether the loss on the residuals is L1 or L2.

The key takeaways are:

* L1 loss is less affected by outliers than L2 loss, because an outlier's contribution grows linearly rather than quadratically with its residual.

* Under L2 loss, a single point with a very large residual can dominate the total loss and pull the fitted parameters towards it.

* Sparsity in linear regression models comes from regularization of the weights (for example, an L1 penalty as in Lasso), not from the choice of L1 or L2 loss on the residuals.

By understanding how to handle outliers and choose the right loss function for our problem, we can build more robust and accurate regression models.

"WEBVTTKind: captionsLanguage: enhi friends here is a very interesting interview question about outliers and robustness to outliers so first i'll explain the the the question itself then i will request you pause the video try to think of the solution but i'll also explain the solution itself in this video right so imagine you have a data set which you're going to use to train your model let's say you met let's assume that the data set contains pairs of x i and y i x i are your data points x i is d dimensional real valued y i is also real value which means what you are going to train using this training data is a regression model because your y is a real value now imagine imagine you have a model that you build let's assume the model is called f think of this model as something like let's say a linear regression or a neural network model with some weights okay so let's assume this is your model f your model takes each x i as input and it predicts what y i is so the predicted value is called as y i hat right so this is the setup this is a problem setup this is a standard problem setup of any regression problem now while solving this regression problem let's say using a linear regression or using a neural network based model let's assume you have two choices either you can use your loss as l1 i'll define what l1 is or you can use l2 now l1 is defined as summation over all the i's the absolute value of the difference between y i which is the actual value and f x i which is a predicted value predicted by your model f right so this is how you define your l1 similarly let's assume there is an other loss called as l2 okay this l2 is summation over all i all of your training points basically y i minus f of x i whole squared okay so this is this is sort of like like your squared loss right this is very similar to this is this is your squared loss in in in regression right so this is this is like an absolute value of the difference between y i a and f x i now given the choice you can either use let's say l 1 loss or you can use l 2 loss now the most important question here is which of these which of l 1 or l 2 would work better if the data set d contains lots of outliers right i hope the question is clear so you have a standard regression setup you can either use l1 l1 loss which is defined as this or you can use l2 loss which is defined as this now which one would we use if you know that the data set contains outliers and which one would you use or which one would work better in reality right if the data set d contains lots of outliers that's the question so i would recommend you pause this video here and think about how to how to answer this again this is about applying the foundational mathematics to solving real world problems because outliers are a daily reality in real world machine learning and of course you should you should decide whether you will use l1 or l2 and you should be able to justify mathematically or geometrically why you're using l1 or l2 okay so i'm assuming that you have spent some time thinking about it here here is how big here is one way of solving it okay so let's let's write it down okay so let us say y i minus f of x i let us call it as e i or in other words error i because this f of x i is nothing but y i hat this is what the model predicted right the difference between the actual observed value and the predicted value we are just calling it ei now what are outliers basically if you think logically what is an outlier your model tried to predict given the data again this 
model can be a neural network model or a linear regression model whatever model you want to use but whatever you do these outliers are misbehaving they don't they don't they don't fit into the model very well or in other words your ei is typically very large that's one way of defining an outlier in this context right now so this is the first observation that you have to make okay that outliers basically mean that your eis which is nothing but the difference of y i and y i hat are large that that's how you can think of outliers mathematically the second key observation here is this so let's say you i draw e i on x axis and the loss on y axis how does my l1 look like look at this if you plot your l1 and l2 your l1 looks like the absolute value function right this is your absolute your l 1 is nothing but your absolute value function right so it will be like a straight line here straight line here 45 degree lines right again remember the x axis here is e i right so l one will look like this green line which is straight lines okay it's a straight 45 degree line it's a straight 45 degree line now how does l 2 look like look at this if you write if you you can write these formulations of l 1 l 2 like this l 1 is nothing but summation over i it is e of absolute value of e i right what is l 2 it is summation over all i e i squared so if you look at it with respect to e i the loss itself is squared which means it looked like a parabola like this it looked like a parabola like this right so l2 will look like this parabolic curve while l1 loss with respect to e1 looks like the straight line now look at this here i mean this is the second key observation that you have to make the third most important the third most important step in solving this is this imagine if ei is very large let's say umea is very very large let's say your mia is here some large value okay so let's assume your ei is large again when i say ei is large i mean it could be a large positive value or it could be a large negative value with that that's important that's important okay because if your if this is your axis ei let's say you let's assume you go here this your outliers will mostly have a very large positive value or it will have a very large negative value so most of your non outlier points the error of y i minus f of x i will be small which means they will they will lie somewhere in this region right somewhere i mean i'm just approximately pointing this but the outliers will have a large positive value or a large negative value i have to make that very clear which means they'll most likely lie in this region in these in these very far away regions because the errors are high either large positive or large negative values now notice that if if the error is large what happens your l2 is growing quadratically your l2 is growing very very fast as compared to l1 look at this look at this point itself look at this point at this point your l1 is smaller significantly smaller than your l2 look at this your l1 is significantly smaller than your l2 which means for an outlier point whether it is a outlier with a large positive value or a large negative value even here imagine you take out suppose imagine there is an outlier point here the l1 will be significantly smaller than your l2 l2 will be significantly larger than l1 so now what happens is if you have lot of outliers the contribution of the outlier again because what do you do you have a loss what do you do in regression you try to minimize these laws with respect to some 
parameters like weights and biases that you have that's how you solve these problems right similarly if you have l2 l2 loss what are you trying to do you're trying to minimize this with respect to some weights and biases in your neural network or your linear regression model right so now look at this outliers will contribute less to this loss if you are using a l1 or absolute value function here rather than a quadratic function if you are using a squared function its contribution to the loss is more because its contribution because remember it's a quadratic curve right it's a quadratic curve right l2 corresponds to a quadratic curve l one corresponds to a straight line a quadratic function grows significantly faster than the absolute value function so what is happening here the impact of outliers is significantly more with if you're trying to minimize the l2 loss than if you are minimizing the l1 loss which in other words means l2 will be more impacted by your outliers than l1 and hence in this case it is preferable to use this l1 loss than the l2 loss if you want a better model right at the end of the day you don't want your outliers to impact your model right in your minimization process you're trying to compute the parameters of the model you want you want this out layers to impact less right in l2 in the case of l2 they are impacting more than l1 so in this case it is preferable to use l1 loss than l2 loss again this whole thing can be explained very easily by just using this simple geometry if you if you can just quickly realize that y i minus f of x i is e i and what is what does what does how do outliers correspond to e i's what is the plot of this l1 loss l2 loss and if you can just connect this optimization concept here it's very simple to answer now here is a quick follow-up question okay those of you are interested okay so to this to the same setup okay to the same setup in lot of interviews the interviewer typically starts with a problem and extends the problem in multiple directions right a quick follow-up question for this problem is which of the which of the following which of l1 or l2 will result in a more sparsity in weights if you if let's assume your model f is let's say a neural network model which has a bunch of weights right and you're trying to minimize these weights so let's say you decide let's say you you have a bunch of weights in your neural network model right whether you you can either use l1 loss or l2 loss right l2 loss of course there will also be biases here you can either minimize l1 loss or l2 loss which of these l1 loss or l2 loss will result in us signific will result in more sparsity will result in more sparsity in your weights if if your model itself is a neural network right please pause and think about this follow-up question so i'm hoping that you've given it some thought many people will directly jump and say hey l1 will result in sparsity because they remember this concept that l1 regularization results in sparsity that's what they remember right and they will quickly jump to the conclusion that using l1 loss will result in sparsity which is incorrect because remember here the loss itself is l1 loss or l2 loss we have not regularized either of these models the sparsity of the model will depend on the regularization here see imagine to this to this if you add some lambda multiplied by some regularization if you add l one regularization here imagine if you add l one regularization here then you would get sparsity okay the sparsity is not dependent 
on the loss but on the regular razor okay in in this context in the definition of the problem we didn't say anything about regularization we only talked about l1 loss versus l2 loss right and what are we computing this is l1 on the residuals these again by the way these ei's are also sometimes referred to as residuals or errors right so this is not a l1 regularization on the weights or biases this is l1 regularization on the residuals or errors so neither of these models is guaranteed to give you any sparsity because sparsity depends on regularizing the weights and biases and not the loss function itself that's an important catch that you should get here okay i hope i hope these two questions give you some idea on how to think about translating the mathematics translating the geometric intuition that you have built to solving and attacking real world problems like how do you make your model more robust to outliershi friends here is a very interesting interview question about outliers and robustness to outliers so first i'll explain the the the question itself then i will request you pause the video try to think of the solution but i'll also explain the solution itself in this video right so imagine you have a data set which you're going to use to train your model let's say you met let's assume that the data set contains pairs of x i and y i x i are your data points x i is d dimensional real valued y i is also real value which means what you are going to train using this training data is a regression model because your y is a real value now imagine imagine you have a model that you build let's assume the model is called f think of this model as something like let's say a linear regression or a neural network model with some weights okay so let's assume this is your model f your model takes each x i as input and it predicts what y i is so the predicted value is called as y i hat right so this is the setup this is a problem setup this is a standard problem setup of any regression problem now while solving this regression problem let's say using a linear regression or using a neural network based model let's assume you have two choices either you can use your loss as l1 i'll define what l1 is or you can use l2 now l1 is defined as summation over all the i's the absolute value of the difference between y i which is the actual value and f x i which is a predicted value predicted by your model f right so this is how you define your l1 similarly let's assume there is an other loss called as l2 okay this l2 is summation over all i all of your training points basically y i minus f of x i whole squared okay so this is this is sort of like like your squared loss right this is very similar to this is this is your squared loss in in in regression right so this is this is like an absolute value of the difference between y i a and f x i now given the choice you can either use let's say l 1 loss or you can use l 2 loss now the most important question here is which of these which of l 1 or l 2 would work better if the data set d contains lots of outliers right i hope the question is clear so you have a standard regression setup you can either use l1 l1 loss which is defined as this or you can use l2 loss which is defined as this now which one would we use if you know that the data set contains outliers and which one would you use or which one would work better in reality right if the data set d contains lots of outliers that's the question so i would recommend you pause this video here and think about how 
to how to answer this again this is about applying the foundational mathematics to solving real world problems because outliers are a daily reality in real world machine learning and of course you should you should decide whether you will use l1 or l2 and you should be able to justify mathematically or geometrically why you're using l1 or l2 okay so i'm assuming that you have spent some time thinking about it here here is how big here is one way of solving it okay so let's let's write it down okay so let us say y i minus f of x i let us call it as e i or in other words error i because this f of x i is nothing but y i hat this is what the model predicted right the difference between the actual observed value and the predicted value we are just calling it ei now what are outliers basically if you think logically what is an outlier your model tried to predict given the data again this model can be a neural network model or a linear regression model whatever model you want to use but whatever you do these outliers are misbehaving they don't they don't they don't fit into the model very well or in other words your ei is typically very large that's one way of defining an outlier in this context right now so this is the first observation that you have to make okay that outliers basically mean that your eis which is nothing but the difference of y i and y i hat are large that that's how you can think of outliers mathematically the second key observation here is this so let's say you i draw e i on x axis and the loss on y axis how does my l1 look like look at this if you plot your l1 and l2 your l1 looks like the absolute value function right this is your absolute your l 1 is nothing but your absolute value function right so it will be like a straight line here straight line here 45 degree lines right again remember the x axis here is e i right so l one will look like this green line which is straight lines okay it's a straight 45 degree line it's a straight 45 degree line now how does l 2 look like look at this if you write if you you can write these formulations of l 1 l 2 like this l 1 is nothing but summation over i it is e of absolute value of e i right what is l 2 it is summation over all i e i squared so if you look at it with respect to e i the loss itself is squared which means it looked like a parabola like this it looked like a parabola like this right so l2 will look like this parabolic curve while l1 loss with respect to e1 looks like the straight line now look at this here i mean this is the second key observation that you have to make the third most important the third most important step in solving this is this imagine if ei is very large let's say umea is very very large let's say your mia is here some large value okay so let's assume your ei is large again when i say ei is large i mean it could be a large positive value or it could be a large negative value with that that's important that's important okay because if your if this is your axis ei let's say you let's assume you go here this your outliers will mostly have a very large positive value or it will have a very large negative value so most of your non outlier points the error of y i minus f of x i will be small which means they will they will lie somewhere in this region right somewhere i mean i'm just approximately pointing this but the outliers will have a large positive value or a large negative value i have to make that very clear which means they'll most likely lie in this region in these in these very far away regions 
because the errors are high either large positive or large negative values now notice that if if the error is large what happens your l2 is growing quadratically your l2 is growing very very fast as compared to l1 look at this look at this point itself look at this point at this point your l1 is smaller significantly smaller than your l2 look at this your l1 is significantly smaller than your l2 which means for an outlier point whether it is a outlier with a large positive value or a large negative value even here imagine you take out suppose imagine there is an outlier point here the l1 will be significantly smaller than your l2 l2 will be significantly larger than l1 so now what happens is if you have lot of outliers the contribution of the outlier again because what do you do you have a loss what do you do in regression you try to minimize these laws with respect to some parameters like weights and biases that you have that's how you solve these problems right similarly if you have l2 l2 loss what are you trying to do you're trying to minimize this with respect to some weights and biases in your neural network or your linear regression model right so now look at this outliers will contribute less to this loss if you are using a l1 or absolute value function here rather than a quadratic function if you are using a squared function its contribution to the loss is more because its contribution because remember it's a quadratic curve right it's a quadratic curve right l2 corresponds to a quadratic curve l one corresponds to a straight line a quadratic function grows significantly faster than the absolute value function so what is happening here the impact of outliers is significantly more with if you're trying to minimize the l2 loss than if you are minimizing the l1 loss which in other words means l2 will be more impacted by your outliers than l1 and hence in this case it is preferable to use this l1 loss than the l2 loss if you want a better model right at the end of the day you don't want your outliers to impact your model right in your minimization process you're trying to compute the parameters of the model you want you want this out layers to impact less right in l2 in the case of l2 they are impacting more than l1 so in this case it is preferable to use l1 loss than l2 loss again this whole thing can be explained very easily by just using this simple geometry if you if you can just quickly realize that y i minus f of x i is e i and what is what does what does how do outliers correspond to e i's what is the plot of this l1 loss l2 loss and if you can just connect this optimization concept here it's very simple to answer now here is a quick follow-up question okay those of you are interested okay so to this to the same setup okay to the same setup in lot of interviews the interviewer typically starts with a problem and extends the problem in multiple directions right a quick follow-up question for this problem is which of the which of the following which of l1 or l2 will result in a more sparsity in weights if you if let's assume your model f is let's say a neural network model which has a bunch of weights right and you're trying to minimize these weights so let's say you decide let's say you you have a bunch of weights in your neural network model right whether you you can either use l1 loss or l2 loss right l2 loss of course there will also be biases here you can either minimize l1 loss or l2 loss which of these l1 loss or l2 loss will result in us signific will result in more 
sparsity will result in more sparsity in your weights if if your model itself is a neural network right please pause and think about this follow-up question so i'm hoping that you've given it some thought many people will directly jump and say hey l1 will result in sparsity because they remember this concept that l1 regularization results in sparsity that's what they remember right and they will quickly jump to the conclusion that using l1 loss will result in sparsity which is incorrect because remember here the loss itself is l1 loss or l2 loss we have not regularized either of these models the sparsity of the model will depend on the regularization here see imagine to this to this if you add some lambda multiplied by some regularization if you add l one regularization here imagine if you add l one regularization here then you would get sparsity okay the sparsity is not dependent on the loss but on the regular razor okay in in this context in the definition of the problem we didn't say anything about regularization we only talked about l1 loss versus l2 loss right and what are we computing this is l1 on the residuals these again by the way these ei's are also sometimes referred to as residuals or errors right so this is not a l1 regularization on the weights or biases this is l1 regularization on the residuals or errors so neither of these models is guaranteed to give you any sparsity because sparsity depends on regularizing the weights and biases and not the loss function itself that's an important catch that you should get here okay i hope i hope these two questions give you some idea on how to think about translating the mathematics translating the geometric intuition that you have built to solving and attacking real world problems like how do you make your model more robust to outliers\n"