Interview question on SVMs for Machine Learning roles.

**The Question**

Here is an interesting question that is typically asked of data scientists and machine learning candidates, and it is all about overfitting a support vector machine. Imagine you are given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ for binary classification, where each $x_i$ is a feature vector and each $y_i$ is a binary label. The data is non-linearly separable, and you are asked to fit a hard margin SVM (note: not a soft margin SVM). The question: given such a dataset, how do you overfit a hard margin SVM? This question sits on the easier end of the spectrum, and the two subtle hints are exactly the details given: it is a hard margin SVM, and the data is non-linearly separable. Pause here and try to answer on your own before reading the solution.

**Why the Hard Margin Primal Has No Hyperparameter**

First, because the data is non-linearly separable, a linear SVM won't cut it; you will have to use a kernel SVM. Second, look at the mathematical formulation of the hard margin SVM. It has no hinge loss term: all you are doing is maximizing the margin, or equivalently minimizing its inverse, subject to every point lying on the appropriate side of the margin hyperplanes:

$$\min_{w,\, b} \ \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y_i\,(w^\top x_i + b) \geq 1 \ \ \text{for all } i.$$

If the question had said soft margin SVM, you could fine-tune the penalty parameter $C$ (some formulations place a $\lambda$ on the regularizer instead) and overfit or underfit by increasing or decreasing it. Some people mistakenly answer "I'll play around with $C$", but a hard margin SVM has no such term: there is no hinge loss and no $C$. You minimize over $w$ and $b$, and no hyperparameter appears anywhere in this primal that you could tinker with to overfit or underfit.

**Understanding Kernel SVMs and Hyperparameters**

Since the primal offers no hyperparameters, the next thing to look at is the kernel: kernels have hyperparameters we can play with. Recall one of the most powerful and widely used kernels for non-linearly separable data. We are given no other properties of this data (we are not told it forms concentric circles or anything like that), and when you don't know the structure of non-linearly separable data, the radial basis function (RBF) kernel is a very popular choice that also works fairly well in practice. You would not reach for a polynomial kernel or similar here; the RBF kernel is a much more generic kernel that works well on non-linearly separable data. So the decision is made: a hard margin SVM, with no hyperparameters in its primal formulation, plus an RBF kernel. A quick sanity check of the "linear won't cut it" claim appears below; after that, let's dive into the RBF kernel.
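As a hedged illustration of that first step, here is a short scikit-learn sketch. It uses `make_moons` purely as a stand-in for generic non-linearly separable data, and the exact scores will vary; the point is only that a linear SVM plateaus on the training set while an RBF kernel SVM can fit it:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy non-linearly separable data: two interleaving half-moons.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, C=1e10).fit(X, y)  # huge C approximates a hard margin
    print(f"{kernel:6s} kernel -> training accuracy = {clf.score(X, y):.3f}")
# Typically the linear kernel misclassifies part of the training set while
# the RBF kernel reaches (or nearly reaches) 100% training accuracy.
```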

**The Radial Basis Function (RBF) Kernel**

What does the RBF kernel look like? For two points $x_1$ and $x_2$, the RBF kernel is

$$K(x_1, x_2) = \exp\!\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right) = \exp\!\left(-\gamma\,\|x_1 - x_2\|^2\right), \quad \text{where } \gamma = \frac{1}{2\sigma^2}.$$

The exponent is the squared Euclidean distance between the two points divided by $2\sigma^2$, and $\sigma$ is often referred to as the scale (or bandwidth) of the RBF kernel. Some of you may have learned the equivalent $\gamma$ notation instead: scikit-learn, for example, does not use $\sigma$ at all and exposes `gamma` as the hyperparameter in its code. It doesn't matter which you prefer. I'll talk in terms of $\sigma$, but every interpretation below carries over to $\gamma$, since in that notation the exponent is $-\gamma\,\|x_1 - x_2\|^2$ with $\gamma = 1/(2\sigma^2)$.
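To make the two parameterizations concrete, here is a minimal numpy sketch (the function names `rbf_sigma` and `rbf_gamma` are illustrative, not library functions) showing that the $\sigma$ form and scikit-learn's $\gamma$ form compute the same similarity:

```python
import numpy as np

def rbf_sigma(x1, x2, sigma):
    """RBF kernel in the sigma parameterization: exp(-||x1 - x2||^2 / (2 sigma^2))."""
    sq_dist = np.sum((x1 - x2) ** 2)  # squared Euclidean distance
    return np.exp(-sq_dist / (2 * sigma ** 2))

def rbf_gamma(x1, x2, gamma):
    """Same kernel in scikit-learn's gamma parameterization: exp(-gamma ||x1 - x2||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

x1, x2 = np.array([1.0, 2.0]), np.array([2.0, 0.5])
sigma = 0.8
gamma = 1 / (2 * sigma ** 2)  # the conversion between the two notations
assert np.isclose(rbf_sigma(x1, x2, sigma), rbf_gamma(x1, x2, gamma))
```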

**Understanding Hyperparameters**

Now, in an RBF kernel, what are the hyperparameters that we can play with? The biggest one is $\sigma$: it is the hyperparameter we can tune to either overfit or underfit. What happens if $\sigma$ is small? Intuitively, if you plot the RBF kernel as a function of the distance between the two points, reducing $\sigma$ makes the curve narrower and narrower, which means only points that are very close together will be treated as similar. Remember what a kernel is at its very core: a similarity-measuring function.
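A tiny numeric sketch of that narrowing (the fixed distance of 1 is arbitrary): hold the squared distance between two points constant and shrink $\sigma$, and the similarity score collapses toward zero.

```python
import numpy as np

# Two points at a fixed squared Euclidean distance of 1; only sigma varies.
sq_dist = 1.0
for sigma in [2.0, 1.0, 0.5, 0.1]:
    similarity = np.exp(-sq_dist / (2 * sigma ** 2))
    print(f"sigma = {sigma:3.1f}  ->  K(x1, x2) = {similarity:.6f}")
# Roughly 0.88, 0.61, 0.14, then ~0: with a small sigma, points at this
# distance are no longer considered similar at all.
```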

**The Effect of Sigma on Overfitting**

If your $\sigma$ is small, then only points that are very close to each other (pairs whose distance $\|x_1 - x_2\|$ is very small) will be treated as similar points, and as $\sigma$ reduces (we have discussed this in lots of detail in our course videos as well) you tend to overfit the model to the given data. As $\sigma$ increases, the opposite happens: the width of the kernel grows, so even points that are farther and farther apart are treated as similar. In short, if $\sigma$ is larger you will underfit the model, and if $\sigma$ is smaller you will overfit it. Now recall the task at hand: how do you overfit an SVM, given that we have to use a hard margin SVM on non-linearly separable data?
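To see the trade-off empirically, here is a hedged scikit-learn sketch. It uses a soft margin `SVC` at its default `C` purely to isolate the effect of `gamma` (remember, large $\gamma$ means small $\sigma$), and the exact scores depend on the data:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Small gamma <=> large sigma (underfit); large gamma <=> small sigma (overfit).
for gamma in [0.01, 1.0, 1000.0]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma = {gamma:7.2f}  train = {clf.score(X_tr, y_tr):.3f}  "
          f"test = {clf.score(X_te, y_te):.3f}")
# Typical pattern: the smallest gamma underfits (modest train and test scores),
# while the largest gamma memorizes the training set and drops on the test set.
```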

**Using RBF Kernel SVM for Overfitting**

In a hard margin SVM we don't have the hyperparameter $C$ to play with, so the only hyperparameters available are the kernel's. Given that we know the data is non-linearly separable, we immediately conclude that we want a kernel SVM, and specifically an RBF kernel SVM. As soon as you arrive at that, you can see that the RBF kernel brings $\sigma$ with it. So, to overfit under these requirements, we simply use a hard margin RBF kernel SVM with a small value of $\sigma$ (equivalently, a large $\gamma$); a sketch of this recipe follows below.

Interview questions like these try to test your deeper understanding of the concepts: do you really understand the hard margin SVM, the differences between hard margin and soft margin SVMs, the RBF kernel and its hyperparameters, and how to overfit an RBF kernel SVM? The question is not hard if you know the basic foundations of SVMs, the mathematical formulations of hard and soft margin SVMs, and the mathematics underlying the RBF kernel; it is a fairly straightforward question.
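Putting it all together as a sketch: scikit-learn's `SVC` is a soft margin implementation with no explicit hard margin mode, so the very large `C` below is an assumption used to approximate one, and `gamma=1e3` plays the role of a small $\sigma$:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Stand-in for the interview's non-linearly separable binary dataset D.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# A huge C makes margin violations effectively forbidden, approximating a
# hard margin SVM; gamma = 1 / (2 * sigma^2), so a small sigma means a large gamma.
clf = SVC(kernel="rbf", C=1e10, gamma=1e3).fit(X, y)
print("training accuracy:", clf.score(X, y))  # typically 1.0, i.e. fully overfit
```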

"WEBVTTKind: captionsLanguage: enhere is an interesting question which is typically asked for data scientists and for machine learning roles in many companies and this question is all about overfitting a support vector machine again this question is on the easier side of questions that you can expect and the question here is as follows imagine that you're given a data set d which consists of pairs x i y i right your standard classification setting let's assume we're using support vector machines for classification and let's assume you're given n points as your training data let's assume this is your training data okay and let's assume that this data it's also simplified and assume that it's binary classification okay so your y is are binary your x i is basically your feature vector right let's also assume that the data given to you is non-linearly separable and the question that is asked of you is imagine that you're trying to fit a hard margin svm this is not a soft marginal frame please note that okay so you're trying to fit a hard margin svm the question here is given a data set like this wherein you're trying to perform binary classification and given that the data is non-linearly separable how do you over fit a hard margin svm that's the question again this question is surely on the easier end of the spectrum of questions that you can expect okay because it's all about how do you over fit a specific model of course couple of couple of details given here are that it's a hard margin svm not a soft margin svm and the data is non-linearly separable that's all those are the two subtle hints given to you so i would like you to pause this video now and think about answering this question on your own before checking the rest of the video where i'll explain the solution okay so let's dive in first of all one fact given to us is a hard margin svm the second fact given to us is a non-linear separable data because you have non-linear separable data a linear svm won't cut so you will have to use a kernel svm so the first hint because the data is non-linear is non-linearly separable and you want to over fit to this data so what you need to use is a kernel sphere what kernel we will see that in a couple of minutes okay and it's also given to you that it's a hard margin sphere so if you look at the mathematical formulation of hard margin svm right so you don't have any hinge loss term you don't have any term here so typically you have the hinge loss term here right in a soft sphere you don't have any of that in a hard margin sphere all you're trying to do is you're trying to maximize this margin so what is your margin in svm this is your margin right this is your margin so you're trying to maximize margin or in other words you're trying to minimize the inverse of margin such that all the points are on the appropriate sides of the hyperplanes okay y i w transpose x i plus b is greater than equal to 1 for all points i for all i this has to be true that's what is your hard margin sphere imagine if this was a soft margin svm you could fine tune this parameter c here right to overfit or under fit if it was i mean if the question said soft margin svm typically what people say here is we have this parameter c here again depends on how you write mathematically some people place this hyper parameter c here some people place a lambda here depending on whichever whichever you use you will have this hyper parameter c in the case of a soft margin and by increasing c or by reducing c you can either over fit or 
under fit but in a hard margin svm you don't have any of this term here right you don't have any of this hinge loss term here you're trying to just maximize the margin or minimize the inverse of the margin such that all points are correctly classified so you can't say i'll change so some people mistakenly say hey i'll play around with this hyper parameter c you don't have that hyper parameter in a hard margin sphere note that right so this is your formulation what are hyper parameters do you have here because one way to over fit or under fit is to play with your hyper parameters right so there is no hyper parameter you are trying to minimize with respect to w and b obviously so there is no hyper parameter that you see here this is just your margin right so there is no hyper parameter that you see here that you can that you can tinker to either overfit or under it so which means and what is the task at hand our task at hand is to use a hard margin svm i don't have any hyper parameters to play with so the next thing that i should look at is my kernel right kernels have hyper parameters that i can play with and if you recall one of the one of the most powerful and widely used kernels for non-linear separable data we are not given any other properties about this non-linearly suppliable date we are not told that its concentric circles or anything like that so if you don't know the properties of the non-linearly separable data just using a radial basis function kernel is is a very popular choice and it also works fairly well in practice right you will not be using a polynomial kernel or any of those kernels an rbf kernel is a much more generic kernel that works well on nonlinearly separable data so okay so we decide okay so we're going to use a hard margin svm no hyper parameters in this primal formulation itself so let's let's dive into rbf kernel so how does rbf kernel look like so if you have two points x1 and x2 the rbf kernel is basically e power minus the distance squared between both these points this is basically your euclidean distance squared divided by 2 sigma squared right again some of you might have learned with this notation where you have gamma equals to 1 by 2 sigma square right so this 1 by 2 sigma square is often read as gamma it is also referred to as scale of the rbf kernel like if you if you are familiar with scikit learn one of the notations they use is they don't use this notation with sigma when you write code in a scalar they use the gamma as the hyper parameter okay so it doesn't matter i'll talk in terms of sigma but the same interpretation can be made in terms of gamma also if you are familiar with this notation in this notation what happens it is minus gamma into the euclidean distance squared so 1 by 2 sigma square is replaced by gamma okay so now in an rbf kernel what are the hyper parameters that we can play with the biggest hyper parameter we have is the sigma here this sigma is a hyper parameter that we can play with right this is a hyper parameter that we can play with to either overfit or under fit right now what happens if this sigma is small so if i reduce the sigma okay i use this arrow mark to symbolize that we are reducing the sigma right if you reduce the sigma what happens intuitively this this rbf kernel if you plot it if you plot it like this your rbf kernel itself becomes narrower and narrower which means only points which are very close together will be treated as similar points because what is a curl at its very core kernel is basically a 
similarity measuring function right if your sigma is very small then only points which are very close to each other if the distance between x1 and x2 is very small only then they will be treated as similar points and as sigma reduces again we have discussed this in lots of detail in our course videos also as sigma reduces you tend to over fit the model to the given data as sigma increases what happens as sigma increases farther and farther points so if you look at this x1 minus x2 further and further points also because the width of this kernel is more right so for even farther and farther points will be treated as being similar so if if if the sigma is larger then you will under fit the model if the sigma is smaller you will over fit the model so what is that task at hand our task at hand here is how do you over fit an svm given that we have to use a hard margin svm on nonlinearly separable data so in a hard margin svm we don't have this hyper parameter c that we can play with if we don't have this c right so what are the hyper parameters we have it is only the kernel hyper parameters that we have given that we know that the data is non-linearly separable we should immediately come to the conclusion that we want to use a kernel svm and that to an rbf kernel spm and as soon as you arrive at the fact that we have to use the rbf kernel svm you can see that the rbf kernel svm has the sigma so to overfit given our requirements we simply use an rbf kernel svm with a small value of sigma to over fit on this problem again these types of interview questions try to test your deeper understanding of the concept they try to understand whether whether you whether you really really understand the hard margin svm do you understand the differences between hard margin and soft marginal stream do you understand about the rbf kernel do you understand about the hyper parameters in the rbf kernel do you understand how to over fit an rbf kernel svm again this question is not hard if you know the basic foundations of svm if you know the basic mathematical formulation of hard and soft margin svm if you know the mathematics underlying rbf kernel this is a fairly straightforward questionhere is an interesting question which is typically asked for data scientists and for machine learning roles in many companies and this question is all about overfitting a support vector machine again this question is on the easier side of questions that you can expect and the question here is as follows imagine that you're given a data set d which consists of pairs x i y i right your standard classification setting let's assume we're using support vector machines for classification and let's assume you're given n points as your training data let's assume this is your training data okay and let's assume that this data it's also simplified and assume that it's binary classification okay so your y is are binary your x i is basically your feature vector right let's also assume that the data given to you is non-linearly separable and the question that is asked of you is imagine that you're trying to fit a hard margin svm this is not a soft marginal frame please note that okay so you're trying to fit a hard margin svm the question here is given a data set like this wherein you're trying to perform binary classification and given that the data is non-linearly separable how do you over fit a hard margin svm that's the question again this question is surely on the easier end of the spectrum of questions that you can expect okay because it's 
all about how do you over fit a specific model of course couple of couple of details given here are that it's a hard margin svm not a soft margin svm and the data is non-linearly separable that's all those are the two subtle hints given to you so i would like you to pause this video now and think about answering this question on your own before checking the rest of the video where i'll explain the solution okay so let's dive in first of all one fact given to us is a hard margin svm the second fact given to us is a non-linear separable data because you have non-linear separable data a linear svm won't cut so you will have to use a kernel svm so the first hint because the data is non-linear is non-linearly separable and you want to over fit to this data so what you need to use is a kernel sphere what kernel we will see that in a couple of minutes okay and it's also given to you that it's a hard margin sphere so if you look at the mathematical formulation of hard margin svm right so you don't have any hinge loss term you don't have any term here so typically you have the hinge loss term here right in a soft sphere you don't have any of that in a hard margin sphere all you're trying to do is you're trying to maximize this margin so what is your margin in svm this is your margin right this is your margin so you're trying to maximize margin or in other words you're trying to minimize the inverse of margin such that all the points are on the appropriate sides of the hyperplanes okay y i w transpose x i plus b is greater than equal to 1 for all points i for all i this has to be true that's what is your hard margin sphere imagine if this was a soft margin svm you could fine tune this parameter c here right to overfit or under fit if it was i mean if the question said soft margin svm typically what people say here is we have this parameter c here again depends on how you write mathematically some people place this hyper parameter c here some people place a lambda here depending on whichever whichever you use you will have this hyper parameter c in the case of a soft margin and by increasing c or by reducing c you can either over fit or under fit but in a hard margin svm you don't have any of this term here right you don't have any of this hinge loss term here you're trying to just maximize the margin or minimize the inverse of the margin such that all points are correctly classified so you can't say i'll change so some people mistakenly say hey i'll play around with this hyper parameter c you don't have that hyper parameter in a hard margin sphere note that right so this is your formulation what are hyper parameters do you have here because one way to over fit or under fit is to play with your hyper parameters right so there is no hyper parameter you are trying to minimize with respect to w and b obviously so there is no hyper parameter that you see here this is just your margin right so there is no hyper parameter that you see here that you can that you can tinker to either overfit or under it so which means and what is the task at hand our task at hand is to use a hard margin svm i don't have any hyper parameters to play with so the next thing that i should look at is my kernel right kernels have hyper parameters that i can play with and if you recall one of the one of the most powerful and widely used kernels for non-linear separable data we are not given any other properties about this non-linearly suppliable date we are not told that its concentric circles or anything like that so if you don't 
know the properties of the non-linearly separable data just using a radial basis function kernel is is a very popular choice and it also works fairly well in practice right you will not be using a polynomial kernel or any of those kernels an rbf kernel is a much more generic kernel that works well on nonlinearly separable data so okay so we decide okay so we're going to use a hard margin svm no hyper parameters in this primal formulation itself so let's let's dive into rbf kernel so how does rbf kernel look like so if you have two points x1 and x2 the rbf kernel is basically e power minus the distance squared between both these points this is basically your euclidean distance squared divided by 2 sigma squared right again some of you might have learned with this notation where you have gamma equals to 1 by 2 sigma square right so this 1 by 2 sigma square is often read as gamma it is also referred to as scale of the rbf kernel like if you if you are familiar with scikit learn one of the notations they use is they don't use this notation with sigma when you write code in a scalar they use the gamma as the hyper parameter okay so it doesn't matter i'll talk in terms of sigma but the same interpretation can be made in terms of gamma also if you are familiar with this notation in this notation what happens it is minus gamma into the euclidean distance squared so 1 by 2 sigma square is replaced by gamma okay so now in an rbf kernel what are the hyper parameters that we can play with the biggest hyper parameter we have is the sigma here this sigma is a hyper parameter that we can play with right this is a hyper parameter that we can play with to either overfit or under fit right now what happens if this sigma is small so if i reduce the sigma okay i use this arrow mark to symbolize that we are reducing the sigma right if you reduce the sigma what happens intuitively this this rbf kernel if you plot it if you plot it like this your rbf kernel itself becomes narrower and narrower which means only points which are very close together will be treated as similar points because what is a curl at its very core kernel is basically a similarity measuring function right if your sigma is very small then only points which are very close to each other if the distance between x1 and x2 is very small only then they will be treated as similar points and as sigma reduces again we have discussed this in lots of detail in our course videos also as sigma reduces you tend to over fit the model to the given data as sigma increases what happens as sigma increases farther and farther points so if you look at this x1 minus x2 further and further points also because the width of this kernel is more right so for even farther and farther points will be treated as being similar so if if if the sigma is larger then you will under fit the model if the sigma is smaller you will over fit the model so what is that task at hand our task at hand here is how do you over fit an svm given that we have to use a hard margin svm on nonlinearly separable data so in a hard margin svm we don't have this hyper parameter c that we can play with if we don't have this c right so what are the hyper parameters we have it is only the kernel hyper parameters that we have given that we know that the data is non-linearly separable we should immediately come to the conclusion that we want to use a kernel svm and that to an rbf kernel spm and as soon as you arrive at the fact that we have to use the rbf kernel svm you can see that the rbf kernel svm 
has the sigma so to overfit given our requirements we simply use an rbf kernel svm with a small value of sigma to over fit on this problem again these types of interview questions try to test your deeper understanding of the concept they try to understand whether whether you whether you really really understand the hard margin svm do you understand the differences between hard margin and soft marginal stream do you understand about the rbf kernel do you understand about the hyper parameters in the rbf kernel do you understand how to over fit an rbf kernel svm again this question is not hard if you know the basic foundations of svm if you know the basic mathematical formulation of hard and soft margin svm if you know the mathematics underlying rbf kernel this is a fairly straightforward question\n"