**Understanding Linear Regression and Its Applications**
"WEBVTTKind: captionsLanguage: enClassification lets us pick one or the other or some small number of labels for our dataThe problem is that real life doesn't fit into these neat little categoriesWhen we have label data there isn't yes or no or a B or C or some labels?Right, then we have what we call a regression problem. We're actually trying to predict actual outputs, right so given these inputsWhat's the temperature at which something will occur or?Given this movie on a streaming site and the attributes and the people that have watched itWhat amount of action is it right because that informs who should watch that movieThere's lots of times when you don't want to say--but sees this and isn't this you want to say it's a little bit of thisAnd a little bit of thisand that's what regression is for and some of the algorithms we use for regression are actually quite similar toClassify. So for example, you can regress using a support vector machine or support vector of aggressor, right?But we also use other ones like so we're more likely to use things like linear regression and things like thisSo let's start off with perhaps for simplest form of regression. That's linear regression, right?It might not occur to people who use linear regression for actually what you're doing is machine learningBut you are let's imagine we have just data that's got one inputso one attribute attribute one andOur output which is why this is our table of data just like before and this is our instance dataSo we've got one two, three four like thisso what we want to do is we want to input attribute one andWe want to output Y which instead of being a yes or no is going to be some number on a scaleLet's say between Norton one. So really what we're trying to do is we've got our graph here of our input variableAttribute one and we've got our Y output and these are our data points in our training setSo here like this and they sort of go up like thiswhat we're going to do using linear regression is fit a line through this data and a line is of the form yequals MX plus Cso in this case M is going to be the gradient of our line and C is going to be B intercept so in thisCase I guess something along the lines of this straight up like thisSo if our M was one in this case MEquals one or maybe equals one point two to make it slightly more interesting and then our C is going to be let's say CHis naught point naught to these are the values that we're going to learn using linear regressionSo, how do we train something like this?What we're going to do is we want to find the values for our unknowns which are M and CGiven a lot of x and y pairs, right?So we've got our x and y pairs here and we want to predict these values the optimal values for this data setSo we're going to find values for M. And C where this distance the prediction error is minimized the better fitThis line is the average prediction error is going to go down if this line is over hereIt's going to be a huge error. And so the hope is that if we predict this correctly and we have an MAnd we have a C then when we come up with a newValue that we're trying to predict we can pass it through this formula. We can multiply it by 1.2 and then add0.02 and that will produce our prediction for y and hopefully that would be quite close to what it isSo for example, let's imagine. We have a new value for attribute 1. 
So, for example, let's imagine we have a new value for attribute 1. We come in here on the x-axis, look up to the line, and read off the prediction for our Y. That's the output of our regressor. So this linear regression is capable of producing predictions based on its attribute.

Now, if we have more than one attribute, this is called multivariate linear regression, and the principle is exactly the same. We're going to have lots of these multipliers: we could say y = m1·x1 + m2·x2 and so on for all of our different attributes. So it's a linear combination, a bit like PCA, a linear combination of these different attributes, and it's obviously going to be multi-dimensional. One interesting thing about linear regression is that it's going to predict a straight line (or, in higher dimensions, a flat plane) regardless of how many dimensions we've got.

Sometimes, if we want to use this for a classification purpose, we still can. Now, I'm supposed to be talking about regression, not classification, but just briefly, if you'll indulge me: we can pass this function through something called a logistic function, or sigmoid curve, and squash it into an S shape. What we're doing is pushing our values up towards 1 and down towards 0, and that is our classification between 1 and 0. So it is possible to use linear regression with this additional logistic function to perform classification, and this is called logistic regression. I just wanted to mention it, because it's something you will see being done on some data.

So let's talk a little bit about something more powerful: artificial neural networks. Any time in the media at the moment when you see the term AI, what they're actually talking about is machine learning, and usually some large neural network. Now, let's keep it a little bit smaller. Let's imagine what we want to do is take a number of attributes and map them to some prediction, some regressed value. How are we going to do this? Well, what we can do is essentially combine a lot of different linear regressions, through some nonlinear functions, into a really powerful regression algorithm.

So let's imagine that we have some data with three inputs: we've got our instances, and we've got our attributes A, B and C. Our inputs are A, B and C. Then we have some hidden neurons (I'll explain a neuron in a moment), and then we have an output value that we'd like to regress; this is where we're trying to predict the value. You know, how much disease does something have, how hot is it, these kinds of things, depending on our attributes. This is where we put in A, this is where we put in B, and this is where we put in C, and then we perform a weighted sum of all of these things for each of the neurons. So, for example, this neuron is going to have three inputs from these three here: this one has weight one, this one weight two, this one weight three. And we're going to do a weighted sum, just like in linear regression: weight one times A, plus weight two times B, plus weight three times C, add them together, and then add any bias we want, so plus some bias. That gives us a value for this neuron, which we'll call hidden 1, because generally speaking we don't look at these values; they're not too important on their own. We're going to do a different sum for the next one, so I'll draw them in different colors so we don't get confused.
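A single hidden unit is nothing more exotic than that weighted sum. A tiny sketch, with weights invented for illustration:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One hidden unit: a weighted sum of its inputs plus a bias,
    exactly like a single multivariate linear regression."""
    return np.dot(weights, inputs) + bias

# Hypothetical instance with three attributes A, B, C
a, b, c = 0.5, 1.0, 0.2

# Each hidden neuron has its own set of weights and its own bias
hidden1 = neuron([a, b, c], weights=[0.4, -0.3, 0.9], bias=0.1)
hidden2 = neuron([a, b, c], weights=[-0.2, 0.7, 0.5], bias=-0.4)
print(hidden1, hidden2)
```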
So this next neuron has got three weights of its own: a different weight here, another different weight there, and we do this much times A, plus this much times B, plus this much times C, add them all up, add a bias, and we get hidden 2. And we do the same thing with these ones here: hidden 3, hidden 4, hidden 5, and so on, for as many as we like.

Now, the nice thing about this is that each of these can calculate a different weighted sum. The problem is that if we just did this, what we'd actually get is a series of linear regressions, because each of these is just multivariate linear regression, and in the end our algorithm doesn't end up any better: if you combine multiple linear functions together, you just get one different linear function. So we pass all of these hidden values through a nonlinear function, like a sigmoid or tanh. A sigmoid goes between 0 and 1, and tanh, the hyperbolic tangent, goes between -1 and 1. What that does is add a sufficiently complex function that when we combine them all together, we can actually get quite a powerful algorithm.

The way this works is: we put in A, B and C, we calculate all the weighted sums through these functions into our hidden units, and then we calculate another weighted sum of those, added together to be our final output, and that will be our final output prediction, Y. The way we train this is we put in lots and lots of training data, where we have the values for A, B and C and we know what the output should have been. We go through the network, and then we say, well, actually we were a little bit off, so can we change all of these weights so that next time we're a little bit closer to the correct answer? And we keep doing this over and over again, in a process called gradient descent, and slowly settle upon some weights where, for the most part, when we put in our A, B and C, we get what we want out the other side. Now, it's unlikely to be perfect, but just like with the other machine learning we've talked about, we're going to be trying to make our prediction on our testing set as good as possible. So we put in a lot of training data, and hopefully, when we take this network and apply it to some new data, it also performs well.

Let's look at an example. We looked at credit checks in the previous video, where we classified whether or not someone should be given credit. Well, something that we often calculate is a credit rating, which is a value from, let's say, 0 to 1, of how good your credit score is. So A could be how much money you have in your bank, B could be whether you have any loans, and C could be whether you own a car, and obviously there are going to be more attributes than these, because you can't make a decision on those three things alone. So what we do is take a number of people that we've already made decisions about. We know that person A has a bank account balance of five thousand, two thousand in loans, and does own a car, and he has a credit rating of 75, or 0.75, whatever your scale is. So we put this in, we set the weights so that this is the correct prediction, and then hopefully, when another person comes along with a different set of variables, we'll predict the right thing for them. You can make this network as deep or as big as you want.
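To make the whole pipeline concrete, here's a minimal from-scratch sketch of that idea: one hidden layer of sigmoid units, trained by gradient descent on made-up data. It's an illustration of the technique, not the network or data set used in the video:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented data: 200 instances with three attributes (A, B, C) and a
# target that is some nonlinear function of them
X = rng.uniform(0, 1, size=(200, 3))
y = 0.5 * X[:, 0] + 0.3 * np.sin(3 * X[:, 1]) + 0.2 * X[:, 2] ** 2

# One hidden layer of 5 units: weighted sums -> sigmoid -> weighted sum
W1 = rng.normal(0, 1, size=(3, 5)); b1 = np.zeros(5)
W2 = rng.normal(0, 1, size=5);      b2 = 0.0

lr = 0.5
for step in range(2000):
    # Forward pass: attributes in, prediction out
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y_hat = h @ W2 + b2               # output prediction
    err = y_hat - y                   # how far off we were

    # Backward pass: how the squared-error loss changes with each weight
    dW2 = h.T @ err / len(y)
    db2 = err.mean()
    dh = np.outer(err, W2) * h * (1 - h)   # chain rule through the sigmoid
    dW1 = X.T @ dh / len(y)
    db1 = dh.mean(axis=0)

    # Gradient descent: nudge every weight to reduce the error
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final mean squared error:", np.mean(err ** 2))
```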
Typically, multi-layer perceptrons, artificial neural networks like this, won't be very deep: one, two, three hidden layers maybe. But what's been shown in the literature is that if you have a sufficient number of hidden units, you can basically model any function like this, as long as you've got sufficient training data to create it.

So we're going to use Weka again, because Weka has lots of regression algorithms built in, like artificial neural networks and linear regression. Let's open up the data set we're going to use this time. What we've got is a data set on superconductivity. Now, obviously my knowledge of physics is, should we say, average, but a superconductor is a material that, when you get it below a critical temperature, has no resistance, which is very useful for electrical circuits. So this is a data set of the properties of materials and the critical temperature below which each will be a superconductor. Now, I'm sure there are going to be some physicists in the comments who might point out some errors in what I just said, but we'll move on.

So we read in the file. This is quite a big data set: we have a lot of input attributes, and then at the end we have this critical temperature that we're trying to predict. This temperature, if we look at the histogram, goes from 0 to 185. If we look at some of the other attributes, for example this entropy of atomic radius, which I can pretend I know what that is, it goes from 0 to 2.14. Is that good?

What we're going to do is start by using multivariate linear regression to try and predict this critical temperature as a combination of these input features. So I'm going to go to Classify (there's just the one Classify tab, even for regression), we're going to use our same percentage split as before, so 70%, and we're going to use a simple linear regression function for this. Let's go.

So we've trained our linear regression, and what we want to do now is work out whether it's worked or not on our testing set. We've got the values we wanted, Y, and we've got the values that were predicted, Y-hat, and hopefully they're exactly the same. If they were exactly the same, they'd lie on a straight line: we were hoping to get a certain Y, and we got it. Now, of course, this won't actually happen. What will happen is these Y-hats are ever so slightly different from the Y's we were expecting, so you might see a bit of noise around the line.

The way we would normally measure this is with something called mean absolute error, or mean squared error, or root mean squared error, which are all very similar ways to measure the same thing: what is the average distance between what we wanted and what we got? So if we were hoping to get a Y of 0.2 but we actually got a Y of 0.4, then our mistake was that we were 0.2 too high. And so, for every single instance in our test set, we can sum up all of the errors we've got and work out what the average error was. So if we have a hundred instances in our test set, we sum up the errors and divide by a hundred, and that tells us the mean error was a certain amount. But what will sometimes happen is that your predictions will be above or below the target, and so your actual mean error might be zero, because half the time you predicted too high, half the time you predicted too low, and so on average you've got it exactly right.
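You can see that cancellation on invented numbers: the signed errors average to zero even though every prediction is off by 5.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([15.0, 15.0, 35.0, 35.0])  # alternately 5 too high, 5 too low

errors = y_pred - y_true
print("mean error:", errors.mean())                   # 0.0 -- the signs cancel
print("mean absolute error:", np.abs(errors).mean())  # 5.0 -- the spread it hides
```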
Obviously, that's not correct. So what we tend to do is calculate something called mean absolute error: essentially, if a prediction is too low, we just remove the minus sign and call it an error of that amount. So if your mean absolute error is 0.4, what that's saying is that on average you're 0.4 units away, above or below, from where you were hoping to be. It's also quite common to see similar measures like root mean squared error: for every instance, we take our error and square it, we sum them all up, and then, right at the end, we take a square root. Again, this is a very similar measure to mean absolute error; the squaring removes our negative signs for us.

It's also quite common, particularly in fields like biology and medicine, to see something called R squared, or the R squared coefficient. This is essentially the correlation squared: it's a measure of how well, or how tightly, correlated our predictions and our ground truth were. For example, this would be a pretty good correlation, maybe 0.8 or 0.9, if these were our points. If they were absolutely perfect, that would be an R squared of 1; if our points were everywhere, that would be an R squared of 0. What I'm saying is, it's a value between 0 and 1 that tells you how well you predicted: zero means you basically didn't predict anything at all, it was completely random output, and one means you predicted everything exactly correctly. Now, of course, that's unlikely to happen on a test set. What you'll find is you'll hope to get some number somewhere around 0.7 or 0.8, but it will depend on how difficult your problem is to solve. So maybe on a really difficult problem, an R squared of 0.5 is actually pretty good; it's just going to depend on the situation.

So we've got our linear regression trained up. We know that the correlation coefficient is 0.85, and we know that the mean absolute error, for example, is 13 degrees. What we haven't done is visualize it, and sometimes a simple way to do this is just to plot a scatter plot of what we wanted against what we actually got from our predictor. So I'm going to right-click on the linear regression result and say "Visualize classifier errors". That gives us a scatter plot of the expected value against the prediction we actually got.
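The video does all of this through Weka's GUI. For reference, a rough Python equivalent of the same evaluation and scatter plot might look like this; the arrays here are placeholders, not the superconductivity data:

```python
import numpy as np
import matplotlib.pyplot as plt

def evaluate(y_true, y_pred):
    errors = y_pred - y_true
    mae = np.abs(errors).mean()               # mean absolute error
    rmse = np.sqrt((errors ** 2).mean())      # root mean squared error
    r = np.corrcoef(y_true, y_pred)[0, 1]     # correlation coefficient
    return mae, rmse, r ** 2                  # R squared as correlation squared

# Placeholder arrays standing in for the test-set targets and the
# regressor's predictions (in Weka these come from the 70/30 split)
y_true = np.array([4.0, 12.0, 35.0, 60.0, 92.0, 120.0])
y_pred = np.array([-3.0, 20.0, 30.0, 66.0, 85.0, 118.0])

mae, rmse, r2 = evaluate(y_true, y_pred)
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R^2={r2:.2f}")

# The rough equivalent of Weka's "Visualize classifier errors" plot
plt.scatter(y_true, y_pred)
plt.plot([y_true.min(), y_true.max()],
         [y_true.min(), y_true.max()])        # the ideal predicted = actual line
plt.xlabel("actual critical temperature")
plt.ylabel("predicted critical temperature")
plt.show()
```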
Looking at the Weka plot, generally speaking, it's not too bad. Obviously the data set is quite bunched up in some of these areas, which means it's sometimes harder to predict, but we've got a general upward trend, which is exactly what we wanted. You can see that the prediction around zero is not good at all. The x-axis in this instance is the actual critical temperature of that particular substance, and the y-axis is what the linear regression actually predicted. You can see that the range here is from about 0 to about 136 on our actual values, and the predicted values run from about -30, which doesn't really make sense physically, to 131, but they're pretty close.

Most of the ones that cause a problem are the very low values, because you've essentially got lots and lots of substances that have a very small critical temperature on this scale but quite different attributes, and that's been hard to fit a line to. Something more powerful, for example a multi-layer perceptron, an artificial neural network, might do a better job on those kinds of instances. But you can see that there's a general upward slope in this particular scatter plot. The larger X's represent a larger error, so you can see the line we're actually trying to fit down here, with all these small X's, and there are quite a few of them on there. So actually, for a lot of these substances, the prediction, even by linear regression, has been pretty good.

Regression algorithms let us predict real, scalar outputs from our input variables, and this can be really useful in a huge array of different situations where we want to predict something that doesn't fit neatly into a yes-or-no answer or an A/B/C category label. We've looked at linear regression and artificial neural networks, and obviously neural networks get pretty deep these days, but these are a great starting point.

Thanks very much for watching. I hope you enjoyed this series on data analysis, something a little bit different for Computerphile. I wanted to thank my colleague, Dr. Mercedes Torres Torres, for helping me design the content. Please let us know what you liked and what you didn't, let us know in the comments what you'd like to see more of, and we'll see you back again next time.