Cumulative Distribution Function (CDF) - EDA Lecture 7 @ Applied AI Course

The next very important concept we'll learn about is the cumulative distribution function, or CDF for short. In this section of the course, let's learn what a CDF is, how it is related to the PDF, and why it is useful. Let's look at it visually before we go into the code.

The blue line here is the PDF, the probability density function, and the orange line is the cumulative distribution function for the same data. Let me label the axes: the x-axis is the petal length of setosa flowers, and the y-axis is basically probabilities.

Just to quickly recap, the PDF represents how many points fall in a given range. For example, if you look here, about 20% of the points have their petal length between 1.5 and 1.6. The height represents how many points there typically are in that range. That's what your PDF is.

The CDF is a completely different story. Let's see what a CDF tells us. Take a petal length of 1.6, and just to be clear, this is for setosa flowers. This orange function is our CDF, so let's see what the corresponding value for 1.6 is. It is roughly just above 0.8; let's say it is 0.82. I'm just making approximations here. Let's write what this means in English, because that's how we can understand it better: 82% of setosa flowers have a petal length less than or equal to 1.6. That is what we can read off this plot, and it is extremely useful information. We cannot read that information directly off the PDF.
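
To make that reading concrete, here is a minimal sketch in plain Python. It assumes a small made-up list of petal lengths, not the real Iris measurements, and the helper name ecdf_at is hypothetical. All it does is return the fraction of values less than or equal to x, which is the number the CDF curve shows at x.

```python
# Made-up petal lengths (cm) for illustration only, NOT the real Iris data.
petal_lengths = [1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.6, 1.7, 1.9]


def ecdf_at(values, x):
    """Fraction of values that are <= x (the empirical CDF read at x)."""
    return sum(v <= x for v in values) / len(values)


# Share of flowers with petal length <= 1.6 in this toy sample.
print(ecdf_at(petal_lengths, 1.6))
# At the largest value the CDF reaches 1, because every point is <= it.
print(ecdf_at(petal_lengths, 1.9))
# Below the smallest value the CDF is 0.
print(ecdf_at(petal_lengths, 1.0))
```

With the 50 setosa flowers from the lecture, the same call at 1.6 would give the y value that was read off the orange curve.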
This is your PDF, right? We can't read the same insight off it. The CDF says: for 1.6, the corresponding value on the y-axis is 0.82, which means that 82% of your setosa flowers have a petal length less than or equal to 1.6.

The first thing you'll notice as soon as you see this plot is that it always starts at 0 at the bottom left and ends at 1 at the top right. Let's understand what this means using a similar argument. The value 1 means 100%: 100% of setosa flowers have a petal length less than or equal to 1.9. Similarly, let's take another value. At 1.3 the CDF seems to be around 0.15, which means 15% of setosa flowers have a petal length less than or equal to 1.3. So for any point on the curve, the y-axis tells you what percentage of flowers have a value less than or equal to the corresponding point on the x-axis. That's what your CDF says. Of course, nothing can have a probability below 0 or above 1, because all probabilities lie between 0% and 100%. That's what we can quickly read off a CDF plot.

Now the immediate question is: how do you build a CDF? We understood how to read one, but how do you build it? We already know how to build a PDF: you take a histogram and then smooth it. As promised, you will learn exactly how the smoothing works when we learn about the Gaussian distribution. For now, let's see how a CDF is built.

Take any value, say 1.6 itself. How do I get the corresponding point on the CDF plot? You have 50 setosa flowers and their petal lengths. You ask: how many flowers have a petal length less than or equal to 1.6? Let's assume 41 setosa flowers do, and the total data set is 50 points. You divide 41 by 50, which gives you 0.82 or 82%, and you put that on the y-axis. You keep repeating this for 1.6, 1.65, 1.7, and so on, and you get the whole curve. That's how you construct your CDF: for every value of petal length on the x-axis, you count how many points are less than or equal to it. For example, at 1.3 it looks like around 15% of setosa flowers have a petal length less than or equal to 1.3. We can build it ourselves this way. That's one way of doing it.

The second way: we know that the PDF counts how many flowers there are for each petal length. I'm just plotting the histogram here, which we smooth to get the PDF. If you want the CDF value for 1.6, you can simply sum up this bar plus this bar plus this bar, and so on, up to 1.6, and that gives you this height. These are called cumulative sums. So for any value x, the corresponding value y can be found either with the counting approach I just explained, the 41 out of 50 case, or by adding up the heights of all the probabilities that come before that value, including that value itself. That is a cumulative sum, and with it you can easily compute the y value corresponding to any x-coordinate on the CDF.

For those of you who know calculus and remember area under the curve, integration, and things like that: if you don't recall it, that's okay, we will go over it when we learn calculus in more detail. But if you do, the value of the CDF at a point is the area under the curve of your PDF up to that point. So if you differentiate your CDF, you get your PDF, and if you integrate your PDF, you get your CDF. This is just an additional point; we will learn some of these concepts when we cover calculus and optimization later in the course. If you don't know calculus, you can just use the histogram approach, or simply count how many setosa flowers have a petal length less than or equal to 1.6. Even that tells you how to plot your CDF.

From a code perspective, it's very straightforward. We just saw the code for the PDF: up to here is where I computed my PDF and my bin edges. I'm using the Iris setosa data and computing the PDF on petal length with ten bins. We saw how changing the number of bins changes the plot. Now, to compute the CDF, it's literally one line. In NumPy there is a function called cumsum, which stands for cumulative sum, the same word "cumulative" I used earlier. If you take your histogram and sum up everything up to a value, that's a cumulative sum. When you apply cumsum to the PDF, you get the CDF, and then you simply plot it. That is all the code: one line to compute your CDF from your PDF, and then we plot the PDF and CDF as I've shown you here. Nothing fancy.

So the next question is: why is the CDF useful for our Iris dataset? Let's look at it. Here I have plotted petal length for my three flower types: setosa, versicolor, and virginica. These are the PDFs of my three flower types, and I've also plotted the CDFs. This is the CDF for setosa. Let me just verify which one is versicolor and which is virginica, because I always get confused with that. Going back up to the petal length plot: the second one is versicolor and the third one is virginica. Sorry, bear with me, let's just go with the flow. So these are my three PDFs, of setosa, versicolor, and virginica.

Now comes the interesting part. These are your CDFs, and you can see that all of these plots go from 0 to 1. We said that to distinguish setosa flowers using petal length, I could say: if the petal length is less than or equal to 2, it is setosa. That's perfect, because if you look at the setosa CDF here, 100% of your setosa points have a petal length less than or equal to 2. So you're done: if petal length is less than or equal to 2, then setosa, and you're going to be 100% correct with this.

Now comes the tricky case between versicolor and virginica. Suppose I put a value of 5 as my threshold, because that's where these two PDFs intersect. An interesting thing happens. Let's see the corresponding CDF values at 5. For this CDF, on the y-axis it looks close to 0.95. What about the other one? It looks like 0.10. I keep getting confused about which is versicolor and which is virginica; let's say the first is versicolor and the second is virginica.

Now say I write two if-else conditions. If petal length is greater than 2 and less than or equal to 5, declare it versicolor. If petal length is greater than 5, declare it virginica. For setosa it's very clear: less than or equal to 2. For versicolor and virginica, you might write rules like these. Of course, these rules are not going to be perfect, because the PDFs intersect. But with a rule like this I'm going to be correct for versicolor 95% of the time. How do I get 95%? Look at the CDF for versicolor at the intersection point, which is 5. The value there is roughly 0.95, which means 95% of versicolor flowers have a petal length less than or equal to 5. So 95% of my versicolor flowers will be labeled correctly and 5% will not.

Similarly, what about virginica? I am saying that only if the petal length is greater than 5 is it virginica. But 10% of virginica points are less than or equal to 5. How did I know about the 10%? I looked at the CDF of virginica, and at a value of 5 its y-axis value is 0.10, which is 10%. So with this rule I'm going to make a mistake on virginica 10% of the time and be right 90% of the time, while for versicolor I'm going to be correct 95% of the time and incorrect 5% of the time.

Such information cannot easily be read off your PDFs, but it can be read off your CDFs. So when you plot CDFs while building simple models like these if-else conditions, you can say how accurate a simple model that uses just one feature could be by reading the values off the CDF, which you cannot get from the PDF. That's one of the biggest use cases of the CDF in data analysis and in modeling.
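
As a rough illustration of reading the rule's accuracy off the CDFs, here is a small sketch in plain Python. The petal lengths are made up so that the numbers land near the 0.95 and 0.10 read off the plot; they are not the real Iris measurements, and build_cdf and cdf_at are hypothetical helper names. The lecture's own code applies NumPy's cumulative sum to the histogram; this sketch swaps in itertools.accumulate from the standard library, which plays the same role of a running sum.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate


def build_cdf(values):
    """Return the sorted distinct values and the CDF value at each of them."""
    counts = Counter(values)
    xs = sorted(counts)
    # Running (cumulative) count of points <= each x, then divide by the total.
    running = accumulate(counts[x] for x in xs)
    return xs, [c / len(values) for c in running]


def cdf_at(xs, cdf, x):
    """Read the CDF at x: the fraction of points that are <= x."""
    i = bisect_right(xs, x)  # number of distinct values <= x
    return cdf[i - 1] if i > 0 else 0.0


# Made-up petal lengths (cm) for illustration only, NOT the real Iris data.
versicolor = [3.3, 3.5, 3.7, 3.9, 4.0, 4.0, 4.1, 4.2, 4.3, 4.4,
              4.5, 4.5, 4.6, 4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1]
virginica = [4.8, 5.1, 5.3, 5.5, 5.6, 5.8, 5.9, 6.0, 6.3, 6.7]

# Rule from the lecture: 2 < petal_length <= 5 -> versicolor, > 5 -> virginica.
THRESHOLD = 5.0

vers_xs, vers_cdf = build_cdf(versicolor)
virg_xs, virg_cdf = build_cdf(virginica)

# Versicolor is labeled correctly when its petal length is <= 5
# (every made-up versicolor value here is already > 2).
versicolor_correct = cdf_at(vers_xs, vers_cdf, THRESHOLD)
# Virginica is labeled correctly when its petal length is > 5.
virginica_correct = 1 - cdf_at(virg_xs, virg_cdf, THRESHOLD)

print(f"versicolor labeled correctly: {versicolor_correct:.2f}")
print(f"virginica labeled correctly:  {virginica_correct:.2f}")
```

The point of the sketch is only that each per-class accuracy is a single lookup on that class's CDF at the threshold, so trying a different threshold means changing one number and reading the two values again.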