Median, Percentiles, Quantiles: EDA Lecture 10 @ Applied AI Course

The Concept of Quantiles and Percentiles
======================================

Quantiles are cut points that divide a sorted dataset into groups holding equal shares of the observations. They describe the spread and variability of a dataset and are widely used in statistics and machine learning. This article explains quantiles and percentiles and shows how they apply in real-world scenarios.

The Basics of Quantiles
-----------------------

A quantile is a cut point in the sorted data: a fixed fraction of the observations falls below it. The most common quantiles are:

* **25th percentile** (also known as the first quartile or Q1): The value below which 25% of the data falls.

* **Median** (also known as the 50th percentile or Q2): The median is the middle value of a dataset when it is ordered from smallest to largest. If there is an even number of values, the median is the average of the two middle values.

* **75th percentile** (also known as the third quartile or Q3): The value below which 75% of the data falls. Together, Q1, the median, and Q3 divide the sorted dataset into four equal groups.

The mean (the sum of the values divided by their count) is not a quantile, and unlike the median it need not sit in the middle of the sorted data.
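
The median rule above (middle value for an odd count, average of the two middle values for an even count) can be sketched in a few lines. The helper name `median_sorted` and the sample lists are illustrative assumptions, not part of the lecture; the 100-value list mirrors the lecture's case, where the median is the average of the 50th and 51st sorted values.

```python
import numpy as np

def median_sorted(values):
    """Median via sorting: the middle value, or the mean of the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return float(s[mid])          # odd count: single middle value
    return (s[mid - 1] + s[mid]) / 2  # even count: average the two middle values

odd = [7, 1, 5, 3, 9]          # hypothetical sample with an odd count
even = list(range(1, 101))     # 100 values: 1, 2, ..., 100

print(median_sorted(odd))   # 5.0
print(median_sorted(even))  # 50.5 (average of the 50th and 51st values)
print(np.median(even))      # 50.5, NumPy agrees
```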

The Role of Percentiles
-----------------------

Percentiles are quantiles expressed on a 0 to 100 scale: the p-th percentile is the value below which roughly p% of the data points fall. The most common percentiles are:

* **25th percentile** (also known as Q1): This value marks the end of the lowest quarter of the sorted data; 25% of the data falls below it and 75% above it.

* **50th percentile** (also known as the median or Q2): This value divides the dataset into two equal groups and represents the value below which 50% of the data falls.

* **75th percentile** (Q3): As mentioned earlier, this is the value below which 75% of the data falls, leaving 25% above it. Together with Q1 and the median, it divides the dataset into four equal groups.
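
As a quick check of these definitions, the following sketch computes the three quartiles of an illustrative dataset (the integers 1 to 100, chosen here as an assumption for clarity) and measures what share of the values falls below each one.

```python
import numpy as np

data = np.arange(1, 101)  # illustrative data: 1, 2, ..., 100

# Ask for several percentiles in one call
q1, q2, q3 = np.percentile(data, [25, 50, 75])

for name, q in [("Q1", q1), ("Q2", q2), ("Q3", q3)]:
    share_below = np.mean(data < q)  # fraction of values strictly below the cut point
    print(f"{name} = {q}: {share_below:.0%} of values fall below it")
```

With NumPy's default linear interpolation, the cut points here land between neighboring data values (25.75, 50.5, and 75.25), and the shares below them come out to 25%, 50%, and 75%.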

Using Percentiles with NumPy
-----------------------------

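The NumPy library in Python provides a function called `percentile` that calculates percentiles. Its two main arguments are the array of values and the percentile (or sequence of percentiles, each between 0 and 100) to compute. By default, NumPy interpolates linearly between neighboring sorted values when the requested percentile falls between two data points.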

```python
import numpy as np

# Create an array of values
data = np.array([1, 2, 3, 4, 5])

# Calculate the 25th percentile (Q1)
q1 = np.percentile(data, 25)

# Calculate the median (50th percentile)
median = np.median(data)

# Calculate the 75th percentile (Q3)
q3 = np.percentile(data, 75)

print(q1)      # Output: 2.0
print(median)  # Output: 3.0
print(q3)      # Output: 4.0
```
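
The lecture requests "all the percentiles from 0 to 100 with a gap of 25" in a single call. A minimal sketch of that pattern follows, using `np.arange` to build the list of percentiles; the data values are illustrative assumptions, not the iris measurements used in the lecture.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])  # illustrative values

# np.arange excludes the stop value, so this yields 0, 25, 50, 75 (no 100)
qs = np.arange(0, 100, 25)
print(np.percentile(data, qs))  # [10. 30. 50. 70.]

# The 0th and 100th percentiles are simply the minimum and maximum
print(np.percentile(data, 0) == data.min())    # True
print(np.percentile(data, 100) == data.max())  # True
```

Because `np.arange(0, 100, 25)` stops before 100, the maximum is left out of the result; passing `np.arange(0, 101, 25)` would include it.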

Interpreting Percentiles in Real-World Scenarios
------------------------------------------------

Percentiles can be extremely useful in real-world scenarios where we want to understand the spread and variability of a dataset. One common example is in supply chain management, where a certain percentage of orders may not be delivered within a certain timeframe.

For instance, suppose we manage deliveries at an e-commerce company like Amazon and have, for each order, the time from when it was placed to when it was delivered. We want to know how long most customers wait. The 95th percentile is the delivery time within which 95% of orders arrived; the remaining 5% took longer. If the 95th percentile is 4 days, then 5% of customers waited more than 4 days.

Similarly, the 99th percentile is the delivery time within which 99% of orders arrived. If it is 5.6 days, then 1% of customers waited longer than 5.6 days. Each of these single numbers says a lot about customer satisfaction and points to the slow deliveries that the supply chain team should investigate.
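
The delivery-time reasoning can be tried out on made-up data. The sketch below draws 10,000 synthetic delivery times from a gamma distribution; the distribution, its parameters, and the variable names are assumptions for illustration only, not real shipment data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic delivery times in days for 10,000 shipments (assumed, not real data)
delivery_days = rng.gamma(shape=4.0, scale=0.5, size=10_000)

# Tail percentiles: the time within which 95% / 99% of orders arrived
p95, p99 = np.percentile(delivery_days, [95, 99])

# Share of orders that took longer than each threshold
late_95 = np.mean(delivery_days > p95)
late_99 = np.mean(delivery_days > p99)

print(f"95th percentile: {p95:.2f} days; share slower: {late_95:.1%}")
print(f"99th percentile: {p99:.2f} days; share slower: {late_99:.1%}")
```

By construction, about 5% of the synthetic orders are slower than the 95th percentile and about 1% are slower than the 99th, which are the groups a supply chain team would examine first.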

Real-World Applications
------------------------

The concept of quantiles and percentiles has numerous real-world applications across various fields, including:

* **Finance**: Quantiles are used to understand the distribution of financial returns, which is essential for risk management and portfolio optimization.

* **Machine Learning**: Percentiles of the prediction error (for example, the 90th or 99th percentile of the absolute error) are used to summarize model performance, particularly in regression analysis.

* **Statistics**: Quantiles are used to summarize and describe the shape of a dataset.

Conclusion
----------

In conclusion, quantiles and percentiles provide a powerful way to understand the distribution and spread of data. By calculating these values, we can gain insights into our data's behavior and make informed decisions in various fields. Whether you're working with financial returns, machine learning models, or statistical analysis, understanding quantiles and percentiles is essential for making sense of your data.

Lecture Transcript
------------------

Since we understood what a median is just a while ago, let me introduce you to some related terms called quantiles and percentiles, before we go on to the equivalent of standard deviation, called median absolute deviation. First I'll explain what a percentile is, so that explaining what a quantile is becomes very straightforward.

Let's take an example. Let's assume I have a dataset X with a hundred values, and let's assume it is sorted; I write X subscript s, which basically means it is sorted. So my n equals 100. Given a hundred values, there will be a value at the first index, the second index, the third index, and so on up to the hundredth index. What will be the median of X? Since we already have the sorted values, we take the value at the 50th index and the value at the 51st index, call them x1 and x2, and take the average, or mean, of x1 and x2 as the median of X.

If you think about it, what we have just done is a very simple idea: we have taken the 50th value in a sorted list of 100 values. If you pick the 50th value, it's called the 50th percentile value. This is not percentage, this is percentile. Percentile basically tells you the rank in the sorted list. Since there are 100 objects and I've sorted them, I can think of their positions as ranks, and the value at the 50th position is called the 50th percentile value of X. Let's assume this is an array, not a set, and these are the indices. The value of X_s at the 50th index, whatever that value is depending on the array, is the 50th percentile value, and the 50th percentile value is nothing but your median.

Similarly, what does the 10th percentile value mean? The 10th percentile value is nothing but the 10th value in the sorted array, the one at the 10th index. What does it tell you? It tells you that roughly 10% of points are less than this value and about 90% of points are greater than this value. That's what a percentile is: it tells you what percentage of points are less than this value and what percentage are greater than this value.

We call the 50th percentile the median. The 25th, 50th, and 75th percentiles (and the 100th, of course) are called quantiles; the median is basically the half quantile. Quantiles basically mean you break the data into four regions. The 25th percentile value tells you that 25 percent of all your observations are less than this value and 75 percent are greater than it. The 50th percentile value is nothing but the median, which means 50 percent of values are less than this and 50 percent are more than this, and so on.

So quantile is just a term. If somebody asks what the third quantile is, it is nothing but the 75th percentile; the second quantile is nothing but the median; the first quantile is nothing but the 25th percentile; and the fourth quantile is nothing but the maximum value.

Let's look at some quantiles here. Quantiles can easily be computed: NumPy has a percentile function, and you can say which percentiles you want. Here I'm saying that I want all the percentiles from 0 to 100 with a gap of 25. If you look at the output, what it's giving you for iris setosa is the 0th, 25th, 50th, and 75th percentiles. These are the quantiles. Different libraries implement them differently: you can have 0, 25, 50, and 75, or you can have 25, 50, 75, and 100. The first value is the minimum, which we know is 1; then this is your 25th percentile, this is your 50th percentile, which is nothing but your median, and this is your 75th percentile. The code is very straightforward. The reason we are getting 0, 25, 50, and 75 is that I asked for the percentile values from 0 to 100 with a gap of 25. So of course 0 will be there, then 0 + 25 = 25, 25 + 25 = 50, and 75; 100 won't be there, and it is just the max value.

Similarly, you can also get 90th percentiles; 90th percentiles are very useful. The 90th percentile value for setosa is 1.7, and for virginica and versicolor it is 6.31 and 4.8.

Now you might say, why the heck did you pick the 90th percentile specifically? Let me give you an intuition and connect it with a real-world use case. Let's assume you work at an e-commerce company like Amazon. Of course, I'll keep bringing up Amazon examples because of my personal biases: I love this great company, and I worked there for five years, so I have my biases. Suppose I have an array of all the times it takes for a product to go from the time a person ordered it to the time it's delivered to them. So I have delivery times: it could be, say, one day, one and a half days, two days. Let's assume these are the delivery times of some 10k shipments that Amazon made in a short duration of time.

In problems like this, the 95th percentile and the 99th percentile are important. Let's understand why. Let's say my 95th percentile is four days and my 99th percentile is 5.6 days. These two numbers tell me a lot about customer satisfaction. What this is telling me is that 99 percent of orders were delivered within 5.6 days from the time of placing the order to the time it was delivered to the customer. That is good, because 99 percent of your customers got it within about five and a half days. Of course, there are 1 percent of customers who placed an order and didn't get their shipment within 5.6 days. If you are managing this supply chain, if you're responsible for delivering orders on time, you would say: OK, 1 percent, I need to double down and reduce this number. And if the 95th percentile is 4 days, you know that there are five percent of customers who are still not getting their product within 4 days, and you would want to understand why that's happening.

So a single number like a 95th percentile or a 99th percentile could be extremely important: it's a single number that helps you understand very well whether your customers are happy or not. A lot of times, people use percentiles instead of standard deviations and things like that.