How to build machine learning models for imbalanced datasets

The Log Function and Creating Custom Plots with Python
=======================================================

In this article, we will explore NumPy's log function and create custom plots using Python. We will use the matplotlib library to create several types of plots.

Using the Log Function
----------------------

To start, let's use the log function in Python. NumPy's `np.log` function calculates the natural logarithm of a given value. Here's an example:

```python
import numpy as np

# Create 10 logarithmically spaced values from 1 (10^0) to 100 (10^2)
x = np.logspace(0, 2, 10)

# Calculate the natural logarithm of each value
y = np.log(x)
```

Here, `np.logspace(0, 2, 10)` creates an array of 10 logarithmically spaced values between 10^0 = 1 and 10^2 = 100, and `np.log` then returns the natural logarithm of each value.
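
As a quick sanity check, here is a minimal sketch (assuming only NumPy and matplotlib) that prints a couple of the values and plots the curve; on a log-scaled x-axis the natural-log curve appears as a straight line:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.logspace(0, 2, 10)  # 10 values from 1 to 100
y = np.log(x)              # natural logarithm of each value

print(x[0], y[0])    # 1.0 0.0, since ln(1) = 0
print(x[-1], y[-1])  # 100.0 4.605..., since ln(100) = 2 ln(10)

# On a log-scaled x-axis the natural-log curve becomes a straight line
plt.semilogx(x, y, marker='o')
plt.xlabel('x')
plt.ylabel('ln(x)')
plt.show()
```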

Creating Custom Plots with Python
---------------------------------

Now that we have used the log function, let's create a custom plot. We will use the `bar` function to create a plot with two sets of bars.

```python
import matplotlib.pyplot as plt
import numpy as np

# x positions for three groups of bars: 0, 1, 2
x = np.arange(3)

# Heights for the two sets of bars
y1 = np.array([0.2, 0.4, 0.6])
y2 = np.array([0.7, 0.5, 0.3])

# Offset the x positions by half the bar width so the two sets
# sit side by side instead of overlapping
plt.bar(x - 0.2, y1, width=0.4, color='red', alpha=0.8, label='Series 1')
plt.bar(x + 0.2, y2, width=0.4, color='green', alpha=0.8, label='Series 2')

# Set the title, axis labels, and legend
plt.title('Custom Bar Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()

# Display the plot
plt.show()
```

In this example, we create a grouped bar plot using the `bar` function. We offset the x positions of the two series by half the bar width so the bars sit side by side instead of overlapping, and we pass `label` values so that `plt.legend()` can identify each series. The `alpha` parameter sets the transparency of the fill color.
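
If you prefer stacked bars over grouped ones, matplotlib's `bottom` parameter handles that; a minimal sketch reusing the same arrays:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(3)
y1 = np.array([0.2, 0.4, 0.6])
y2 = np.array([0.7, 0.5, 0.3])

# bottom= starts the second series where the first one ends
plt.bar(x, y1, color='red', alpha=0.8, label='Series 1')
plt.bar(x, y2, bottom=y1, color='green', alpha=0.8, label='Series 2')
plt.legend()
plt.show()
```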

Creating a Polar Plot with Python
---------------------------------

Now that we have created a custom bar plot, let's create a polar plot using Python. We will use the `polar` function to draw multiple traces on polar axes.

```python
import matplotlib.pyplot as plt
import numpy as np

# Angles from 0 to 2*pi
theta = np.linspace(0, 2 * np.pi, 100)

# Radii for the two traces; each array must match theta in length
r1 = 1 + 0.5 * np.sin(3 * theta)
r2 = 0.8 + 0.2 * np.cos(5 * theta)

# Draw two traces on the same polar axes
plt.polar(theta, r1, color='red', alpha=0.8)
plt.polar(theta, r2, color='green', alpha=0.8)

# Set the title and labels for the plot
plt.title('Custom Polar Plot')
plt.xlabel('Theta')
plt.ylabel('R')

# Display the plot
plt.show()
```

In this example, we pass arrays of angles (`theta`) and radii to the `polar` function; the two arrays must have the same length. We draw two traces in different colors, and the `alpha` parameter sets the transparency of each line.
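
If you want wedge-shaped slices rather than line traces, the `bar` function also works on a polar projection; a minimal sketch (the radii values are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

# Three slices spread evenly around the circle
n = 3
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
radii = np.array([1.0, 1.2, 1.5])  # illustrative slice heights

# bar() on a polar projection draws wedges instead of rectangles
ax = plt.subplot(111, projection='polar')
ax.bar(theta, radii, width=2 * np.pi / n, bottom=0.0,
       color=['red', 'green', 'cyan'], edgecolor='black', alpha=0.8)

plt.title('Polar Slices')
plt.show()
```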

Creating Custom Plots with a Function
-------------------------------------

Now that we have created custom plots, let's wrap the plotting code in reusable functions. We will use the `polar` function inside one function to create a polar plot, and the `bar` function inside another to create a bar plot.

```python
import matplotlib.pyplot as plt
import numpy as np

def make_polar_plot(data, label):
    # One angle per value, spread evenly around the circle
    theta = np.linspace(0, 2 * np.pi, len(data), endpoint=False)

    # Repeat the first point so the trace closes into a loop
    theta = np.append(theta, theta[0])
    values = np.append(data, data[0])

    # Draw the trace on polar axes and set the title
    plt.polar(theta, values, color='red', alpha=0.8)
    plt.title(label)
    plt.show()

def make_bar_plot(data, label):
    # x positions, one per value in data
    x = np.arange(len(data))

    # Draw the bars
    plt.bar(x, data, color='red', alpha=0.8)

    # Set the title and labels, then display the plot
    plt.title(label)
    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.show()

# Create multiple plots using the functions
make_polar_plot(np.array([1, 1.2, 1.5]), 'Custom Polar Plot')
make_bar_plot(np.array([0.2, 0.4, 0.6]), 'Custom Bar Plot')
```

In this example, the `make_polar_plot` and `make_bar_plot` functions each accept an array of values and a plot title, so the same code can generate any number of polar and bar plots from different inputs.
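
To see why the function form pays off, here is a small illustrative driver (the result values are made up for the example) that renders several datasets in a loop, continuing from the definitions above:

```python
import numpy as np

# Hypothetical result sets; the values are made up for the example
results = {
    'Run A': np.array([0.95, 0.75, 0.72]),
    'Run B': np.array([0.97, 0.90, 0.76]),
}

for name, values in results.items():
    make_polar_plot(values, f'Polar: {name}')
    make_bar_plot(values, f'Bar: {name}')
```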

Conclusion
----------

In this article, we explored NumPy's `np.log` function and created custom plots using Python. We used the `polar` and `bar` functions to create polar and bar plots, and we wrapped the plotting code in reusable functions so that multiple plots can be generated from different inputs.

"WEBVTTKind: captionsLanguage: enin a prior video i've talked about how you can handle imbalanced data set in python using the imbalance learn library and in this video we're going to take it a step further we're going to take the balanced data set and then we're also going to use that to build a machine learning model and then we're also going to implement another feature of scikit-learn that will allow you to use the class weight instead of performing data balancing and the class weight will essentially assign a weight value that will be inversely proportional to the frequency that will be inversely proportional to the number of samples in each of the majority and minority class and so let's dive in let's fire up the google code lab or the jupyter notebook and i'll provide you this in the video description so check out the link and then you could open this up and then the first thing that we want to do here is we're going to install the imbalance learn library so go ahead and run the cell and this should take you a short moment and apparently it is already installed and now let's read in the data set and the data set that we're using today is the hepatitis c virus classification data set and so this data set was already published in one of the research article in our research group and so we're going to use pandas by importing pandas as pd and then we're going to use the pd.read csv function to read in the data and assign it to the df variable let's run the cell and then you're going to notice here that we have 578 rows and 882 columns and then the last column will be the y variable which is the activity therefore we're going to split it into the x and the y variables so the y variable will be the last column and the remaining columns here will be the x variable so we do this by selecting the activity column and then assigning it to y and then we're going to drop only the activity column and then assigning it to x whereby we're assigning axis equals to one where one will signify that we're working with columns because if absence is equal to zero it means that we're working with the rows all right and so let's go ahead and take a look at the class distribution so the class distribution will allow us to look at the number of actives and the number of inactives for the activity of the compound data set so active will mean that the compound or molecule will have favorable activity whereas inactive will mean that the compound has unfavorable activity and so let's run the cell here in order to perform a value count oh but then we have to first run this cell here in order to perform the data splits and now let's perform the value count on the y variable so now we're going to perform the value count whereby we're going to count the number of actives and the number of inactives and here we can see that there are 412 actives and 166 inactive compounds and now we're going to translate this into a pie plot and so let's run it so note here that we're using the inbuilt function of pandas in order to perform the pi plot making and if you're going to use matplotlib directly you could do this as well so you're going to import matplotlib dot pi plot as plt and then you're going to assign fig 1 xs1 into the plt.subplot and then we're going to make the plot okay so let's run it and roughly you're going to get this similar image okay and so the percent.2f here will give us the text in two digits and then we're labeling actives and inactives over here and the label comes from the activity count dot index 
Now we perform the data split before we perform the data balancing. We take this imbalanced dataset, split it using a ratio of 80/20, use the training set to perform the data balancing, use the balanced dataset to build a model, and then apply the model to make predictions on the test set that we left out at the beginning. So let's do it. We have now split the data into X_train, X_test, y_train, and y_test. Looking at the shape, or dimensions, of the data, the training set has 462 rows: 80% of the original 578 rows is 462, and the remaining 20% is the test set, with 116 compounds.

Let's make a pie chart of that; you can see the imbalance is still present. Looking at the value counts, we have 330 and 132, so the majority class, the active compounds, is roughly two and a half times larger than the minority class, the inactive compounds, with 132.

Here we perform two types of class balancing using the imbalanced-learn library. The first type is random undersampling, whereby the majority class is reduced so that it contains the same number of compounds as the minority class; for an in-depth explanation of this I recommend checking out my earlier video, and I'll provide the link to it. Run the cell and you can see it performs the data balancing: we now have an equal number of compounds in the active and inactive classes, 50/50, and the actual counts are 132 and 132. So the majority class, Active, had 330 compounds and was reduced to 132. For oversampling we do the opposite: we increase the minority class from 132 to 330, so that both classes end up with 330 compounds.
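
A sketch of the split and both resampling steps, assuming the imbalanced-learn API described here (the `random_state` values are arbitrary):

```python
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# 80/20 train/test split of the imbalanced data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Undersampling: shrink the majority class to the minority size (330 -> 132)
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)

# Oversampling: grow the minority class to the majority size (132 -> 330)
ros = RandomOverSampler(random_state=42)
X_over, y_over = ros.fit_resample(X_train, y_train)

print(y_under.value_counts())
print(y_over.value_counts())
```
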
So let's perform the oversampling. Both classes should be equal now; let's have a look at the value counts. Hmm, did I do something wrong? Oh, I have to change this to X_train; I forgot the _train suffix, so where it was X and y it should be X_train and y_train. Let's run it again. Both classes are equal now, and as expected the inactive compounds were increased to 330, the same number as the majority class. So now we have a balanced dataset from the oversampling as well.

Next we perform the model building using the various balanced or imbalanced datasets. We build four models: the first without class balancing, the second on the class balanced by random undersampling, the third on the oversampled data, and the fourth using the class weight option in scikit-learn, where we set class_weight='balanced' inside the random forest classifier.

Let's look at the code for model building. This block of code performs the random forest model building: we assign the RandomForestClassifier to the model variable, then train it with model.fit, using X_train and y_train as the input data. Note that for model 1 this is the imbalanced dataset; model 2 uses the balanced dataset from undersampling, model 3 uses the oversampled data, and model 4 also uses the imbalanced dataset but with class_weight='balanced'.
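
A sketch of the four model variants, assuming `X_under`/`y_under` and `X_over`/`y_over` from the previous sketch (the `random_state` value is arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier

# Model 1: no class balancing (imbalanced training data)
model_1 = RandomForestClassifier(random_state=42)
model_1.fit(X_train, y_train)

# Model 2: balanced by random undersampling
model_2 = RandomForestClassifier(random_state=42)
model_2.fit(X_under, y_under)

# Model 3: balanced by random oversampling
model_3 = RandomForestClassifier(random_state=42)
model_3.fit(X_over, y_over)

# Model 4: imbalanced data, but inversely weighted classes
model_4 = RandomForestClassifier(class_weight='balanced', random_state=42)
model_4.fit(X_train, y_train)
```
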
After the models are trained, we build the cross-validation models as well. We instantiate the RandomForestClassifier again, also with a fixed random_state, and assign it to the model_cv variable. On the next line we specify a custom scoring function, using make_scorer with the Matthews correlation coefficient (MCC) as the input argument. The reason for using MCC is that it's a great metric for imbalanced datasets: it considers all of the true positives, true negatives, false positives, and false negatives (TP, TN, FP, FN), whereas metrics such as accuracy, sensitivity, or specificity look only at the positive or the negative class, or at the number of correct predictions versus the total. Because the Matthews correlation coefficient considers all of the terms in one equation, it is quite robust even on an imbalanced dataset. We pass this scorer to the scoring option and perform five-fold cross-validation; as you can see, since we're not performing any class balancing for model 1, we're using the imbalanced dataset, and we use the cross_validate function to build the cross-validation model.

After both models have been built, with model.fit and with the cross_validate function, we calculate the model's performance using the Matthews correlation coefficient. We make predictions using model.predict, with X_train and X_test as input arguments, and assign the results to y_train_pred and y_test_pred; these are the prediction variables, so we put in X and get out a predicted y. We then calculate the MCC, with y_train and y_train_pred, and y_test and y_test_pred, as input arguments. Because we already specified MCC as the scoring function for cross_validate, we can simply average the cross-validation results: with five folds we have a list of five values and we take their mean. Finally, we combine the mcc_train, mcc_cv, and mcc_test variables into a summary data frame.
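
A sketch of the evaluation step for model 1, reusing `model_1` and the training split from the previous sketches:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_validate

# Custom MCC scorer for the cross-validation
mcc_scoring = make_scorer(matthews_corrcoef)

# Fresh estimator for five-fold cross-validation on the training data
model_cv = RandomForestClassifier(random_state=42)
cv_results = cross_validate(model_cv, X_train, y_train,
                            cv=5, scoring=mcc_scoring)
mcc_cv = cv_results['test_score'].mean()  # mean of the five fold scores

# MCC on the training set and the held-out test set (model_1 from above)
y_train_pred = model_1.predict(X_train)
y_test_pred = model_1.predict(X_test)
mcc_train = matthews_corrcoef(y_train, y_train_pred)
mcc_test = matthews_corrcoef(y_test, y_test_pred)

# Summary data frame: set names in one column, MCC values in the other
summary = pd.DataFrame({'Set': ['Train', 'CV', 'Test'],
                        'MCC': [mcc_train, mcc_cv, mcc_test]})
print(summary)
```
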
So let's build the model. The summary data frame has the set names (train, CV, test) in the first column and the corresponding MCC values in the second column. Now let's perform the same model building for the class balanced by random undersampling; the explanation is the same as before, with the training set swapped for the undersampled data. Here you can see that the performance deteriorated a bit: from 0.75 and 0.72 for CV and test, it decreased to 0.68 and 0.71. Let's see how the oversampled dataset performs: the MCC for CV increased, and the MCC for the test set increased slightly to 0.76, where originally it was 0.72. Now let's look at the class weight model; note that we use class_weight='balanced' in the random forest classifier for both the training model and the cross-validation model. Here the test set deteriorated a bit and the CV increased slightly.

So without any form of data balancing the model performance was reasonably good, and among the balanced datasets the oversampled dataset appears to provide the best performance. Let's look at the summary all together: instead of scrolling up and down between various tables, we create one summary table, one table to rule them all. As mentioned, the dataset balanced by oversampling provided the best performance, because the CV increased significantly from 0.75 to 0.9 and the test set also increased from 0.72 to 0.76.

We can also convert this data frame into LaTeX format. We copy the output from a prior run and paste it into Overleaf, into a template for a manuscript I'll be working on soon. Recompile it and you can see the table looks pretty good; all of the table formatting was performed right inside pandas. Let's close that window and proceed further.
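
A sketch of the LaTeX export, reusing the `summary` frame from the sketch above (the `float_format` and the file name are illustrative choices):

```python
# Assuming `summary` is the data frame from the evaluation sketch above
latex_table = summary.to_latex(index=False, float_format='%.2f')
print(latex_table)  # paste this output into an Overleaf document

# Or write it straight to a .tex file
with open('summary_table.tex', 'w') as f:
    f.write(latex_table)
```
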
Here we make a polar plot to visualize the results, starting with the no-class-balancing data. Run it and have a look at the polar plot we're creating. Each slice corresponds to a dataset: the training set in red, the cross-validation set in green, and the test set in cyan. We make use of NumPy and matplotlib, and we select the first row of the summary table using the iloc function. We then take the length of the data, which is 3, so n = 3 and we get three slices. We set up a single polar subplot, and the bar function draws the plot, taking theta and radii as input arguments. The radii come from the MCC values: training has a value of 1, so its slice has a height of 1, while CV and test have heights of roughly 0.75 and 0.72, approaching the 0.8 ring. Each circular ring is a radial tick: 0.2, 0.4, 0.6, 0.8, and 1.0. We set the custom colors to red, green, and cyan; the color scheme was inspired by ggplot2. A for loop then sets the edge color of each slice to black and sets the alpha transparency of the fill color. We set it to 0.7 here; make it 0.3 and the colors become lighter, make it 1 and they're fully opaque, so let's make it 0.8 like before. We also specify the legend; if you don't want it, you can take that line out and the legend box will be gone. We set the title, which appears at the top of the plot, and we use tight_layout so that when we save the figure as a PDF the white space is kept to a minimum. So this is the polar plot of the no-class-balancing data.

In the following code cell we take the same code and convert it into a function. We create a custom function called make_polar_plot that accepts two input arguments, the data and the plot label, where the plot label is the title at the top, such as "No class balancing". Since I'm going to create the same plot four times, it's better to write a custom function where the only things that change are the input data and the plot label. To use it, we call make_polar_plot with the data input: for no class balancing it's the first row of the summary table, for undersampling the second row, for oversampling the third, and for class weights the fourth, each with the corresponding title. Let's run all four. For no balancing you can see the relative performance of each slice; with undersampling they're roughly the same, nothing noticeably different; with oversampling the green slice, the cross-validation set, increased significantly and the test set also increased, approaching the 0.8 ring; and for the class weights it's roughly the same.
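
For reference, a reconstruction of the make_polar_plot function as described, not the notebook's exact code; the MCC triples below are the values quoted in the video, with the training values assumed to be 1.0:

```python
import matplotlib.pyplot as plt
import numpy as np

def make_polar_plot(data, label):
    # data holds the three MCC values (train, CV, test); label is the title
    radii = np.asarray(data, dtype=float)
    n = len(radii)
    theta = np.linspace(0.0, 2 * np.pi, n, endpoint=False)

    # One wedge-shaped slice per MCC value on a polar projection
    ax = plt.subplot(111, projection='polar')
    bars = ax.bar(theta, radii, width=2 * np.pi / n, bottom=0.0)

    # Black edges and translucent fills, colored red/green/cyan
    for bar, color in zip(bars, ['red', 'green', 'cyan']):
        bar.set_facecolor(color)
        bar.set_edgecolor('black')
        bar.set_alpha(0.8)

    plt.title(label)
    plt.tight_layout()
    plt.show()

# Values quoted in the video; training MCCs assumed to be 1.0
make_polar_plot([1.0, 0.75, 0.72], 'No class balancing')
make_polar_plot([1.0, 0.68, 0.71], 'Class balancing (undersampling)')
make_polar_plot([1.0, 0.90, 0.76], 'Class balancing (oversampling)')
# The fourth call would use the class-weight row of the summary table
```
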
Finally, we make a zip file of everything, zip it up, and download it to our computer. A moment earlier I created a combined version of the plots, but using another dataset, so you'll see that it's quite different from the one we made here. The benefit is that we can compare all of the models at a glance, so we're aware of the model performance for each of the training, CV, and test sets and which one performs better; for that dataset the oversampling also performed better.

If you're finding value in this video, please support the channel by smashing the like button, subscribing if you haven't already, and hitting the notification bell so that you'll be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey.