How to Build a Penguin Classification Web App in Python | Streamlit #3

Building a Penguin Prediction Web Application using Streamlit

Importing Libraries and Writing Header

----------------------------------------

The first step in building our penguin prediction web application is to import the necessary libraries. We will use Streamlit for building the web application, pandas for data manipulation, and scikit-learn for machine learning. In addition, we will import NumPy and pickle, the latter for loading the pre-built model from disk.

After importing the libraries, we write the header of the web application. This includes the title of the application and a brief description, written in Markdown, of what the application does.
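A minimal sketch of the imports and header (the exact title and description text are illustrative):

```python
import streamlit as st
import pandas as pd
import numpy as np
import pickle
from sklearn.ensemble import RandomForestClassifier

# Title and description of the web app, written in Markdown
st.write("""
# Penguin Prediction App

This app predicts the **Palmer Penguin** species!
""")
```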

Sidebar Header and CSV File Link

------------------------------------

Next, we write the header of the sidebar along with a link to an example CSV file (penguins_example.csv). In this case, we are using the Palmer Penguins dataset, provided as a cleaned CSV on the Data Professor GitHub repository.
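A sketch of the sidebar header and the example-file link; the URL below points at the Data Professor data repository and is given for illustration:

```python
st.sidebar.header('User Input Features')

# Example CSV that users can download, edit, and re-upload
st.sidebar.markdown("""
[Example CSV input file](https://raw.githubusercontent.com/dataprofessor/data/master/penguins_example.csv)
""")
```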

The Upload Functionality

-------------------------

After writing the headers, we move on to implementing the upload functionality. We add a file uploader to the sidebar that lets users upload a CSV file containing their own input features for a penguin. Sidebar sliders (and select boxes for the categorical features) serve as a manual alternative.

If a file is uploaded, we use its contents as the input features. If not, we fall back to the sidebar widgets. This ensures that the application works with both uploaded files and manual input.
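The uploader itself is a single call. A minimal sketch (the label text is illustrative):

```python
# Optional user-supplied CSV of input features
uploaded_file = st.sidebar.file_uploader(
    "Upload your input CSV file", type=["csv"])
```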

Conditional Logic for Input Features

-----------------------------------------

In this section, we implement conditional logic to handle the two input scenarios. We check whether a file has been uploaded, and based on the result we use either the uploaded file's contents or the sidebar widgets as the input features.
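Here is a sketch of that conditional, with a helper function that collects the sidebar inputs into a single-row DataFrame. The slider ranges and defaults below are plausible values for the Palmer Penguins data, not necessarily those of the original code:

```python
if uploaded_file is not None:
    # Scenario 1: read the user's uploaded CSV
    input_df = pd.read_csv(uploaded_file)
else:
    # Scenario 2: collect inputs from the sidebar widgets
    def user_input_features():
        island = st.sidebar.selectbox('Island', ('Biscoe', 'Dream', 'Torgersen'))
        sex = st.sidebar.selectbox('Sex', ('male', 'female'))
        bill_length_mm = st.sidebar.slider('Bill length (mm)', 32.1, 59.6, 43.9)
        bill_depth_mm = st.sidebar.slider('Bill depth (mm)', 13.1, 21.5, 17.2)
        flipper_length_mm = st.sidebar.slider('Flipper length (mm)', 172.0, 231.0, 201.0)
        body_mass_g = st.sidebar.slider('Body mass (g)', 2700.0, 6300.0, 4207.0)
        data = {'island': island,
                'bill_length_mm': bill_length_mm,
                'bill_depth_mm': bill_depth_mm,
                'flipper_length_mm': flipper_length_mm,
                'body_mass_g': body_mass_g,
                'sex': sex}
        return pd.DataFrame(data, index=[0])

    input_df = user_input_features()
```

Note that both branches store their result in the same input_df variable, which keeps the rest of the code identical for either input path.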

Reading in Data from CSV File

------------------------------

Next, we read in the data from the penguins_cleaned.csv file. This dataset contains features such as body mass, bill length, and sex of the penguin.

We then drop the species column, because that is the column we want to predict. Next, we combine input_df with the entire penguins dataset.
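A sketch of this step, assuming penguins_cleaned.csv sits in the same directory as the app:

```python
# Read the cleaned Palmer Penguins dataset and drop the target column
penguins_raw = pd.read_csv('penguins_cleaned.csv')
penguins = penguins_raw.drop(columns=['species'])

# Stack the user input on top of the full dataset so that the
# encoding step sees every possible category value
df = pd.concat([input_df, penguins], axis=0)
```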

Encoding Code

----------------

The encoding code that we are using expects multiple values in a particular column. For example, the island variable has three possibilities: Biscoe, Dream, and Torgersen. Similarly, the sex column has two possibilities: male and female.

However, the input features describe a single penguin sample, so each of its categorical columns contains only one value. Therefore, we stack this input row on top of the existing penguin dataset before encoding. Instead of the 333 rows of the original dataset, we now have 334 rows: the user's input plus the full dataset. This guarantees that every category level is present when the encoding is performed.

Performing Encoding

--------------------

We perform encoding on the input features by combining them with the existing dataset and applying the pandas get_dummies function to the categorical columns.
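A sketch of the encoding, using pd.get_dummies on the combined data and then keeping only the first row, which is the user's input:

```python
# One-hot encode the two categorical columns. Because the user input
# was concatenated with the full dataset, every category level appears
# and pd.get_dummies produces a consistent set of columns.
encode = ['sex', 'island']
for col in encode:
    dummy = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df, dummy], axis=1)
    del df[col]

# Keep only the first row, i.e. the user's input features
df = df[:1]
```

This turns the island column into island_Biscoe, island_Dream, and island_Torgersen, and the sex column into sex_female and sex_male, giving nine features in total (four quantitative plus five binary).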

Displaying User Input in Streamlit Interface

----------------------------------------------

Next, we display the user's input in the Streamlit interface using a conditional statement. If a file was uploaded, we write out its contents; otherwise we display the sidebar inputs, along with a note informing the user that they may upload a CSV file.
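A sketch of the display logic; the message text is illustrative:

```python
st.subheader('User Input features')

if uploaded_file is not None:
    st.write(df)
else:
    st.write('Awaiting CSV file to be uploaded. '
             'Currently using example input parameters (shown below).')
    st.write(df)
```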

Classification Model Part

---------------------------

In this final section of our application, we implement the classification model using a saved file called penguins_clf.pkl. This pickled random forest model was built beforehand by a separate model-building script, so the app does not need to retrain the model every time an input parameter changes (an improvement over the part 2 iris tutorial, where the model was rebuilt inside the app on every run).
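For reference, here is a sketch of such a model-building script, run once before launching the app. The integer target encoding (Adelie = 0, Chinstrap = 1, Gentoo = 2) is an assumption for illustration:

```python
# penguins-model-building.py -- run once, before launching the app
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('penguins_cleaned.csv')

# One-hot encode the categorical input features
encode = ['sex', 'island']
for col in encode:
    dummy = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df, dummy], axis=1)
    del df[col]

# Encode the target species as integers (assumed mapping)
target_mapper = {'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2}
df['species'] = df['species'].apply(lambda x: target_mapper[x])

# Separate features and target, then fit a random forest
X = df.drop('species', axis=1)
Y = df['species']
clf = RandomForestClassifier()
clf.fit(X, Y)

# Save the trained model for the Streamlit app to load
pickle.dump(clf, open('penguins_clf.pkl', 'wb'))
```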

We read in the saved file and assign it to a variable called load_clf. We then create a prediction variable and use load_clf's predict function to get the predicted value. The input argument is df, the encoded row of input features corresponding to either the uploaded file or the sidebar inputs. Calling predict_proba on the same df gives the prediction probabilities.
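A sketch of loading and applying the model:

```python
# Load the pre-trained random forest classifier
load_clf = pickle.load(open('penguins_clf.pkl', 'rb'))

# Apply the model to the single encoded input row
prediction = load_clf.predict(df)
prediction_proba = load_clf.predict_proba(df)
```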

Writing Predicted Value and Probability

-----------------------------------------

Finally, we write out the predicted species of the penguin in the Streamlit interface. We also display the prediction probabilities, which Streamlit renders as a DataFrame. These can be read as the model's relative confidence: a probability of 0.81 for Gentoo, for example, means the model is roughly 81 percent confident in that prediction.
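A sketch of the output step. Mapping the integer prediction back to a species name assumes the same integer encoding used when the model was built:

```python
st.subheader('Prediction')
penguins_species = np.array(['Adelie', 'Chinstrap', 'Gentoo'])
st.write(penguins_species[prediction])

st.subheader('Prediction Probability')
# Streamlit renders the probability array as a DataFrame
st.write(prediction_proba)
```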

"WEBVTTKind: captionsLanguage: enso in the previous video I've shown you how you can use an alternative to the iris dataset caught the Palmer Penguins data set and so in this video I'm going to show you how you could develop a web application for the Palmer Penguins data set and so without further ado we're starting right now so as briefly mentioned we're going to use this Palmer penguins data set that is provided by this art library called Palmer penguins so for your convenience I'm going to provide that data set on the github of the data professor and so make note that this data set that we're gonna use today was derived from this github library and so I'm going to show you where you could have access to the data that I have already exported out from the Palmer Penguins library package so it's right here in the data professor data repository find penguins cleaned so I have already cleaned the data set so you could also make use of this okay so we're gonna use this in this tutorial ok so let's head to the working directory so I'm gonna provide you the links to all of these code files that we're gonna use in this tutorial ok so before we begin let's have a quick recap and so in the first part of this tutorial series on streamlet I have shown you how you could use data directly from the Y finance library and how we could display a simple line chart in part two I have shown you how you could build a simple classification prediction web app for the iris data set and in this part 3 we're gonna use the palmer penguins data set in order to make a classification web application so let's have a look at this data set okay so it has already been cleaned and here there are a total of so here there are a total of seven columns so we have species Island bill Lang bill depth flipper link body man's and sex and the scroll down so there are a total of three three four rolls so not including the first row which is the heading there are a total of three three three and so we could see that we have already deleted some of the missing values and so it should be noted that the missing values were deleted totally so that is the simplest approach that I'm using and you could feel free to do some imputation at the data set in order to retain more of the data so the missing value that were deleted could be less than 10 and so please feel free to provide the link to your github page where you have applied some unique imputation approach and perhaps I could also include it in the github of the data professor as well so all of you guys all of us can have access to your imputed and cleaned data set and so in the meantime we're gonna use this data set that I have already cleaned and it is called penguins underscore cleaned dot CSV and so in part two where we built a simple iris classification web application you might notice that the code is building the prediction model every time that we load the file in and every time that we make adjustment to the input features it's going to rebuild the model over and over and so as some of you have pointed out this particular flaw of the code I totally agree with you and so the previous version was built like that for the simplicity of the tutorial and so in this tutorial we're gonna use another approach where we could beforehand build a prediction model pickle the object which is to save it into a file and then within the streamlet code we're going to read in the saved file and so the advantage of that is that there is no need to rebuild a model every time that the input 
parameters are changed and so let's have a look at that okay so the code that we're using is called penguins model building dot py and let's take a look at the code let's hit it with atom all right so here so we're gonna use pandas as PD and then we're gonna read into CSV of the Penguins clean CSV here which is also provided in the same directory as you can see here and then we're going to take this penguins data frame take the data and put it into the DF variable and then we're gonna define the target and the encode variable according to the excellent kernel Cagle provided in this link from protec and so kudos to protec for the code that we're using as the basis of this tutorial and so here we're gonna use ordinal feature encoding in order to encode the qualitative features such as species island and sex and so the objective of this tutorial we're going to predict the species of the penguin and so if you would like to predict the sex of the penguin you could replace species with sex and then you could put species in here okay so actually this was the exact parameters used by protec in his cattle kernel but in this tutorial I have modified it a bit by using the species as the target where we're going to predict the species of the penguin and the section island will be used as the input parameters okay so this block of code here will be encoding the sex and Island columns and in this block of code here it's going to encode the target species and in this line of code here we're going to apply this custom function in order to perform the encoding and so in this two lines of code we're going to separate the data set into x and y data matrices in order to use it for model building and scikit-learn right because here X will be the input features and Y will be the species so in X we have six features and in Y we have one feature and so here we're gonna build a random forest model and from SK learn ensembl we're going to import random forest classifier and we're going to assign the random forest classifier to the CLF variable and then we're gonna use the fit function in order to build the model using x and y as the import argument and then finally here we're going to save the model using the pickle library and we're going to use the pickle dump function and input argument we're going to use the CLF which is the model that we have already built and then we're going to open or we're going to create a file called penguins underscore CLF PKL the file and run this code so we could do this right inside the command line so I'm opening up the command prompt heading over to desktop going into the streamlet folder going into the Penguins folder and then I'm going to activate my environment and then we're going to run the code in model building Python so I have to make sure to type in penguins - and then the top function okay there you have it alright so as you can see the file popped up here and the pig code file has been created successfully all right so we're gonna copy this to the previous folder in here okay so let's have a look at the Penguins app here okay so the first five lines we're going to import the necessary libraries so here we're going to import streamlet SST import tandas SPD import numpy as NP import pickle and from SK learned ensemble import random forest classifier okay and in this block of code here is the title in markdown format and a corresponding description of this web application so why don't we have a look side-by-side let's resize the window a bit CD desktop CD streamlet Conda 
activate DP penguins okay scream lip run penguins - app in order to run the application so it's popping up the window here okay so this is the finished application that we're gonna build today and as you can see if we change the input parameters the prediction here will be changed automatically so we could see right and we're also gonna get the corresponding prediction probability as well and so it should be noted here that this web application was much more difficult than using the iris dataset partly because of the issue with the - qualitative features that were using so the thing is with iris dataset if we're using that it's going to contain only the quantitative features so there won't be any ordinal features like sex or Island and so under the hood we have to encode the ordinal features and for example for the island feature here we're going to create three additional columns called island Bisco island dream island Torgerson and for each of this three feature we're going to have binary value 1 or 0 it has the island being Bisco if the island Bisco is having a value of 1 and so if the island visco has a value of 1 for a particular penguin then it will also have corresponding value of 0 for dream and corresponding value of 0 for Torgerson and so the same thing for sex we're gonna create two additional features sex male sex female and for input feature here if the sex is male then the sex male feature will have a value of 1 and the sex female will have a value of 0 so therefore we will create five additional features on top of the four features that we have that are quantitative so that will bring us to a total of nine features so that was a bit complicated but we have already solved the issue for you guys and the code is inside here so let's have a look so here we're using STD sidebar Heather user input feature and so that is the name of this sidebar heading and then we have SC sidebar markdown and so we're going to provide example of the CSV file and the link to the CSV file that is an example is called penguins example and so you can see that we're providing the link to this CSV file in markdown formats and so you could click on it and it will bring you to this data sense so in order to download this you would have to right-click and then Save Link As in order to do that so the reason for having the example CSV input file here is because I have been receiving some comments whether I could include a feature where the user could upload their input file and so in this tutorial I'm going to show you that okay so here we're gonna download the CSV file and notice that is providing the extension of dot txt and we have to make it CSV save it alright so this is the input features now we're going to use and we're going to upload the file and there you go the uploaded file are used as the input features here and a prediction is being made here okay so it predicts this input as a deli and with corresponding prediction probability okay so as you can see there are two possibilities for the input features so the first option is to import the file as CSV format and the second possibility is to directly input the parameters by the slider bar and so under the hood the code will be using one or the other as the input for the predictions to be made and so for doing that we're gonna use the if-else conditional so let's have a look further so in line number 22 here it is using the STD bar file uploader and so it will give us this functionality to allow us to upload files and so the uploaded file will go 
into the variable called uploaded file and so we're gonna make use of conditionals here and so there will be two scenarios we're gonna use if there is an uploaded file and the uploaded file is not empty not none meaning that if the uploaded file is not empty then we should create a input DF variable and then we're going to wheat in the uploaded file or in else then we're gonna run this block of code here and so the block of code here will essentially define a function that will accept the input parameters from the slide bar here okay so as you can see the conditional have two possibilities if there is an uploaded file create a data frame and read in the uploaded file else read in the input parameters directly from the slider bar okay and so notice that the contents of the input will be saved into the same input DF variable so that will be easy for the following blocks of code so previously I've shown you now how you can import the library write the header of the web app shown here and along with the description and then the header of the sidebar along with the link to the example CSV file here and then the upload functionality shown here and then in this block of conditional if-else regarding the input features if there is a file to be uploaded or if not then we're going to use the slider bar as the input okay so now the fun part comes in reading in the data from the Penguins underscore cleaned dot CSV and then we're gonna drop the ax species column because we're going to predict that column and therefore we're gonna drop it here and now we're going to combine the input underscore DF to the entire data set of penguins and so the encoding code that we're using it is expecting that there are multiple values inside the particular column that it wants to encode so for example in the island variable it is expecting that there are three possibilities right the three different island or in the sex column it is expecting that there are two male female okay and so the thing is the input feature that we're using here will be from one penguin sample and so let's say that one penguin sample could have island as Bisco and therefore it will only know of one possibility and so this block of code will not work so therefore we have to integrate this input features on top of the existing penguin data set so before we might have 333 rows then it's three three three plus one so that makes it three three four and then we're gonna perform the encoding and then there will be two possibility for sex 3 possibility for Island ok all right so now we're going to display the user input in the user input features so right here a user input feature is this block of code and we're gonna use conditional again and so the first possibility is if there is an uploaded file writes out the content otherwise write out the content of the slider bar and then also put a text that we are awaiting the CSV file to be uploaded so in this scenario is telling us that there is no input file uploaded and if we upload the input file then this line of code will be disappearing ok so let's proceed further and so this is the classification model part so in the previous part two of the streamlet tutorial on iris dataset we've built a grana forest model right inside the streamlet application but in this tutorial we're just reading in the saved file which I have shown you at the beginning of this video so the pic code object is called penguin CLF pkl so we're gonna read that in and we're gonna assign it the load COF variable and then here on line 
71 we're going to create a prediction variable and then we're going to assign the value of the predicted value and so we're using load underscore COF dot predict function and then input argument is DF + DF is corresponding to the input features either from the uploaded file or from the sidebar okay so that is DF and the online 72 we're also going to use DF as the input argument to the predict roba function and that will provide you the prediction probability okay and so in the block of code of line 75 through 77 it's going to write out the predicted value of the penguin and so here it is predicted to be Gentoo and then on line 79 and 80 is going to predict the probability okay and so the probability values are shown in this data frame here and so here we can see that it is predicted to be Gen 2 and there is a prediction probability of 81 percent so this can be taken as the relative confidence that we have in this prediction so it's kind of like we're 81 percent confident that this prediction is correct however for other prediction the probability could be decreased to 67 so as you can see when we reduce the body mass the probability decreased but if we increase the body mass the probability increased okay so if you're finding value in this video please help me out by smashing the like button subscribing if you haven't yet done so hit on the notification bell in order to be notified of the next video and comment down below and so if you do all of the above it will help the YouTube algorithm know that this content is useful to you so that tutorials like this could be discovered by other YouTube users and so kudos to you you have now built the Penguins prediction web application and as always the best way to learn data science is to do data science please enjoy the journey thank you for watching please like subscribe and share and I'll see you in the next one but in the meantime please check out these videosso in the previous video I've shown you how you can use an alternative to the iris dataset caught the Palmer Penguins data set and so in this video I'm going to show you how you could develop a web application for the Palmer Penguins data set and so without further ado we're starting right now so as briefly mentioned we're going to use this Palmer penguins data set that is provided by this art library called Palmer penguins so for your convenience I'm going to provide that data set on the github of the data professor and so make note that this data set that we're gonna use today was derived from this github library and so I'm going to show you where you could have access to the data that I have already exported out from the Palmer Penguins library package so it's right here in the data professor data repository find penguins cleaned so I have already cleaned the data set so you could also make use of this okay so we're gonna use this in this tutorial ok so let's head to the working directory so I'm gonna provide you the links to all of these code files that we're gonna use in this tutorial ok so before we begin let's have a quick recap and so in the first part of this tutorial series on streamlet I have shown you how you could use data directly from the Y finance library and how we could display a simple line chart in part two I have shown you how you could build a simple classification prediction web app for the iris data set and in this part 3 we're gonna use the palmer penguins data set in order to make a classification web application so let's have a look at this data set okay so it 
has already been cleaned and here there are a total of so here there are a total of seven columns so we have species Island bill Lang bill depth flipper link body man's and sex and the scroll down so there are a total of three three four rolls so not including the first row which is the heading there are a total of three three three and so we could see that we have already deleted some of the missing values and so it should be noted that the missing values were deleted totally so that is the simplest approach that I'm using and you could feel free to do some imputation at the data set in order to retain more of the data so the missing value that were deleted could be less than 10 and so please feel free to provide the link to your github page where you have applied some unique imputation approach and perhaps I could also include it in the github of the data professor as well so all of you guys all of us can have access to your imputed and cleaned data set and so in the meantime we're gonna use this data set that I have already cleaned and it is called penguins underscore cleaned dot CSV and so in part two where we built a simple iris classification web application you might notice that the code is building the prediction model every time that we load the file in and every time that we make adjustment to the input features it's going to rebuild the model over and over and so as some of you have pointed out this particular flaw of the code I totally agree with you and so the previous version was built like that for the simplicity of the tutorial and so in this tutorial we're gonna use another approach where we could beforehand build a prediction model pickle the object which is to save it into a file and then within the streamlet code we're going to read in the saved file and so the advantage of that is that there is no need to rebuild a model every time that the input parameters are changed and so let's have a look at that okay so the code that we're using is called penguins model building dot py and let's take a look at the code let's hit it with atom all right so here so we're gonna use pandas as PD and then we're gonna read into CSV of the Penguins clean CSV here which is also provided in the same directory as you can see here and then we're going to take this penguins data frame take the data and put it into the DF variable and then we're gonna define the target and the encode variable according to the excellent kernel Cagle provided in this link from protec and so kudos to protec for the code that we're using as the basis of this tutorial and so here we're gonna use ordinal feature encoding in order to encode the qualitative features such as species island and sex and so the objective of this tutorial we're going to predict the species of the penguin and so if you would like to predict the sex of the penguin you could replace species with sex and then you could put species in here okay so actually this was the exact parameters used by protec in his cattle kernel but in this tutorial I have modified it a bit by using the species as the target where we're going to predict the species of the penguin and the section island will be used as the input parameters okay so this block of code here will be encoding the sex and Island columns and in this block of code here it's going to encode the target species and in this line of code here we're going to apply this custom function in order to perform the encoding and so in this two lines of code we're going to separate the data set into x and y 
data matrices in order to use it for model building and scikit-learn right because here X will be the input features and Y will be the species so in X we have six features and in Y we have one feature and so here we're gonna build a random forest model and from SK learn ensembl we're going to import random forest classifier and we're going to assign the random forest classifier to the CLF variable and then we're gonna use the fit function in order to build the model using x and y as the import argument and then finally here we're going to save the model using the pickle library and we're going to use the pickle dump function and input argument we're going to use the CLF which is the model that we have already built and then we're going to open or we're going to create a file called penguins underscore CLF PKL the file and run this code so we could do this right inside the command line so I'm opening up the command prompt heading over to desktop going into the streamlet folder going into the Penguins folder and then I'm going to activate my environment and then we're going to run the code in model building Python so I have to make sure to type in penguins - and then the top function okay there you have it alright so as you can see the file popped up here and the pig code file has been created successfully all right so we're gonna copy this to the previous folder in here okay so let's have a look at the Penguins app here okay so the first five lines we're going to import the necessary libraries so here we're going to import streamlet SST import tandas SPD import numpy as NP import pickle and from SK learned ensemble import random forest classifier okay and in this block of code here is the title in markdown format and a corresponding description of this web application so why don't we have a look side-by-side let's resize the window a bit CD desktop CD streamlet Conda activate DP penguins okay scream lip run penguins - app in order to run the application so it's popping up the window here okay so this is the finished application that we're gonna build today and as you can see if we change the input parameters the prediction here will be changed automatically so we could see right and we're also gonna get the corresponding prediction probability as well and so it should be noted here that this web application was much more difficult than using the iris dataset partly because of the issue with the - qualitative features that were using so the thing is with iris dataset if we're using that it's going to contain only the quantitative features so there won't be any ordinal features like sex or Island and so under the hood we have to encode the ordinal features and for example for the island feature here we're going to create three additional columns called island Bisco island dream island Torgerson and for each of this three feature we're going to have binary value 1 or 0 it has the island being Bisco if the island Bisco is having a value of 1 and so if the island visco has a value of 1 for a particular penguin then it will also have corresponding value of 0 for dream and corresponding value of 0 for Torgerson and so the same thing for sex we're gonna create two additional features sex male sex female and for input feature here if the sex is male then the sex male feature will have a value of 1 and the sex female will have a value of 0 so therefore we will create five additional features on top of the four features that we have that are quantitative so that will bring us to a total of nine 
features so that was a bit complicated but we have already solved the issue for you guys and the code is inside here so let's have a look so here we're using STD sidebar Heather user input feature and so that is the name of this sidebar heading and then we have SC sidebar markdown and so we're going to provide example of the CSV file and the link to the CSV file that is an example is called penguins example and so you can see that we're providing the link to this CSV file in markdown formats and so you could click on it and it will bring you to this data sense so in order to download this you would have to right-click and then Save Link As in order to do that so the reason for having the example CSV input file here is because I have been receiving some comments whether I could include a feature where the user could upload their input file and so in this tutorial I'm going to show you that okay so here we're gonna download the CSV file and notice that is providing the extension of dot txt and we have to make it CSV save it alright so this is the input features now we're going to use and we're going to upload the file and there you go the uploaded file are used as the input features here and a prediction is being made here okay so it predicts this input as a deli and with corresponding prediction probability okay so as you can see there are two possibilities for the input features so the first option is to import the file as CSV format and the second possibility is to directly input the parameters by the slider bar and so under the hood the code will be using one or the other as the input for the predictions to be made and so for doing that we're gonna use the if-else conditional so let's have a look further so in line number 22 here it is using the STD bar file uploader and so it will give us this functionality to allow us to upload files and so the uploaded file will go into the variable called uploaded file and so we're gonna make use of conditionals here and so there will be two scenarios we're gonna use if there is an uploaded file and the uploaded file is not empty not none meaning that if the uploaded file is not empty then we should create a input DF variable and then we're going to wheat in the uploaded file or in else then we're gonna run this block of code here and so the block of code here will essentially define a function that will accept the input parameters from the slide bar here okay so as you can see the conditional have two possibilities if there is an uploaded file create a data frame and read in the uploaded file else read in the input parameters directly from the slider bar okay and so notice that the contents of the input will be saved into the same input DF variable so that will be easy for the following blocks of code so previously I've shown you now how you can import the library write the header of the web app shown here and along with the description and then the header of the sidebar along with the link to the example CSV file here and then the upload functionality shown here and then in this block of conditional if-else regarding the input features if there is a file to be uploaded or if not then we're going to use the slider bar as the input okay so now the fun part comes in reading in the data from the Penguins underscore cleaned dot CSV and then we're gonna drop the ax species column because we're going to predict that column and therefore we're gonna drop it here and now we're going to combine the input underscore DF to the entire data set of penguins and 
so the encoding code that we're using it is expecting that there are multiple values inside the particular column that it wants to encode so for example in the island variable it is expecting that there are three possibilities right the three different island or in the sex column it is expecting that there are two male female okay and so the thing is the input feature that we're using here will be from one penguin sample and so let's say that one penguin sample could have island as Bisco and therefore it will only know of one possibility and so this block of code will not work so therefore we have to integrate this input features on top of the existing penguin data set so before we might have 333 rows then it's three three three plus one so that makes it three three four and then we're gonna perform the encoding and then there will be two possibility for sex 3 possibility for Island ok all right so now we're going to display the user input in the user input features so right here a user input feature is this block of code and we're gonna use conditional again and so the first possibility is if there is an uploaded file writes out the content otherwise write out the content of the slider bar and then also put a text that we are awaiting the CSV file to be uploaded so in this scenario is telling us that there is no input file uploaded and if we upload the input file then this line of code will be disappearing ok so let's proceed further and so this is the classification model part so in the previous part two of the streamlet tutorial on iris dataset we've built a grana forest model right inside the streamlet application but in this tutorial we're just reading in the saved file which I have shown you at the beginning of this video so the pic code object is called penguin CLF pkl so we're gonna read that in and we're gonna assign it the load COF variable and then here on line 71 we're going to create a prediction variable and then we're going to assign the value of the predicted value and so we're using load underscore COF dot predict function and then input argument is DF + DF is corresponding to the input features either from the uploaded file or from the sidebar okay so that is DF and the online 72 we're also going to use DF as the input argument to the predict roba function and that will provide you the prediction probability okay and so in the block of code of line 75 through 77 it's going to write out the predicted value of the penguin and so here it is predicted to be Gentoo and then on line 79 and 80 is going to predict the probability okay and so the probability values are shown in this data frame here and so here we can see that it is predicted to be Gen 2 and there is a prediction probability of 81 percent so this can be taken as the relative confidence that we have in this prediction so it's kind of like we're 81 percent confident that this prediction is correct however for other prediction the probability could be decreased to 67 so as you can see when we reduce the body mass the probability decreased but if we increase the body mass the probability increased okay so if you're finding value in this video please help me out by smashing the like button subscribing if you haven't yet done so hit on the notification bell in order to be notified of the next video and comment down below and so if you do all of the above it will help the YouTube algorithm know that this content is useful to you so that tutorials like this could be discovered by other YouTube users and so kudos to you you 
have now built the Penguins prediction web application and as always the best way to learn data science is to do data science please enjoy the journey thank you for watching please like subscribe and share and I'll see you in the next one but in the meantime please check out these videos\n"