Building a Penguin Prediction Web Application using Streamlit
Importing Libraries and Writing Header
----------------------------------------
The first step in building our penguin prediction web application is to import the necessary libraries. We use Streamlit to build the web application, pandas for data manipulation, and scikit-learn for machine learning. We also import matplotlib for displaying plots.
After importing the libraries, we write the header of the web application. This includes writing the title of the application, a brief description of what the application does, and some metadata about the author and date of creation.
Sidebar Header and CSV File Link
------------------------------------
Next, we write the header of the sidebar along with a link to an example CSV file. In this case, we are using the Palmer Penguins dataset.
The Upload Functionality
-------------------------
After writing the headers, we move on to implementing the upload functionality. We also create sidebar sliders that allow users to enter their own feature values for the penguin data. These inputs are used in addition to the existing penguin dataset.
If there is an uploaded file, we display its contents in the user interface. If not, we use the sidebar sliders as the input features. This ensures that the application works with both uploaded files and manual inputs.
Conditional Logic for Input Features
-----------------------------------------
In this section, we implement conditional logic to handle the different sources of input features. We check whether a file has been uploaded, and based on the result, we use either the uploaded file's contents or the sidebar slider values as the input features.
Reading in Data from CSV File
------------------------------
Next, we read in the data from the penguins_cleaned.csv file. This dataset contains various features such as body mass, bill length, and sex of the penguin.
We then drop the species column, because this is the column we want to predict. We combine input_df with the entire dataset of penguins.
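A sketch of this step, using a two-row stand-in DataFrame in place of reading the real penguins_cleaned.csv:

```python
import pandas as pd

# Stand-in for pd.read_csv('penguins_cleaned.csv'): toy rows with the
# same columns as the cleaned dataset
penguins_raw = pd.DataFrame({
    'species': ['Adelie', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe'],
    'bill_length_mm': [39.1, 46.1],
    'bill_depth_mm': [18.7, 13.2],
    'flipper_length_mm': [181.0, 211.0],
    'body_mass_g': [3750.0, 4500.0],
    'sex': ['male', 'female'],
})

# Drop the target column: species is what we want to predict
penguins = penguins_raw.drop(columns=['species'])

# input_df holds the single row of user-supplied features
input_df = pd.DataFrame([{'island': 'Dream', 'bill_length_mm': 43.9,
                          'bill_depth_mm': 17.2, 'flipper_length_mm': 201.0,
                          'body_mass_g': 4207.0, 'sex': 'male'}])

# Stack the user's row on top of the full dataset so the encoding step
# sees every category, not just the one value in the user's row
df = pd.concat([input_df, penguins], axis=0, ignore_index=True)
```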
Encoding Code
----------------
The encoding code that we are using expects multiple values in a particular column. For example, the island variable has three possibilities: Biscoe, Dream, and Torgersen. Similarly, the sex column has two possibilities: male and female.
However, the input features from a single penguin sample only contain one value per column. Therefore, we need to stack this input feature row on top of the existing penguin dataset. This means that instead of the 333 rows in the original dataset, we now have 334 rows, because the user's input is added as one new row.
Performing Encoding
--------------------
We perform one-hot encoding on the input features after combining them with the existing dataset. We use the pandas get_dummies function to achieve this.
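A minimal sketch of the encoding loop with pd.get_dummies, using toy data in place of the combined 334-row frame:

```python
import pandas as pd

# Combined frame: user row first, then the rest of the dataset (toy values)
df = pd.DataFrame({
    'island': ['Dream', 'Torgersen', 'Biscoe'],
    'bill_length_mm': [43.9, 39.1, 46.1],
    'sex': ['male', 'male', 'female'],
})

# One-hot encode each categorical column, then drop the original column
for col in ['sex', 'island']:
    dummy = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df, dummy], axis=1)
    del df[col]

# Keep only the first row: the user's input, now encoded against
# every category present in the full dataset
df = df[:1]
```

Because the full dataset contributed all three islands and both sexes, the single user row ends up with the complete set of dummy columns the model expects.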
Displaying User Input in Streamlit Interface
----------------------------------------------
Next, we display the user's input in the Streamlit interface using a conditional statement. If there is an uploaded file, we write out its contents. Otherwise, we display the slider inputs. We also include a message informing the user that they need to upload a CSV file.
Classification Model Part
---------------------------
In this final section of our application, we implement the classification model using a saved file called penguins_clf.pkl. This file was created beforehand by training a random forest model, following the same approach as the previous tutorial on the iris dataset.
We read in the saved file and assign it to a variable called load_clf. We then create a prediction variable and use the load_clf object's predict function to get the predicted value. The input argument is df, which corresponds to either the uploaded file or the slider inputs.
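Since the saved penguins_clf.pkl is not available here, this sketch trains a tiny random forest on toy features and round-trips it through pickle to illustrate the same load-and-predict pattern:

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the saved model: a tiny random forest on two numeric
# features (bill length, bill depth) and two species labels
X = np.array([[39.1, 18.7], [46.1, 13.2], [38.8, 17.1], [47.5, 14.0]])
y = np.array(['Adelie', 'Gentoo', 'Adelie', 'Gentoo'])
clf = RandomForestClassifier(random_state=0).fit(X, y)

blob = pickle.dumps(clf)        # in the app: pickle.load(open('penguins_clf.pkl', 'rb'))
load_clf = pickle.loads(blob)

# df would be the single encoded input row from the previous step
df = np.array([[44.0, 17.0]])
prediction = load_clf.predict(df)
prediction_proba = load_clf.predict_proba(df)
```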
Writing Predicted Value and Probability
-----------------------------------------
Finally, we write out the predicted penguin species in the Streamlit interface. We also display the prediction probability as a pandas DataFrame.