Machine Learning: The Key to Unlocking Data Insights
Hello world, welcome to Sirajology! Ever wonder how Netflix recommends awesome shows you'd like? Or how Facebook can auto-tag your face? Or how Google's self-driving cars work? Or how Bing can...what's that? You don't care what Bing does? It's okay, nobody does. Anyways, the answer is Machine Learning. Machine Learning is the study of algorithms that learn from examples and experience instead of hard-coded rules.
To understand how machine learning works, let's consider an example. Imagine you want to build an app that can recognize a specific type of flower called Iris. If you decide to code this without machine learning, you'd have to write a bunch of different functions to detect all the different features of an Iris flower. The problem is, there are a bunch of corner cases and there's no way you could account for all of them ahead of time. What if one of the leaves is partially obstructed, or a flower is a color that you didn't account for, or the shape is totally different from what you expected? You can't just code all that up beforehand! Unless you're Jeff Dean, who's not even sure he could do it.
The good news is that machine learning makes this problem super easy, and you don't need heavy math skills to get started. There are four steps involved in the process: collect data, pick a model, train the model, and test the model. We'll basically just feed data to a model and it will start to find patterns in the data for us.
The first step is to get our data. Datasets come in all different kinds of formats (PDFs, TXTs, CSVs, holograms). The format doesn't matter much, because we can parse it in our code to get the relevant details. We'll be using a well-known dataset that contains 150 samples of Iris flowers. Luckily for us, this dataset comes preloaded with scikit-learn, so we can just load it directly. Each sample has a label, one of three types of Iris (setosa, virginica, or versicolor), and a set of features (sepal length, sepal width, petal length, and petal width). Because our data is labeled, the type of learning we're doing is called supervised learning. If we didn't have labels for our data, just features, then it would be called unsupervised learning.
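Here's a minimal sketch of that loading step, assuming scikit-learn is installed. The variable names (iris, X, y) are just illustrative choices, not anything the dataset requires.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset (no download or file parsing needed).
iris = load_iris()

X = iris.data    # features: one row per flower, one column per measurement
y = iris.target  # labels: integer codes that index into iris.target_names

print(X.shape)             # number of samples and number of features
print(iris.feature_names)  # the four measurements, in centimeters
print(iris.target_names)   # the three Iris types

# Peek at the first flower: its measurements and its label.
print(X[0], iris.target_names[y[0]])
```

Every row of X lines up with the entry at the same position in y, which is what makes this a labeled dataset.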
Supervised learning means that we have a target or response variable to predict. In this case, our goal is to classify an Iris flower as one of the three types (setosa, virginica, or versicolor). Unsupervised learning, by contrast, means that there's no target variable and we're trying to find patterns in the data on our own.
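To make the difference concrete, here's a rough sketch that runs both kinds of learning on the same flowers. The two models (LogisticRegression and KMeans) are assumptions picked for illustration; the thing to notice is simply whether the labels y get passed to fit.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: the model sees the features AND the labels.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, y)

# Unsupervised: the model sees only the features and groups similar flowers.
# n_clusters=3 is our own guess at how many groups to look for.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(classifier.predict(X[:5]))  # predicted label codes for five flowers
print(cluster_ids[:5])            # cluster numbers for the same five flowers
```

The cluster numbers are arbitrary: cluster 0 doesn't have to correspond to setosa, because KMeans never saw the names.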
Now that we have our dataset, the next step is to pick the model. To do that, you just have to calculate the multivariate equation for discriminant analysis by squaring the delta of the...just kidding – you literally just paste in a single line of code. The real question is how do you know which of the bajillion machine learning models to use? Well, we're trying to classify a flower as one of three types of Iris, so we know this is a classification problem. Therefore, we'll need to use a classifier.
Ok, that narrows our options, but what type? There are a lot of those too! My gut reaction is to use a deep neural network because it just sounds dope, you know what I mean? But there are countless machine learning models out there, each with its own strengths and weaknesses. The key is to choose the right one for your specific problem.
One popular choice is a classifier that takes in multiple features and outputs a probability distribution over all possible classes. In our case, we're trying to classify an Iris flower into one of three categories (setosa, virginica, or versicolor). This type of model is well-suited for this task because it can take advantage of the complex relationships between the different features.
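As a sketch of what "outputs a probability distribution over all possible classes" looks like in practice, here's one possible classifier. LogisticRegression is an assumption chosen for illustration, since the text doesn't name a specific model, and max_iter=1000 is just there so the solver has room to converge. The fit call is the training step, covered next.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Picking the model really is a single line.
clf = LogisticRegression(max_iter=1000)

# Train on all 150 flowers so we can inspect the model's output.
clf.fit(iris.data, iris.target)

# Ask for class probabilities for the first flower (a setosa).
sample = iris.data[:1]
probs = clf.predict_proba(sample)[0]

# One probability per Iris type; together they sum to 1.
for name, p in zip(iris.target_names, probs):
    print(f"{name}: {p:.3f}")
```

The predicted class is simply the one with the highest probability, which is what clf.predict returns.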
Now that we have a classifier, the next step is to train our model. Since we're using a classifier, we just need to call the fit method on our object to train it. Fit runs our training algorithm: it feeds the training dataset into our model so it can find patterns in our data. Boom, done.
Now, whenever we input a new flower from our testing dataset, it'll automatically be able to classify it as one of the three types of Iris flowers. We can see in the terminal that the accuracy score for classification is pretty high. How easy was that? Just 7 lines of code and now you have your first model trained and ready to recognize Iris flowers! You just made a learning machine.
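Putting the four steps together (collect data, pick a model, train, test), the whole thing might look something like this. It's a sketch under assumptions, not the exact script from the video: the classifier (LogisticRegression), the 70/30 split, and the random_state value are all our own choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: collect data.
iris = load_iris()

# Hold out 30% of the flowers so we can test on samples the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.3, random_state=42, stratify=iris.target,
)

# Step 2: pick a model.
clf = LogisticRegression(max_iter=1000)

# Step 3: train the model on the training set only.
clf.fit(X_train, y_train)

# Step 4: test the model on the held-out flowers.
predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy: {accuracy:.2f}")
```

The score is only meaningful because the test flowers were kept out of training; scoring on the training data would make the model look better than it really is.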
And you can use this same approach to classify other things like cars, dresses, and even Republican candidates, as long as you have labeled examples to train on. Machine learning can be applied to so many different things, from fraud detection to generating paintings like Picasso. The possibilities are endless, and it's up to us to explore them.