R Tutorial - Data normalization

Introduction to Machine Learning Interview Preparation Course

Hello, my name is Rafael and I will be your instructor for this course. Have you ever wondered what a machine learning interview looks like? Well, this course will help you prepare for it by answering 30 questions that will put both your theoretical machine learning knowledge and your coding skills to the test. Let's dive in.

Course Overview

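The first chapter of this course is about data pre-processing techniques. You will learn about data normalization, handling missing data, and detecting outliers. Chapter 2 is devoted to supervised learning, where we will cover important aspects to consider when developing regression or classification models. We will also discuss the three most common strategies for model ensembling. In Chapter 3, you will hone your unsupervised learning skills through questions about clustering and dimensionality reduction. Finally, in Chapter 4, we will delve into model selection and evaluation, including imbalanced classification and hyperparameter tuning. You will wrap up by studying the algorithmic differences between two very popular ensemble models: random forests and gradient boosted trees.

Keep in mind that this course is meant to be more challenging than your average DataCamp course. Make sure to complete the prerequisite courses so you can gain the most out of the topics we will cover. Now that you know what will be covered in the course, let's get started with data normalization.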

Chapter 1: Data Pre-Processing Techniques

Data normalization, also known as feature scaling, is an important step in your data pre-processing pipeline. Although it is not always needed, most of the time it will be beneficial to your machine learning model. Decision trees, for example, can deal quite well with features on dissimilar and disproportionate scales. Sadly, this is not the case for the majority of the machine learning models you will often use, such as support vector machines, k-nearest neighbors, logistic regression, neural networks, or an entire suite of clustering and feature extraction algorithms.
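To make the scale problem concrete, here is a minimal base R sketch of a Euclidean distance between two players. The two players and their numbers are made up for illustration and are not taken from the course data.

```r
# Hypothetical players (illustrative values, not the course data):
# age in years, market value in millions of euros.
player_a <- c(age = 19, value = 10)
player_b <- c(age = 36, value = 80)

sq_diff  <- (player_a - player_b)^2   # squared difference per feature
distance <- sqrt(sum(sq_diff))        # Euclidean distance on the raw features
share    <- sq_diff / sum(sq_diff)    # each feature's share of the squared distance

print(round(share, 3))
```

With these made-up numbers, the value feature accounts for well over 90% of the squared distance, even though the age gap between the two players is large. A distance-based model would therefore mostly ignore age unless the features are rescaled first.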

It is therefore good practice to consider normalizing your data before passing it on to other components in your machine learning pipeline. Min-max scaling is a very common way to normalize the data: it rescales every feature value relative to the feature's minimum and maximum. In the scaled version, the minimum value is mapped to zero, the maximum is mapped to one, and the rest of the values lie in between.
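The min-max mapping described above can be sketched in base R as follows. The helper name `min_max_scale` and the sample ages are assumptions made for illustration; they do not come from the course.

```r
# Min-max scaling: maps min(x) to 0 and max(x) to 1.
# `min_max_scale` is a hypothetical helper name, not part of any package.
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant feature has zero range, so the ratio below is undefined.
    stop("feature is constant; min-max scaling is undefined")
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

ages <- c(18, 22, 25, 30, 38)   # illustrative values only
scaled_ages <- min_max_scale(ages)
print(scaled_ages)
```

The guard for a constant feature is a design choice in this sketch: without it, the division would silently return `NaN` values.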

Another popular method is called standardization, or z-score normalization, which expresses a numerical value as the number of standard deviations it lies from the feature mean. As a result, values below the mean will be mapped to negative units and values above the mean will be mapped to positive units.

Let's take a look at a scatter plot of 150 players' ages and their monetary value in millions of euros. We immediately notice the difference in the feature scales, which will be problematic for many machine learning algorithms when trying to compute distances, for example. To make things worse, there is an erroneous observation, colored in red, which is clearly an outlier.
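The z-score definition given above can be sketched in base R as follows. The helper name `z_score` and the sample ages are assumptions for illustration. Note that R's `sd()` uses the sample standard deviation (denominator n - 1), which is also what base R's `scale()` uses by default.

```r
# Z-score normalization: (value - mean) / standard deviation.
# `z_score` is a hypothetical helper name, not part of any package.
z_score <- function(x) {
  s <- sd(x, na.rm = TRUE)
  if (is.na(s) || s == 0) {
    stop("standard deviation is zero or undefined; cannot standardize")
  }
  (x - mean(x, na.rm = TRUE)) / s
}

ages   <- c(18, 22, 25, 30, 38)   # illustrative values only
z_ages <- z_score(ages)
print(round(z_ages, 2))

# Base R's scale() computes the same thing but returns a one-column matrix.
same_as_scale <- isTRUE(all.equal(z_ages, as.numeric(scale(ages))))
print(same_as_scale)
```

After standardization the feature has mean 0 and standard deviation 1, and every age below the original mean comes out negative.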

Min-max scaling brought both feature scales to the [0, 1] interval. However, the new age values are in the 0 to 0.3 range due to the presence of the outlier, so the y-axis will still dominate when these two features are compared. The z-score method showed more robustness to the outlier, as both features are now roughly in the -1 to 2 range, although their ranges are not identical.
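To experiment with this effect yourself, here is a small base R sketch that applies both methods to a feature containing one erroneous value. The data are made up for illustration (eight observations, not the 150 players in the plot), so the printed ranges will differ from the figures quoted above.

```r
# Illustrative data only (not the course's 150-player dataset):
# seven plausible ages plus one erroneous age of 90.
age   <- c(18, 21, 24, 27, 30, 33, 36, 90)
value <- c(5, 20, 35, 50, 65, 80, 95, 40)   # millions of euros

min_max <- function(x) (x - min(x)) / (max(x) - min(x))
z_score <- function(x) (x - mean(x)) / sd(x)

is_outlier <- age == 90

mm_age <- min_max(age)
z_age  <- z_score(age)

# Range occupied by the non-outlier ages under each method
print(range(mm_age[!is_outlier]))
print(round(range(z_age[!is_outlier]), 2))

# Range of the value feature under each method, for comparison
print(range(min_max(value)))
print(round(range(z_score(value)), 2))
```

With these numbers, min-max scaling squeezes the seven plausible ages into the 0 to 0.25 interval, while the value feature still spans the full 0 to 1 interval. How much better the z-score fares depends on the data, so it is worth comparing the printed ranges on your own dataset rather than relying on this toy example.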

Summary

Min-max normalization ensures that all features will share the exact same scale but does not cope well with outliers. Z-score normalization, on the other hand, is more robust to outliers but produces normalized values on different scales. Time for you to practice.
