R Tutorial - The importance of scale

Calculating Distance Between Players on a Soccer Field: A Problem with Different Scales

When calculating the distance between two players on a soccer field, we use two features: x and y. These features are the coordinates of the players, and both are measured in the same manner. Because of this, they are comparable to one another and can be used together to calculate the Euclidean distance between the players.
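As a minimal sketch of that calculation in R, the snippet below computes the Euclidean distance between two players by hand and with the base dist function. The coordinates are hypothetical and chosen only for illustration.

```r
# Hypothetical player coordinates on the field (illustrative values only)
players <- rbind(
  player1 = c(x = 5, y = 4),
  player2 = c(x = 15, y = 10)
)

# Euclidean distance by hand: square root of the summed squared differences
d_manual <- sqrt(sum((players["player1", ] - players["player2", ])^2))

# The same distance using base R's dist(), whose default method is Euclidean
d_dist <- dist(players)

print(d_manual)
print(d_dist)
```

Both values agree because x and y share the same unit, so neither coordinate dominates the result.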

However, what happens when the features aren't measured in the same manner or when the values of these features aren't comparable to one another? To answer this question, let's walk through an example. Imagine you're provided with a dataset that contains the heights and weights for a large number of men in the United States. The height feature is measured in feet, while the weight feature is measured in pounds.

We are interested in calculating the distance between these individuals. Let's start by comparing observations 1 & 2. Both men are the same height, 6 feet, but they differ slightly in weight. In this case, the difference is 2 pounds. If we calculated the Euclidean distance between them, we would get a value of 2.

Now, let's look at observations 1 & 3. In this comparison, the weights are the same, but the heights differ by 2 feet. If we calculate the distance once more, you guessed it - it's also 2. The distances between both pairs are identical. If we saw these three men standing side by side, would you really believe that observation 1 is just as similar to observation 3 as it is to observation 2? Of course not.
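The two comparisons can be reproduced with a small sketch. The 6-foot height and the 2-pound difference come from the example. The baseline weight of 200 pounds is an assumption, and observation 3 is taken to be 2 feet taller, which is the difference that yields a distance of 2.

```r
# Three observations: height in feet, weight in pounds.
# The baseline weight of 200 lb is an assumed value for illustration.
height_weight <- rbind(
  obs1 = c(height = 6, weight = 200),
  obs2 = c(height = 6, weight = 202),  # same height, 2 lb heavier
  obs3 = c(height = 8, weight = 200)   # same weight, 2 ft taller
)

# Pairwise Euclidean distances on the raw, unscaled features
d_raw <- as.matrix(dist(height_weight))

print(d_raw["obs1", "obs2"])
print(d_raw["obs1", "obs3"])
```

On the raw features both pairs sit exactly 2 units apart, even though 2 feet of height is a far larger difference than 2 pounds of weight.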

Then why are the distances the same? This happens because these features are on different scales, meaning they have different averages and different expected variability. While in these comparisons both features vary by a magnitude of 2, we intuitively know that a change of two pounds is very different from a change of two feet. So, how can we adjust these features to calculate a distance that better aligns with our expectations?

To do this, we need to convert our features to be on a similar scale with one another. There are various methods for doing this, but for this course, we will use the method called standardization. This entails updating each measurement of a feature by subtracting the average value of that feature and then dividing by its standard deviation. Doing this across our features places them on a similar scale, where each feature has a mean of 0 and a standard deviation of 1.
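A minimal sketch of standardization by hand, assuming a small hypothetical vector of heights in feet:

```r
# Hypothetical heights in feet (illustrative values only)
heights <- c(5.5, 5.8, 6.0, 6.2, 6.5)

# Standardize: subtract the mean, then divide by the standard deviation.
# Note that sd() computes the sample standard deviation (n - 1 denominator).
heights_std <- (heights - mean(heights)) / sd(heights)

print(mean(heights_std))  # numerically zero, up to floating-point error
print(sd(heights_std))
```

After the transformation, each value expresses how many standard deviations a measurement sits above or below the average, regardless of the original unit.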

Going back to the previous scenario, we can use the mean and standard deviation of the height and weight features to standardize the values for our three observations. Now, if we calculate the Euclidean distances between them, voila - the values make sense. They agree with our intuition: 1 & 3 are much less similar to one another than 1 & 2.
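The effect can be sketched as follows. The source does not state the dataset's means and standard deviations, so the values used here are assumptions picked only to show the mechanics, as is the baseline weight of 200 pounds.

```r
# Three observations: height in feet, weight in pounds (baseline weight assumed)
height_weight <- rbind(
  obs1 = c(height = 6, weight = 200),
  obs2 = c(height = 6, weight = 202),
  obs3 = c(height = 8, weight = 200)
)

# Assumed dataset-wide statistics (hypothetical, not taken from the source)
feature_means <- c(height = 5.8, weight = 190)
feature_sds   <- c(height = 0.3, weight = 30)

# Subtract each column's mean, then divide by each column's standard deviation
centered <- sweep(height_weight, 2, feature_means, "-")
height_weight_std <- sweep(centered, 2, feature_sds, "/")

# Pairwise Euclidean distances on the standardized features
d_std <- as.matrix(dist(height_weight_std))

print(d_std["obs1", "obs2"])
print(d_std["obs1", "obs3"])
```

Under these assumed statistics, the 2-pound gap becomes a small fraction of a standard deviation while the 2-foot gap spans several, so observations 1 & 3 end up much farther apart than 1 & 2.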

In R, we can use the scale function to standardize height and weight to the same scale. If height_weight is our matrix of observations, similar to what we've just seen, then calling scale with its default parameters will normalize each feature column to a mean of 0 and a variance of 1.
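A hedged sketch of scale with its default parameters is shown below. The matrix is named height_weight to follow the narration. Its contents are an assumption: the three example observations (with an assumed baseline weight) followed by simulated rows that stand in for the larger dataset.

```r
set.seed(42)  # make the simulated rows reproducible

# Simulated stand-in for the larger dataset (assumed distributions)
n <- 100
simulated <- cbind(
  rnorm(n, mean = 5.8, sd = 0.3),  # heights in feet
  rnorm(n, mean = 190, sd = 30)    # weights in pounds
)

# Rows 1-3 are the three example observations
height_weight <- rbind(c(6, 200), c(6, 202), c(8, 200), simulated)
colnames(height_weight) <- c("height", "weight")

# Defaults: center = TRUE subtracts column means, scale = TRUE divides by column sds
height_weight_scaled <- scale(height_weight)

print(colMeans(height_weight_scaled))       # numerically zero for both columns
print(apply(height_weight_scaled, 2, sd))   # 1 for both columns

# Distances among the three example observations after scaling
d_scaled <- as.matrix(dist(height_weight_scaled[1:3, ]))
print(d_scaled[1, 2])
print(d_scaled[1, 3])
```

The scaling statistics should come from the full dataset. Scaling only the three example rows would give both columns the same spread and make the two distances equal again.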

Standardizing features that are measured on different scales matters in any analysis that relies on distances, not just the height and weight example above. By understanding how to standardize features, we can calculate distances that better align with our expectations. In the next exercise, you will have a chance to further explore how scales can affect your ability to interpret the distance value.

