R Tutorial - Background on modeling for prediction

Exploratory Data Analysis with the House Sales in King County, USA Dataset

Previously, we were introduced to an example of modeling for explanation: understanding which factors might explain teacher evaluation scores as given by students at the University of Texas at Austin. We now shift our focus to modeling for prediction, using the House Sales in King County, USA dataset available at Kaggle.com. This dataset consists of homes sold near Seattle, Washington, in 2014 and 2015, with features such as size (measured by square feet of living space, where one square foot is approximately 1/10 of a square meter), condition, number of bedrooms, year built, and whether the house had a view of the waterfront.

The dataset is included in the moderndive package, and I preview it using the glimpse function. There are about 21 thousand rows, each representing a house, and 21 variables. Our goal is to predict the sale price of houses based on these features.
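The preview step can be sketched as follows. This is a minimal sketch that assumes the moderndive and dplyr packages are installed and that the data frame is exposed under the name house_prices, as it is in the moderndive package; the call to dim is my own addition to make the row and column counts explicit.

```r
library(moderndive)  # provides the house_prices data frame
library(dplyr)       # provides glimpse()

# Look at the data: one row per house sold, one column per variable
glimpse(house_prices)

# Number of rows (houses) and columns (variables)
dim(house_prices)
```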

Recall the three approaches to Exploratory Data Analysis (EDA): looking at the data, creating visualizations, and computing summary statistics. Since I have just done the first by previewing the data, let's now visualize the outcome variable "price" with a histogram. This will give us a sense of the distribution of our new numerical outcome variable.
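The histogram can be sketched as follows. This is a minimal sketch that assumes the moderndive and ggplot2 packages are installed; the axis labels and the object name price_hist are my own choices, not taken from the tutorial.

```r
library(moderndive)  # provides the house_prices data frame
library(ggplot2)

# Histogram of the outcome variable. geom_histogram() defaults to 30 bins
# and prints a message suggesting that you pick a binwidth yourself.
price_hist <- ggplot(house_prices, aes(x = price)) +
  geom_histogram() +
  labs(x = "house price", y = "count")

price_hist
```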

First, look at the x-axis tick marks. Since e+06 means 10 to the 6th power, or 1 million, we see that the majority of houses sold for less than 2 million dollars. Why, then, does the x-axis stretch so far to the right? Because a very small number of houses have prices closer to 8 million. We say that the variable "price" is right-skewed, as exhibited by the long right tail. This skew makes it difficult to compare prices of the less expensive houses.

This situation is similar to one from the Introduction to the Tidyverse course, where we visualized the country population variable from the Gapminder dataset. We used a scatterplot to visualize the relationship between countries' life expectancies and populations, but because the populations of India and China were so large, it was hard to study the relationship for less populated countries. To remedy this, we rescaled the x-axis to be on a log10 scale. This makes it easier to distinguish points for countries with smaller populations, and horizontal intervals on the x-axis now correspond to multiplicative differences instead of additive ones.
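A small base R sketch of why equal intervals on a log10 scale correspond to multiplicative differences. The population values below are illustrative round numbers, not Gapminder data. In ggplot2, the rescaling itself is typically done by adding scale_x_log10() to a plot.

```r
# Illustrative populations, each 10 times larger than the previous one
pop <- c(1e6, 1e7, 1e8, 1e9)

# On the original scale the gaps between neighbors keep growing
additive_gaps <- diff(pop)
additive_gaps

# On the log10 scale every factor-of-10 step has the same width
log10_gaps <- diff(log10(pop))
log10_gaps
```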

For example, distances between successive vertical white grid lines correspond to multiplicative increases by a factor of 10. I will now apply the same idea by log10-transforming "price" using the mutate function to create a new variable, "log10_price." Let's observe the effect of this transformation by viewing the two variables side by side.

In particular, observe the house in the sixth row, with a price of 1.225 million. Since 10^6 is 1 million, its log10 price is 6.09, just above 6. Contrast this with the other houses, whose "log10_price" is less than 6 because their prices are below 1 million. I can treat "log10_price" as our new outcome variable because log transformations are monotonic, meaning they preserve orderings.
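The transformation and the two claims above can be sketched as follows. This is a minimal sketch that assumes the moderndive and dplyr packages are installed; the rank comparison is my own way of checking that orderings are preserved, not something shown in the tutorial.

```r
library(moderndive)  # provides the house_prices data frame
library(dplyr)

# Create the log10-transformed outcome variable
house_prices <- house_prices %>%
  mutate(log10_price = log10(price))

# View the original and transformed variables side by side
house_prices %>%
  select(price, log10_price) %>%
  head()

# A price of 1.225 million sits just above 6 on the log10 scale
log10_example <- log10(1225000)
round(log10_example, 2)

# Monotonic: ranking houses by price or by log10_price gives the same result
same_ordering <- identical(rank(house_prices$price),
                           rank(house_prices$log10_price))
same_ordering
```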

This means that if a house's original price was lower than another house's price, its log10 price will also be lower. Now, let's take the earlier code to plot a histogram of the original outcome variable and tweak it to plot a histogram of the new "log10_price" transformed outcome variable. Upon observing these distributions, we see that after transformation, the distribution is much less skewed and in this case more symmetric and bell-shaped.
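The tweaked histogram can be sketched as follows. This is a minimal sketch under the same assumptions as before (moderndive, dplyr, and ggplot2 installed); the names house_prices_log and log_price_hist and the axis labels are my own choices.

```r
library(moderndive)  # provides the house_prices data frame
library(dplyr)
library(ggplot2)

# Recreate the transformed outcome so this block runs on its own
house_prices_log <- house_prices %>%
  mutate(log10_price = log10(price))

# Same histogram code as for price, with the x aesthetic swapped
log_price_hist <- ggplot(house_prices_log, aes(x = log10_price)) +
  geom_histogram() +
  labs(x = "log10 house price", y = "count")

log_price_hist
```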

A log transformation will not always produce a symmetric, bell-shaped distribution, but here it allows us to better discriminate between houses at the lower end of the price scale. Now that we have seen a log transformation was warranted for the outcome variable "price," let's examine if the predictor variable "size" (measured by square feet of living space) warrants a similar transformation.

The Relationship Between Log10 Price and Size

The next step is to examine the distribution of "size" and then how "log10_price" varies with it. The material covered here stops before that analysis, so no conclusion about the relationship is drawn yet.
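One way to start on this question is to repeat the histogram for the size variable. This is a sketch only: it assumes size is stored in the column sqft_living of house_prices, as it is in the moderndive package, and the name size_hist and the axis labels are my own. Whether a transformation is warranted is a judgment to make after looking at the plot.

```r
library(moderndive)  # provides the house_prices data frame
library(ggplot2)

# Histogram of the predictor variable size (square feet of living space)
size_hist <- ggplot(house_prices, aes(x = sqft_living)) +
  geom_histogram() +
  labs(x = "size (square feet of living space)", y = "count")

size_hist
```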

To conclude this part of the exploration, we have seen that "price" is right-skewed. We applied a log10 transformation to "price," which led to a more symmetric distribution and allowed us to better compare prices across houses at different price points. Whether a similar transformation is warranted for the predictor variable "size" is the question taken up next.

"WEBVTTKind: captionsLanguage: enpreviously you were introduced to an example of modeling for explanation understanding what factors might explain teacher evaluation scores as given by students at the University of Texas Austin let's now consider an example of modeling for prediction the data set I'll use is the house sales in King County USA available at cackle comm it consists of homes sold near Seattle Washington in 2014 and 2015 I'll predict the sale price of houses based on their features such as size as measured by square feet of living space where one square foot is approximately 1/10 of a square meter condition number of bedrooms year built and whether it had a view of the waterfront just as with the evals data set I've included this data in the modern dive package which I preview using the glimpse function observe there are 21 thousand rows representing houses and 21 variables as before let's perform an exploratory data analysis or EDA of the outcome variable price recall the three approaches to EDA looking at the data visualizations and summary statistics since I've just done the first approach let's now visualize our data just as with the outcome variable score from our explanatory modeling example let's get a sense of the distribution of our new numerical outcome variable price using a histogram first let's look at the x axis tick marks since E Plus 0-6 means 10 to the 6 or 1 million we see that a majority of houses are less than 2 million dollars but why does the x axis stretch so far to the right it's because there are very number of houses with price closer to 8 million you say that the variable price is right skewed as exhibited by the long right tail this skew makes it difficult to compare prices of the less expensive houses recall that you saw something similar in the intro to the tidy verse course when visualizing the variable country population from the Gapminder dataset you visualize the relationship between countries life expectancies and 
populations using a scatterplot similar to this one because the populations of the two green dots corresponding to India and China were so large it was hard to study the relationship for less populated countries to remedy this you rescale the x axis to be on a log 10 scale now you can better distinguish points for countries with smaller populations furthermore horizontal intervals on the x-axis now correspond to multiplicative differences instead of additive ones for example distances between successive vertical white grid lines correspond to multiplicative increases by a factor of 10 now I'll do something similar where I log 10 transformed price using the mutate function to create a new variable log 10 price let's view the effects of this transformation on these two variables observe in particular the house in the sixth row with price 1.2 to 5 million since 10 to the 6 is 1 million it's log 10 price is six point zero 9 contrast this with all other houses with log 10 price less than 6 I'll treat log 10 price as our new outcome variable I can do this since log transformations are monotonic meaning they preserve orderings so if house aides price is lower than house B's then house a z' log10 price will also be lower than house bees log10 price let's take the earlier code to plot a histogram of the original outcome variable copy-paste and tweak the code to plot a histogram of the new log10 transformed outcome variable observe that after the transformation the distribution is much less skewed and in this case more symmetric and bell-shaped although this isn't always necessarily the case you can now better discriminate between houses at the lower end of the price scale now that you've seen a log transformation was warranted for the outcome variable price let's see if the predictor variable size as measured by the square feet of living space warrants a similar transformationpreviously you were introduced to an example of modeling for explanation understanding what factors 
might explain teacher evaluation scores as given by students at the University of Texas Austin let's now consider an example of modeling for prediction the data set I'll use is the house sales in King County USA available at cackle comm it consists of homes sold near Seattle Washington in 2014 and 2015 I'll predict the sale price of houses based on their features such as size as measured by square feet of living space where one square foot is approximately 1/10 of a square meter condition number of bedrooms year built and whether it had a view of the waterfront just as with the evals data set I've included this data in the modern dive package which I preview using the glimpse function observe there are 21 thousand rows representing houses and 21 variables as before let's perform an exploratory data analysis or EDA of the outcome variable price recall the three approaches to EDA looking at the data visualizations and summary statistics since I've just done the first approach let's now visualize our data just as with the outcome variable score from our explanatory modeling example let's get a sense of the distribution of our new numerical outcome variable price using a histogram first let's look at the x axis tick marks since E Plus 0-6 means 10 to the 6 or 1 million we see that a majority of houses are less than 2 million dollars but why does the x axis stretch so far to the right it's because there are very number of houses with price closer to 8 million you say that the variable price is right skewed as exhibited by the long right tail this skew makes it difficult to compare prices of the less expensive houses recall that you saw something similar in the intro to the tidy verse course when visualizing the variable country population from the Gapminder dataset you visualize the relationship between countries life expectancies and populations using a scatterplot similar to this one because the populations of the two green dots corresponding to India and China were 
so large it was hard to study the relationship for less populated countries to remedy this you rescale the x axis to be on a log 10 scale now you can better distinguish points for countries with smaller populations furthermore horizontal intervals on the x-axis now correspond to multiplicative differences instead of additive ones for example distances between successive vertical white grid lines correspond to multiplicative increases by a factor of 10 now I'll do something similar where I log 10 transformed price using the mutate function to create a new variable log 10 price let's view the effects of this transformation on these two variables observe in particular the house in the sixth row with price 1.2 to 5 million since 10 to the 6 is 1 million it's log 10 price is six point zero 9 contrast this with all other houses with log 10 price less than 6 I'll treat log 10 price as our new outcome variable I can do this since log transformations are monotonic meaning they preserve orderings so if house aides price is lower than house B's then house a z' log10 price will also be lower than house bees log10 price let's take the earlier code to plot a histogram of the original outcome variable copy-paste and tweak the code to plot a histogram of the new log10 transformed outcome variable observe that after the transformation the distribution is much less skewed and in this case more symmetric and bell-shaped although this isn't always necessarily the case you can now better discriminate between houses at the lower end of the price scale now that you've seen a log transformation was warranted for the outcome variable price let's see if the predictor variable size as measured by the square feet of living space warrants a similar transformation\n"