R Tutorial - Multiple Linear Regression

Using Multiple Linear Regression to Avoid Omitted Variable Bias

In this video, you will learn how to use multiple linear regression to avoid a common threat to the accuracy of simple linear regression: omitted variable bias. This occurs when a variable not included in the regression is correlated with both the explanatory variable and the response variable.

For example, imagine we are looking at the relationship between study time before an exam and the exam score achieved. If we just consider these two variables, we find a negative relationship: the more a person studies, the lower their exam score. Strange, isn't it? The explanation is an omitted variable: IQ is positively related to exam success and negatively related to study time. If we use study time as the only explanatory variable, the effect of IQ gets absorbed into the study-time coefficient and biases it downward.
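To see this bias in action, here is a minimal simulation in R (the numbers and variable names are invented purely for illustration):

```r
set.seed(1)
n <- 1000
iq <- rnorm(n, mean = 100, sd = 15)
# Higher-IQ students study less ...
study_time <- 20 - 0.1 * iq + rnorm(n, sd = 2)
# ... but both study time and IQ raise the exam score
exam_score <- 0.5 * study_time + 0.8 * iq + rnorm(n, sd = 5)

# Simple regression: the study-time coefficient comes out negative (biased)
coef(lm(exam_score ~ study_time))
# Multiple regression: controlling for IQ recovers the positive effect
coef(lm(exam_score ~ study_time + iq))
```

Although the true effect of study time is positive (0.5 in the simulation), the simple regression reports a negative coefficient because study time is standing in for the omitted IQ variable.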

To address this issue, we include IQ in the regression model as well. Multiple linear regression then lets us estimate the positive effect of study time on exam success while controlling for the influence of IQ. Returning to our data set, we apply the same idea and model future margin as a function of margin, number of orders, number of items, and so on.

We start by estimating a multiple regression model with the lm() function, including all the variables in our data set. We save this fit as multiple_lm and examine the results with summary(). This works, but it soon exposes another problem associated with multiple linear regression: multicollinearity.
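A sketch of this step; the data frame name clv_data and the exact column names are assumptions based on the variables mentioned in the video:

```r
# Model future margin as a function of all explanatory variables
multiple_lm <- lm(futureMargin ~ margin + nOrders + nItems +
                    marginPerOrder + marginPerItem +
                    itemsPerOrder + gender + age,
                  data = clv_data)
summary(multiple_lm)  # coefficient table, t-tests, R-squared, F-statistic
```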

Multicollinearity occurs whenever one explanatory variable can be largely explained by the remaining explanatory variables. It makes the regression coefficients unstable, and the standard errors reported by the linear model become underestimates. To detect it, we compute variance inflation factors (VIFs), which measure how much the variance of an estimated coefficient is increased by multicollinearity. VIFs are calculated with the vif() function from the rms package; as a rule of thumb, values above 5 are problematic and values above 10 indicate poor regression estimates.
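In code, assuming the fit from above (vif() in the rms package accepts an lm object):

```r
library(rms)  # install.packages("rms") if needed

vif(multiple_lm)  # one VIF per explanatory variable
# Rule of thumb: VIF > 5 is problematic, VIF > 10 indicates
# poor regression estimates.
```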

To address this issue, we systematically check all variables in our model by calculating their VIFs. We find that two pairs of explanatory variables have high VIFs: the number of orders and the number of items, as well as margin per order and margin per item. To mitigate the problem, we exclude one variable from each pair, namely the number of items and margin per order, and refit the model. The VIFs of the new model are all at acceptable levels.
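A sketch of the refit under the same naming assumptions:

```r
# Drop nItems and marginPerOrder, one from each collinear pair
multiple_lm2 <- lm(futureMargin ~ margin + nOrders +
                     marginPerItem + itemsPerOrder + gender + age,
                   data = clv_data)
vif(multiple_lm2)  # all VIFs should now be at acceptable levels
```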

After making these adjustments, we are ready to interpret the model output. The intercept gives the expected future margin in year 2 when all independent variables are set to 0, roughly 23 in our case. On its own, the intercept is rarely meaningful to interpret; the more useful information is in the coefficient of each explanatory variable, which gives the effect that a one-unit change in that variable has on the expected future margin, with all other variables held constant.
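As a quick check, the intercept can be pulled straight from the fitted model (again assuming the refitted model from above):

```r
coef(multiple_lm2)["(Intercept)"]  # roughly 23 in the video's output
```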

The coefficient estimate of roughly 0.4 for the margin variable, for example, means that a 1-euro increase in this year's margin raises the expected future margin by about 0.40. We also need to consider the significance of these coefficients: summary() reports, for each coefficient, a t-test of whether it differs from 0, and a p-value below 0.05 means the coefficient is significantly different from 0 at the 5% significance level.
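These values can be read directly from the coefficient table returned by summary(); the column names below are standard lm() output, while the row name "margin" assumes the column naming used earlier:

```r
coefs <- summary(multiple_lm2)$coefficients
coefs["margin", "Estimate"]   # ~0.4: effect of a 1-euro margin increase
coefs[, "Pr(>|t|)"] < 0.05    # TRUE where a coefficient is significant
```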

In our example, all variables except gender, age, and items per order are significant at the 5% level. Finally, the summary output also includes an F-test of whether all coefficients are simultaneously equal to 0, but this will be explored further in future practice.

"WEBVTTKind: captionsLanguage: enin this video you'll learn about how to use multiple linear regression one threat to the accuracy of the simple linear regression from before is what's called omitted variable bias this occurs when a variable not included in the regression is correlated with both the explanatory variable and the response variable imagine we are looking at the relationship between the study time before an exam and the success achieved if we just consider these two variables we find a negative relationship the more person studies the lower her exam score will be strange isn't it since IQ is positively related to exam success and negatively related to study time we need to include this variable in the regression then with help of multiple regression I now estimate the positive effect of study time let's estimate a multiple regression model using the LM function including all the variables in the data set future margin is now modeled as a function of margin n orders n items and so on we save the model as multiple LM just as before we use summary now with multiple LM as an argument that worked although we now encounter other problems multicollinearity is one threat to a multiple linear regression this occurs whenever one explanatory variable can be explained by the remaining explanatory variables then the regression coefficients become unstable and the standard errors reported by the linear model are underestimates due to a high correlation between n orders and items as well as margin for order and margin per item these variables are candidates for multiple energy to systematically check all variables in a model for multicollinearity we calculate the variance inflation factors widths using the wave function from the RMS package these indicate the increase in the variance of an estimated coefficient due to multicollinearity with higher than 5 is problematic and values above 10 indicate poor regression estimates let's look at our models variance inflation factors as expected the Whip's foreign orders and in items as well as margin for order and margin per item are rather high hence we exclude one of each pair from the regression namely and items and margin per order here are the roofs of the new model they are all acceptable now finally we are ready to interpret the model output the intercept gives the expected margin in year 2 when all independent variables are set to 0 hence we observe an expected margin in year 2 of roughly 23 given that every explanatory variable is equal to 0 it's usually hard to make interpretations for just the value of the intercept in a multivariate regression model the coefficient of each explanatory variable gives the effect that a one unit change in that variable has on the expected margin in YouTube with all other variables being held constant the coefficient estimate of roughly 0.4 for the margin variable signifies a 0.40 increase in future mod given an increase of 1 euro for margins in the current year let's also look at the coefficient significance by default a t-test about whether or not the respective coefficient 0 is conducted if the p-value in the last column is smaller than 0.05 we can conclude that coefficient to be significantly different from 0 at the point 0 5 significance level in our example all variables except gender age and the items per order are significant at the 95% confidence level there's also a test if all coefficients are simultaneously equal to 0 but more on that later let's practice firstin this video you'll learn about how 
to use multiple linear regression one threat to the accuracy of the simple linear regression from before is what's called omitted variable bias this occurs when a variable not included in the regression is correlated with both the explanatory variable and the response variable imagine we are looking at the relationship between the study time before an exam and the success achieved if we just consider these two variables we find a negative relationship the more person studies the lower her exam score will be strange isn't it since IQ is positively related to exam success and negatively related to study time we need to include this variable in the regression then with help of multiple regression I now estimate the positive effect of study time let's estimate a multiple regression model using the LM function including all the variables in the data set future margin is now modeled as a function of margin n orders n items and so on we save the model as multiple LM just as before we use summary now with multiple LM as an argument that worked although we now encounter other problems multicollinearity is one threat to a multiple linear regression this occurs whenever one explanatory variable can be explained by the remaining explanatory variables then the regression coefficients become unstable and the standard errors reported by the linear model are underestimates due to a high correlation between n orders and items as well as margin for order and margin per item these variables are candidates for multiple energy to systematically check all variables in a model for multicollinearity we calculate the variance inflation factors widths using the wave function from the RMS package these indicate the increase in the variance of an estimated coefficient due to multicollinearity with higher than 5 is problematic and values above 10 indicate poor regression estimates let's look at our models variance inflation factors as expected the Whip's foreign orders and in items as well as margin for order and margin per item are rather high hence we exclude one of each pair from the regression namely and items and margin per order here are the roofs of the new model they are all acceptable now finally we are ready to interpret the model output the intercept gives the expected margin in year 2 when all independent variables are set to 0 hence we observe an expected margin in year 2 of roughly 23 given that every explanatory variable is equal to 0 it's usually hard to make interpretations for just the value of the intercept in a multivariate regression model the coefficient of each explanatory variable gives the effect that a one unit change in that variable has on the expected margin in YouTube with all other variables being held constant the coefficient estimate of roughly 0.4 for the margin variable signifies a 0.40 increase in future mod given an increase of 1 euro for margins in the current year let's also look at the coefficient significance by default a t-test about whether or not the respective coefficient 0 is conducted if the p-value in the last column is smaller than 0.05 we can conclude that coefficient to be significantly different from 0 at the point 0 5 significance level in our example all variables except gender age and the items per order are significant at the 95% confidence level there's also a test if all coefficients are simultaneously equal to 0 but more on that later let's practice first\n"