Learning to Fit GLMs in Python Using Statsmodels
Now that you understand the building blocks of GLMs, it is time to learn how to fit a GLM in Python. The starting point is the statsmodels library, which is used for statistical and econometric analysis.
We import the library as statsmodels.api. Since version 0.5.0, statsmodels also supports formula-based entry: we can import statsmodels.formula.api, or we can import the glm function directly from it. To fit a model, we first need to describe the model using the model class GLM; then the fit method is used to fit the model. Very detailed results of the model fit can be analyzed via the summary method, and finally we can compute predictions using the predict method.
There are two ways to describe the model: using formulas or arrays. If you're familiar with the R language, then you will appreciate the ability to fit the GLM using R-style formulas. The statsmodels library uses the patsy package to convert formulas and data to the matrices which are then used in model fitting. Note that if you are using the array-based method, the intercept is not included by default. You can add it using the add_constant function.
For this course, we will use the formula-based method. The main arguments of the formula-based method are formula, data, and family. The formula is at the heart of the modeling function where the response or output is modeled as a function of the explanatory variables or the inputs. Each explanatory variable is specified and separated with the plus sign. Note that the formula needs to be enclosed in quotation marks.
We can represent explanatory variables in the model in different ways. Categorical variables are enclosed in C(), with a capital C, and removing the intercept is done with minus one. Interaction terms are written in two ways depending on the need: the colon applies only the interaction term, whereas the multiplication symbol adds the individual variables in addition to the interaction term. We will see how this works in Chapter 4.
Lastly, we can also add transformations of the variables directly in the formula. Next is the family argument. Distributions are in the families namespace; here we list only three, which we will use in this course. The default link function is denoted in parentheses, but you could choose other link functions available for each distribution. However, if you choose to use a non-default link function, you would have to specify it directly.
To view the results of the model fit, we use the summary method, which provides the main information on the model fit, such as the model description, model statistics such as log-likelihood and deviance, and estimated model parameters with their corresponding statistics. The estimated parameters are given by coef, with their standard errors, z-scores, p-values, and 95% confidence intervals. To view only the regression coefficients, we can use params on the fitted model.
The confidence intervals for the parameter estimates can be obtained by calling conf_int(). The default is 95%, which you can change using the alpha argument. With the cols argument, you can specify which confidence intervals to return. When doing predictive modeling, your final goal is to compute and assess predictions given the fitted model and test data.
The first step is to specify the test data, which should contain all the variables you have included in the final model. Note that if you don't specify test data, the function uses the data with which the model was fit. Final predictions are computed with predict. Now let's try it.
"WEBVTTKind: captionsLanguage: ennow that you understand the building blocks of GLM's it is time to learn how to fit a jl m in Python the starting point is the stats models library which is used for statistical and econometric analysis we import the library using stats models API from 0 5 0 version the formula based entry is supported which we can import as follows or we can import a GL and function directly via stats models the formula dot a P I to fit a model we first need to describe the model using the model class GL m then the meta method fit is used to fit the model very detailed results of the model fit can be analyzed via the summary method and finally we can compute predictions using the predict method there are two ways to describe the model using formulas or arrays if you're familiar with our language then you will appreciate the ability to fit the GLM using the our style formulas the stats models uses the patsy package to convert formulas and data to the matrices which are then used in model fitting note that if you are using the error based method the intercept is not included by the default you can add it using the add constant function for this course we will use the formula based method the main arguments are formula data and family the formula is at the heart of the modeling function where the response or output is modeled as a function of the explanatory variables or the inputs each explanatory variable is specified and separated with the plus sign note that the formula needs to be enclosed in quotation marks our different ways we can represent explanatory variables in the model categorical variables are enclosed with capital C removing the intercept is done with minus one the interaction terms are written in two ways depending on the mid where the semicolon applies to only the interaction term whereas the multiplication symbol will also in addition to the interaction term add individual variables we will see how this works in Chapter four lastly 
we can also add transformations of the variables directly in the formula family distributions are in the family's namespace here we list only three which we will use in this course the default link function is denoted in parentheses but you could choose other link functions available for each distribution however if you choose to use a non default link function you would have to specify it directly to view the results of the model fit we use the summary method which provides the main information on model fit such as the model description model statistics such as log likelihood and deviance and estimated model parameters with their corresponding statistics the estimated parameters are given by Co F with their standard error z-scores p-values and 95% confidence intervals to only view the regression coefficients we can use parents given model fit similarly the confidence intervals for the parameter estimates can be obtained by calling comp underscore end the default is 95% which you can change using the Alpha argument with calls argument you can specify which confidence intervals to return when doing predictive modeling your final goal is to compute and assess predictions given the fitted model and test data the first step is to specify the test data which should contain all the variables you have included in the final model note that if you don't specify test data the function uses data with which the model was fit final predictions are computed with predict now let's try thenow that you understand the building blocks of GLM's it is time to learn how to fit a jl m in Python the starting point is the stats models library which is used for statistical and econometric analysis we import the library using stats models API from 0 5 0 version the formula based entry is supported which we can import as follows or we can import a GL and function directly via stats models the formula dot a P I to fit a model we first need to describe the model using the model class GL m then 
the meta method fit is used to fit the model very detailed results of the model fit can be analyzed via the summary method and finally we can compute predictions using the predict method there are two ways to describe the model using formulas or arrays if you're familiar with our language then you will appreciate the ability to fit the GLM using the our style formulas the stats models uses the patsy package to convert formulas and data to the matrices which are then used in model fitting note that if you are using the error based method the intercept is not included by the default you can add it using the add constant function for this course we will use the formula based method the main arguments are formula data and family the formula is at the heart of the modeling function where the response or output is modeled as a function of the explanatory variables or the inputs each explanatory variable is specified and separated with the plus sign note that the formula needs to be enclosed in quotation marks our different ways we can represent explanatory variables in the model categorical variables are enclosed with capital C removing the intercept is done with minus one the interaction terms are written in two ways depending on the mid where the semicolon applies to only the interaction term whereas the multiplication symbol will also in addition to the interaction term add individual variables we will see how this works in Chapter four lastly we can also add transformations of the variables directly in the formula family distributions are in the family's namespace here we list only three which we will use in this course the default link function is denoted in parentheses but you could choose other link functions available for each distribution however if you choose to use a non default link function you would have to specify it directly to view the results of the model fit we use the summary method which provides the main information on model fit such as the model 
description model statistics such as log likelihood and deviance and estimated model parameters with their corresponding statistics the estimated parameters are given by Co F with their standard error z-scores p-values and 95% confidence intervals to only view the regression coefficients we can use parents given model fit similarly the confidence intervals for the parameter estimates can be obtained by calling comp underscore end the default is 95% which you can change using the Alpha argument with calls argument you can specify which confidence intervals to return when doing predictive modeling your final goal is to compute and assess predictions given the fitted model and test data the first step is to specify the test data which should contain all the variables you have included in the final model note that if you don't specify test data the function uses data with which the model was fit final predictions are computed with predict now let's try the\n"