Python Tutorial - Simple Linear Regressions

Simple Linear Regressions of Time Series: An Introduction to Ordinary Least Squares (OLS)

A simple linear regression is a statistical technique for establishing a relationship between two variables, X and Y, where X is the independent variable and Y is the dependent variable. The regression finds the slope (beta) and the intercept (alpha) of the line y = alpha + beta*x that best fits the data, meaning the line that minimizes the sum of squared distances between the data points and the line. For this reason, the technique is also known as Ordinary Least Squares (OLS). Both X and Y can be time series.

Regression techniques are very common in statistics, and there are many Python packages that implement them. In the statsmodels library, there is OLS. In numpy, there is polyfit, and setting the degree parameter to 1 fits the data to a line, which is a linear regression. pandas has an OLS method, and SciPy has a linregress function.
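As a quick illustration, here is a minimal sketch of fitting the same line with statsmodels OLS and with numpy's polyfit; the synthetic data and the coefficients 0.5 and 1.2 are made up for this example:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data for illustration: y = 0.5 + 1.2*x plus a little noise.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.5 + 1.2 * x + rng.normal(scale=0.1, size=200)

# statsmodels OLS: note the explicit constant column for the intercept.
results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.params)        # approximately [0.5, 1.2]: intercept, then slope

# numpy polyfit with degree 1 fits a line: returns slope first, then intercept.
print(np.polyfit(x, y, 1))   # approximately [1.2, 0.5]
```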

When performing a linear regression, beware that the order of the X and Y arguments is not consistent across packages, so check the documentation when switching libraries. In this course, we will be using the statsmodels OLS. To regress the returns of small cap stocks on the returns of large cap stocks, we first need to compute returns from prices using the pct_change method in pandas.
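For example, a minimal sketch of computing returns from prices with pct_change; the tickers, dates, and price values below are hypothetical, chosen only to illustrate the mechanics:

```python
import pandas as pd

# Hypothetical daily closing prices for a large cap and a small cap index.
prices = pd.DataFrame(
    {"SPX": [2035.0, 2048.0, 2041.0, 2060.0],
     "R2000": [1125.0, 1140.0, 1131.0, 1152.0]},
    index=pd.date_range("2017-01-02", periods=4, freq="B"),
)

# pct_change computes (P_t - P_{t-1}) / P_{t-1}; the first row is NaN.
returns = prices.pct_change()
print(returns)
```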

To perform the regression, we must add a column of ones to the right-hand-side (independent) variables. This is necessary because the regression function assumes that if there is no constant column, you want to run the regression without an intercept. By adding a column of ones, statsmodels will compute a regression coefficient for that column, which can be interpreted as the intercept of the line. The statsmodels function add_constant is a simple way to add this column.
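Continuing the sketch above, add_constant prepends a column of ones (named 'const') to the independent variable:

```python
import statsmodels.api as sm

# Prepend a column of ones so OLS estimates an intercept.
X = sm.add_constant(returns["SPX"])
print(X.head())   # two columns: 'const' (all ones) and 'SPX'
```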

Notice that the first row of the return series is NaN (not a number). This is because each return is computed from two prices, so there is one fewer return than there are prices. To delete the first row of NaN values, we use the pandas method dropna. Now we are ready to run the regression. The first argument of the statsmodels OLS is the series that represents the dependent variable Y, and the next argument contains the independent variable or variables.
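Putting these steps together, here is a sketch of the full workflow, continuing with the hypothetical returns data from above:

```python
import statsmodels.api as sm

# Drop the leading NaN row, then regress R2000 returns on SPX returns.
returns = returns.dropna()

X = sm.add_constant(returns["SPX"])          # constant plus independent variable
results = sm.OLS(returns["R2000"], X).fit()  # dependent variable comes first
print(results.summary())
```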

In this case, the dependent variable is the R2000 returns, and the independent variables are the constant and the SPX returns. The method fit runs the regression, and the results are saved in a class instance called results. The summary method of results shows the entire regression output, of which we will focus on only a few items. In the regression output, the coefficient 1.1412 is the slope of the regression, also referred to as beta.

The coefficient above that is the intercept, which is very close to zero. We can also pull individual items out of results, like the intercept in results.params[0] and the slope in results.params[1]. Another statistic to take note of is the R-squared of 0.753, which measures how well the linear regression line fits the data. As you would expect, there is a relationship between correlation and R-squared: the magnitude of the correlation is the square root of the R-squared.
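Continuing the sketch, the individual items can be pulled out of results like this (label-based access is shown here because it avoids positional-indexing warnings in newer pandas; results.params[0] and results.params[1] are the positional equivalents):

```python
intercept = results.params["const"]   # the intercept, alpha
beta = results.params["SPX"]          # the slope, beta
r_squared = results.rsquared          # goodness of fit
```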

The sign of the correlation is the sign of the slope of the regression line: if the regression line is positively sloped, the correlation is positive, and if it is negatively sloped, the correlation is negative. In the example we just analyzed of large cap and small cap stocks, the R-squared was 0.753, which means that the linear regression line provides a good fit to the data. The slope of the regression was positive, so the correlation is also positive: the square root of 0.753, or 0.868, which can be verified by computing the correlation directly.
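That relationship can be verified directly, continuing the sketch above:

```python
import numpy as np

# The correlation's magnitude is sqrt(R^2); its sign matches the slope's sign.
corr = returns["SPX"].corr(returns["R2000"])
print(np.isclose(corr, np.sign(beta) * np.sqrt(results.rsquared)))  # True
```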

Now, it's your turn to try performing a simple linear regression on time series data using OLS.

"WEBVTTKind: captionsLanguage: enin this video you'll learn about simple linear regressions of time series a simple linear regression finds the slope beta and the intercept alpha of a line that's the best fit between a dependent variable Y and an independent variable X the X's and Y's can be to time series a linear regression is also known as ordinary least squares or OLS because it minimizes the sum of the squared distances between the data points and the regression line regression techniques are very common and therefore there are many packages in Python that can be used in stats models there is OLS in numpy there is polyfit and if you said degree equal 1 it fits the data to a line which is a linear regression panthis has an OLS method and scifi has a linear regression function beware that the order of x and y is not consistent across packages all these packages are very similar and in this course you will use the stats models OLS now you'll regress the returns of the small cap stocks on returns of large cap stocks compute returns from prices using the percent change method in pandas you need to add a column of ones as a dependent right-hand-side variable the reason you have to do this is because the regression function assumes that if there is no constant column then you want to run the regression without an intercept by adding a column of one's stats models will compute the regression coefficient of that column as well which can be interpreted as the intercept of the line the stats models method ad constant is a simple way to add a constant notice that the first row of the return series is n a n each return is computed from two prices so there was one less return than price to delete the first row of n use the pandas method drop na you're finally ready to run the regression the first argument of the stats models regression is the series that represents the dependent variable Y and the next argument contains the independent variable or variables in this case the dependent variable is the r2000 returns and the independent variables are the constant and SPX returns the method fit runs the regression and results are saved in a class instance called results the summary method of results shows the entire regression output we will only focus on a few items of the regression results in the red box the coefficient 1 point 1 4 1 2 is the slope of the regression which is also referred to as beta the coefficient above that is the intercept which is very close to 0 you can also pull out individual items from results like the intercept in results dot params 0 and the slope in results dot params 1 another statistic to take note of is the r-squared of 0.75 3 that will be discussed next from the scatter diagrams you saw that the correlation measure is how closely the data are clustered along a line the r-squared also measures how well the linear regression line fits the data so as you would expect there is a relationship between correlation and r-squared the magnitude of the correlation is the square root of the r-squared and the sign of the correlation is the sign of the slope of the regression line if the regression line is positively sloped the correlation is positive and if the regression line is negatively sloped the correlation is negative in the example you just analyzed of large cap and small cap stocks the r-squared was 0.75 3 the slope of the regression was positive so the correlation is then positive the square root of 0.75 3 or point eight six eight which can be verified by computing the 
correlation directly now it's your turnin this video you'll learn about simple linear regressions of time series a simple linear regression finds the slope beta and the intercept alpha of a line that's the best fit between a dependent variable Y and an independent variable X the X's and Y's can be to time series a linear regression is also known as ordinary least squares or OLS because it minimizes the sum of the squared distances between the data points and the regression line regression techniques are very common and therefore there are many packages in Python that can be used in stats models there is OLS in numpy there is polyfit and if you said degree equal 1 it fits the data to a line which is a linear regression panthis has an OLS method and scifi has a linear regression function beware that the order of x and y is not consistent across packages all these packages are very similar and in this course you will use the stats models OLS now you'll regress the returns of the small cap stocks on returns of large cap stocks compute returns from prices using the percent change method in pandas you need to add a column of ones as a dependent right-hand-side variable the reason you have to do this is because the regression function assumes that if there is no constant column then you want to run the regression without an intercept by adding a column of one's stats models will compute the regression coefficient of that column as well which can be interpreted as the intercept of the line the stats models method ad constant is a simple way to add a constant notice that the first row of the return series is n a n each return is computed from two prices so there was one less return than price to delete the first row of n use the pandas method drop na you're finally ready to run the regression the first argument of the stats models regression is the series that represents the dependent variable Y and the next argument contains the independent variable or variables in this case the dependent variable is the r2000 returns and the independent variables are the constant and SPX returns the method fit runs the regression and results are saved in a class instance called results the summary method of results shows the entire regression output we will only focus on a few items of the regression results in the red box the coefficient 1 point 1 4 1 2 is the slope of the regression which is also referred to as beta the coefficient above that is the intercept which is very close to 0 you can also pull out individual items from results like the intercept in results dot params 0 and the slope in results dot params 1 another statistic to take note of is the r-squared of 0.75 3 that will be discussed next from the scatter diagrams you saw that the correlation measure is how closely the data are clustered along a line the r-squared also measures how well the linear regression line fits the data so as you would expect there is a relationship between correlation and r-squared the magnitude of the correlation is the square root of the r-squared and the sign of the correlation is the sign of the slope of the regression line if the regression line is positively sloped the correlation is positive and if the regression line is negatively sloped the correlation is negative in the example you just analyzed of large cap and small cap stocks the r-squared was 0.75 3 the slope of the regression was positive so the correlation is then positive the square root of 0.75 3 or point eight six eight which can be verified by computing 
the correlation directly now it's your turn\n"