Introduction to Correlation and Regression
I'm Ben Baumer, an assistant professor in the Statistical and Data Sciences program at Smith College. In this course on correlation and regression, we will explore techniques for characterizing and quantifying the relationship between two numeric variables. Previous courses covered how to describe the distribution of a single variable. That is useful, but in many cases we are more interested in how two variables relate to each other, and this course introduces the techniques for doing so.
The Concept of Response and Explanatory Variables
In a statistical model, we generally have one output variable and one or more input variables. We refer to the output as the response variable and denote it by Y; in other disciplines it may be called the dependent variable. The response is a quantity that we think might be related to the input, or explanatory, variable in some way. We denote explanatory variables by X; in other fields these are called independent or predictor variables. In this course we will work with a single explanatory variable, and later courses cover models with several.
Visualizing Relationships with Scatter Plots
Just as a histogram or density plot shows the distribution of one variable, statisticians have a commonly used framework for visualizing the relationship between two numeric variables: the scatter plot. The scatter plot has been called the most generally useful invention in the history of statistical graphics. It is a simple two-dimensional plot in which each dot is a single observation and each of its two coordinates is the value of one variable measured on that observation. By convention, the response variable goes on the vertical (y) axis and the explanatory variable on the horizontal (x) axis. In ggplot, we bind the x and y aesthetics to our explanatory and response variables and then use geom_point to actually draw the points.
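As a minimal sketch of that binding, the snippet below draws a scatter plot with ggplot2. The data are simulated purely for illustration, and the names possum_demo, tail_length, and total_length are hypothetical; they are not the course's dataset or its column names.

```r
library(ggplot2)

# Simulated stand-in data (NOT the course's possum measurements)
set.seed(1)
tail_length  <- runif(60, min = 32, max = 43)
total_length <- 50 + tail_length + rnorm(60, sd = 3)
possum_demo  <- data.frame(tail_length, total_length)

# Explanatory variable on x, response on y; geom_point() draws one dot per row
p <- ggplot(data = possum_demo, aes(x = tail_length, y = total_length)) +
  geom_point()

p
```

Swapping the two aesthetics would still produce a valid plot, but it would break the convention of putting the response on the vertical axis.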
Creating Scatter Plots with ggplot
Let's look at a scatter plot in action: the total length of a possum's body as a function of the length of its tail. Note that the axes have been labeled automatically with the names of the variables. For clarity, it is important to give your axes human-readable labels, which we can do with the scale_x_continuous and scale_y_continuous functions.
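The labeling step might look like the following sketch. The data frame is again simulated and its column names are hypothetical, and the label text, including the unit, is an assumption made for the example.

```r
library(ggplot2)

# Simulated stand-in data (NOT the course's possum measurements)
set.seed(1)
tail_length  <- runif(60, min = 32, max = 43)
total_length <- 50 + tail_length + rnorm(60, sd = 3)
possum_demo  <- data.frame(tail_length, total_length)

# The `name` argument of each scale replaces the default axis title,
# which would otherwise be the raw column name
p <- ggplot(possum_demo, aes(x = tail_length, y = total_length)) +
  geom_point() +
  scale_x_continuous(name = "Length of possum tail (cm)") +
  scale_y_continuous(name = "Length of possum body (cm)")

p
```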
Discretizing Explanatory Variables
As you may know, a box plot can illustrate the relationship between a numerical response variable and a categorical explanatory variable. A scatter plot can be thought of as a generalization of side-by-side box plots. We can connect these ideas by discretizing our explanatory variable. In R, this can be achieved with the cut function, which takes a numeric vector and chops it into discrete chunks. When breaks is given as a single number, it specifies the number of chunks. In this example, we use five breaks to separate the possums into five groups based on their tail length.
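A small base-R sketch of the discretization step, using a made-up vector of tail lengths (the values are illustrative only):

```r
# Hypothetical tail lengths, made up for this example
tail_length <- c(32.0, 35.5, 36.0, 36.5, 37.0, 38.0, 39.0, 40.0, 43.0)

# breaks = 5 asks cut() for five equal-width intervals spanning the data range
tail_group <- cut(tail_length, breaks = 5)

levels(tail_group)  # the interval labels
table(tail_group)   # how many observations fall in each interval
```

The result is a factor, so it can stand in for a categorical explanatory variable. The intervals have equal width rather than equal counts, so some groups may hold far fewer observations than others.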
Using ggplot to Create Discretized Scatter Plots
Finally, we switch from geom_point to geom_boxplot to draw the boxes. Note how the median body length increases as the tail length increases across the five groups. Now it's time for you to get some practice making scatter plots using these techniques.
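Putting the pieces together, a hedged sketch of the discretized box plot could look like this. The data are simulated with an upward trend built in, so the rising medians here come from the simulation and are not a finding about real possums; all object and column names are hypothetical.

```r
library(ggplot2)

# Simulated stand-in data (NOT the course's possum measurements)
set.seed(1)
tail_length  <- runif(60, min = 32, max = 43)
total_length <- 50 + tail_length + rnorm(60, sd = 3)
possum_demo  <- data.frame(tail_length, total_length)

# cut() turns the numeric x variable into five groups;
# geom_boxplot() then draws one box per group
p <- ggplot(possum_demo, aes(x = cut(tail_length, breaks = 5), y = total_length)) +
  geom_boxplot()

p
```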
Transcript
Hi, I'm Ben Baumer. I'm an assistant professor in the Statistical and Data Sciences program at Smith College, and I'll be your instructor for this course on correlation and regression. In the previous courses, you've learned how to describe the distribution of a single variable. This is useful, but in many cases what we are more interested in is understanding the relationship between two variables. In particular, in this course you will learn techniques for characterizing and quantifying the relationships between two numeric variables.
In a statistical model, we generally have one variable that is the output and one or more variables that are the inputs. We will refer to the output variable as the response, and we'll denote it with the letter Y. In other disciplines or contexts, you may hear this quantity called the dependent variable. More generally, the response variable is a quantity that we think might be related to the input or explanatory variable in some way. We typically denote any explanatory variables with the letter X. In this course we will have a single explanatory variable, but in the next course we will have several. In other fields, these can be called independent or predictor variables.
Just as you learned to visualize the distribution of one variable with a histogram or density plot, statisticians have developed a commonly used framework for visualizing the relationship between two numeric variables: the scatter plot. The scatter plot has been called the most generally useful invention in the history of statistical graphics. It is a simple two-dimensional plot in which each of the two coordinates of a dot represents the value of one variable measured on a single observation. By convention, we always put the response variable on the vertical or y-axis and the explanatory variable on the horizontal or x-axis. In ggplot, we bind the x and y aesthetics to our explanatory and response variables and then use geom_point to actually draw the points.
Here we can see a scatter plot of the total length of a possum's body as a function of the length of its tail. Note that the axes have been labeled with the names of the variables automatically. For clarity, it is important to give your axes human-readable labels. We can do that with the scale_x_continuous and scale_y_continuous functions.
Since you already know how a box plot can illustrate the relationship between a numerical response variable and a categorical explanatory variable, it may be helpful to think of a scatter plot as a generalization of side-by-side box plots. We can connect these ideas by discretizing our explanatory variable. This can be achieved in R using the cut function, which takes a numeric vector and chops it into discrete chunks. The breaks argument specifies the number of chunks. Here we use five breaks to separate the possums into five groups based on their tail length.
Finally, we change to geom_boxplot to make the boxes. Note how the median body length increases as the tail length increases across the five groups. Now it's time for you to get some practice making scatter plots.