R Tutorial - R objects for statistical modeling

Mathematical Models and Statistical Modeling in R

A mathematical model is a model built from mathematical stuff. A statistical model is a mathematical model that's closely tied to data. In practice, statistical models are built from a special kind of mathematical stuff: the stuff that makes up computer languages.

Objects in R for Modeling

When working with models in R, you will encounter three important objects: functions, formulas, and data frames. Data frames can be thought of as a collection of variables, where each variable is represented by a column. It's a good practice to give a name to each variable in the data frame, making it easier to describe the models in terms of the names of the variables.

The rows of a data frame are called cases. Each case is one object in the real world: a person, a person at a particular time, or anything else. In every instance, the case is the object from which the values of the individual variables are measured. Functions serve several purposes in modeling: some build models, and others evaluate them, for instance by calculating a model's output for new inputs. The functions that build models generally take as inputs both a data frame and a formula that describes the relationship among the variables involved in the model.
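To make the vocabulary of variables and cases concrete, here is a minimal sketch in base R. The data frame `workers` and every value in it are invented for illustration; they are not taken from the course data.

```r
# Hypothetical toy data: each column is a variable, each row is a case.
workers <- data.frame(
  wage   = c(9.0, 5.5, 12.0, 7.5),
  sector = c("clerical", "service", "prof", "service"),
  educ   = c(12, 10, 16, 12)
)

names(workers)   # the variable names: "wage" "sector" "educ"
nrow(workers)    # the number of cases: 4
workers[2, ]     # one case: every variable measured on the second worker
workers$wage     # one variable: the wage of every case
```

Because each variable has a name, a model can later refer to `wage` or `sector` directly instead of to a column position.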

Formulas

Formulas describe how you want to relate variables to one another. A formula uses variable names, but no calculation is done with the values of those variables; instead, the formula sets up the structure of the relationship that the modeler wants to express or explore. Every formula involves the little squiggle symbol called tilde (~). There is always something to the right of the tilde, typically one or more variables separated by punctuation that looks like arithmetic but isn't. To start, that punctuation will be the plus sign; other forms of punctuation appear later.
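A short sketch shows that a formula only records structure. The variable names `wage`, `sector`, and `educ` are used here as an assumption for illustration; no objects with those names need to exist for the formula to be created.

```r
# Creating a formula performs no arithmetic: the plus sign is punctuation
# that separates variable names, not an instruction to add values.
f <- wage ~ sector + educ

class(f)      # "formula"
all.vars(f)   # "wage" "sector" "educ"
```

The variables are only looked up later, when a function is given both the formula and a data frame that contains them.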

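As an example of how functions, formulas, and data frames are used together, let's use the CPS85 data set to calculate the mean wage of workers in several different sectors of the economy. The mean function is one of the first functions newcomers to R encounter, but on its own it isn't set up to use formulas. The mosaic package upgrades mean and other functions so that they work with formulas while continuing to work in the original way. The formula "wage ~ sector" means to break down wage by sector, so using it in the mean function gives the average wage in each sector.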
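The calculation might look like the sketch below. It assumes the mosaic and mosaicData packages are installed, so that part is guarded and skipped otherwise. The second half uses base R's `aggregate()` on an invented data frame as a stand-in; that is a substitute technique, not the course's method.

```r
# With mosaic attached, mean() accepts a formula and a data frame.
if (requireNamespace("mosaic", quietly = TRUE) &&
    requireNamespace("mosaicData", quietly = TRUE)) {
  library(mosaic)
  data("CPS85", package = "mosaicData")
  print(mean(wage ~ sector, data = CPS85))   # one mean wage per sector
}

# Base R stand-in on hypothetical data: same formula, same idea.
workers <- data.frame(
  wage   = c(9, 5, 12, 7),
  sector = c("clerical", "service", "prof", "service")
)
by_sector <- aggregate(wage ~ sector, data = workers, FUN = mean)
print(by_sector)   # mean wage for clerical, prof, and service
```

In both calls the formula names the variables and the `data` argument says where to find them.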

Response Variables and Explanatory Variables

Statistical models are often built to predict or account for a single variable, known as the response variable. The basic idea is to construct a function that produces values for the response variable as the function's output. The function's inputs are called the explanatory variables.
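As a hedged illustration of a model acting as a function from explanatory inputs to response outputs, the sketch below uses base R's `lm()` and `predict()`. The text does not name a model-building function, so the choice of `lm()` is an assumption, and the data are invented.

```r
# Hypothetical data: wage is the response, sector the explanatory variable.
workers <- data.frame(
  wage   = c(9, 5, 12, 7, 10, 6),
  sector = c("clerical", "service", "prof", "service", "prof", "clerical")
)

# Build the model from a formula and a data frame.
model <- lm(wage ~ sector, data = workers)

# Evaluate the model: supply new inputs, get response values as output.
new_cases <- data.frame(sector = c("prof", "service"))
predict(model, newdata = new_cases)   # prof: 11, service: 6 (the group means)
```

With a single categorical explanatory variable, the model's output for each sector is simply that sector's mean wage.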

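In formulas for models, the response variable is always to the left of the tilde, and the explanatory variables are to the right. You can think of a formula as a sentence that relates the response and the explanatory variables. There are several English equivalents to the tilde. For example, "wage ~ sector" can be read as any of these: wage as a function of sector, wage accounted for by sector, wage modeled by sector, wage explained by sector, wage given sector, or wage broken down by sector.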
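The left and right sides of a formula can be inspected directly in base R. This is a small sketch; the variable `educ` is an assumed extra explanatory variable added only to show more than one name on the right.

```r
f <- wage ~ sector + educ

# A two-sided formula has three parts: the tilde, the left side, the right side.
response    <- all.vars(f[[2]])   # left of the tilde
explanatory <- all.vars(f[[3]])   # right of the tilde

response      # "wage"
explanatory   # "sector" "educ"
```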

The null hypothesis comes up in most statistics courses. It states that nothing is happening: the variation in the data is nothing but randomness. Statistical modeling takes a different aim. The object is to account for and explain variation; some randomness remains, but it is what's left over after accounting for the rest.

Statistical Modeling

This course focuses on statistical modeling, where objects, formulas, functions, data frames, response variables, explanatory variables, and more come together to understand how statistical models work in R. Statistical models are built using a combination of these elements to predict or account for a single variable. By constructing a function that produces values for the response variable, we can use statistical models to gain insights into real-world phenomena.

Statistical modeling involves explaining how variables relate to one another and accounting for variation in data. It's an essential skill for anyone looking to analyze and interpret data. The process of building and using statistical models is complex but rewarding, offering a powerful toolset for understanding the world around us.

Transcript

In the first video I said that a mathematical model is a model built from mathematical stuff. A statistical model is a mathematical model that's closely tied to data. In practice, statistical models are built from a special kind of mathematical stuff: the stuff that makes up computer languages.

In this video we'll examine some of the kinds of objects in R that you will encounter in your work with models. Three of the most important objects for modeling are functions, formulas, and data frames. Some people describe a data frame as a kind of spreadsheet or matrix with rows and columns. I prefer to think of data frames more simply, as a collection of variables. Each of the variables is a column, and in modeling it's a good practice to give a name to each of the variables in the data frame. That makes it easier to describe the models in terms of the names of the variables. The rows of a data frame are called cases. Each case is one object in the real world. The case might be a person, or it might be a person at a particular time, or anything else, but always the case is the object from which values for the individual variables are measured.

We're going to use functions for several purposes, both to build models and to evaluate those models, for instance to calculate the output of models for new inputs. The functions that build models will generally take as their inputs both a data frame and a formula that describes the relationship among the variables involved in the model.

Formulas are a way to describe how you want to relate variables to one another. In a formula, variable names are used, but no calculation is done with the values in those variables. Instead, the formula sets up the structure of the relationship that the modeler wants to express or explore. All formulas involve the little squiggle symbol called tilde. There will always be something to the right of the tilde, typically one or more variables separated by punctuation that looks like arithmetic but isn't. To start, that punctuation will be the plus sign, but later you will see other forms of punctuation.

As an example of how functions, formulas, and data frames are used together, let's use a data set called CPS85 and calculate the mean wage of workers in several different sectors of the economy. The mean function is one of the first functions newcomers to R encounter, but it isn't set up to use formulas. The mosaic package upgrades mean and other functions so that they work with formulas while continuing to work in the original way. The formula wage ~ sector means to break down the wage by sector. Using that formula in the mean function gives the average wage in each sector.

Statistical models are often built to predict or account for a single variable, which we will call the response variable. The basic idea is to construct a function that produces values for the response variable as the function's output. The function's inputs are called the explanatory variables. In formulas for models, the response variable is always to the left of the tilde; the explanatory variables are to the right of the tilde. You can think of formulas as a sentence that relates the response and the explanatory variables. There are several English equivalents to tilde. For example, wage ~ sector can be read as any of these: wage as a function of sector, wage accounted for by sector, wage modeled by sector, wage explained by sector, wage given sector, wage broken down by sector.

"Sorry to be late."

"And you are?"

"Dr. Know, of the null hypothesis."

"Meaning?"

"The null hypothesis is that nothing is happening, that variation is nothing but randomness."

"OK, but this is a course on statistical modeling. Our object is to account for and explain variation. Sure, there's some randomness, but it's what's left over after our accounting for the rest."

"But every statistics course needs me."

"I think you're looking for the t-test course. This is statistical modeling. Bye."

Let's get back to building models.