Data Structures in R: Understanding Data Frames
In our previous lessons, we covered data structures such as vectors, matrices, and lists. However, R is a statistical programming language, and in statistics we often work with datasets made up of observations, or instances, each with associated variables. For example, consider a dataset of five people: each person is an instance, and properties such as their name, their age, and whether they have children are the variables. How can we store this information? A matrix is not suitable, because all elements of a matrix must be of the same type, whereas the names are characters and the ages are numeric.
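To see why a matrix falls short, here is a minimal sketch. The names and ages are made-up values for illustration, not data from the lesson. Combining a character vector and a numeric vector into a matrix forces every element to character:

```r
# Hypothetical values, invented for this illustration.
name <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age <- c(28, 30, 21, 39, 35)

# cbind() builds a matrix; because a matrix holds a single type,
# the numeric ages are coerced to character strings.
people_matrix <- cbind(name, age)

print(people_matrix)
print(class(people_matrix[, "age"]))  # "character"
```

Once the ages are stored as text, numeric operations such as taking a mean no longer work on them directly.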
One possible solution is a list, since a list can hold practically anything. We could create a list of lists, where each sub-list represents a person with their name, age, and so on. However, such a structure is awkward to work with: just to get all the ages, for example, we would have to write extra code that visits every sub-list. Fortunately, R provides a more suitable data structure called a data frame.
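The list-of-lists approach might look like the sketch below. The three people and their values are hypothetical, and sapply() is just one of several ways to pull a field out of every sub-list:

```r
# Hypothetical people, invented for this illustration.
people_list <- list(
  list(name = "Anne", age = 28),
  list(name = "Pete", age = 30),
  list(name = "Frank", age = 21)
)

# Collecting all the ages means visiting every sub-list in turn.
ages <- sapply(people_list, function(person) person$age)
print(ages)
```

Every question about a single variable needs a similar loop, which is the inconvenience a data frame removes.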
A data frame is the fundamental data structure for storing typical datasets in R. It is similar to a matrix because it also has rows and columns: the rows correspond to the observations (the people in our example), while the columns correspond to the variables (the properties of each person). However, unlike a matrix, a data frame can contain elements of different types. One column can contain characters, another numerics, and yet another logicals. This is exactly what we need to store our people's information: a character column for the name, a numeric column for the age, and a logical column to denote whether the person has children. There is still one restriction on data types: elements in the same column must be of the same type. That is rarely a problem in practice, because a column such as age should hold a number for every observation anyway.
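As a quick illustration of columns with different types, the sketch below builds a small data frame from made-up values and asks for the class of each column. The stringsAsFactors = FALSE argument is set explicitly so that the result does not depend on the R version:

```r
# Hypothetical values, invented for this illustration.
people <- data.frame(
  name = c("Anne", "Pete", "Frank", "Julia", "Cath"),
  age = c(28, 30, 21, 39, 35),
  child = c(FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep the names as plain characters
)

# Each column has its own type, but a single type within the column.
column_types <- sapply(people, class)
print(column_types)
```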
In most cases, you do not create a data frame yourself; instead, you import data from another source, such as a CSV file, a relational database, or software packages like Excel or SPSS. R also provides a way to create data frames manually, using the data.frame function. To create our people dataset, which has five observations and three variables, we pass data.frame three vectors that are all of length five; each vector becomes a column. Here those vectors are name, age, and child. The function is simple to use, and it infers the column names from the names of the variables passed to it.
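A minimal sketch of this manual construction follows. The five people and their values are assumptions made for illustration, since the lesson does not list the actual data:

```r
# Three vectors of length five; the values are hypothetical.
name <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Each vector becomes one column of the data frame.
people <- data.frame(name, age, child)
print(people)

# The column names are taken from the variable names.
print(names(people))  # "name" "age" "child"
```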
To specify the column names explicitly, we can use the same techniques as for factors and lists: either call the names function on the data frame afterwards, or use equals signs inside the data.frame call to name the columns right away. As with matrices, it is also possible to name the rows of a data frame, but this is generally not a good idea. Let's take a closer look at the structure of data frames.
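Both naming techniques can be sketched as follows. The data values and the capitalized column names are hypothetical choices for this example:

```r
# Hypothetical values, invented for this illustration.
name <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Option 1: name the columns right away with equals signs.
people_a <- data.frame(Name = name, Age = age, Child = child)

# Option 2: build the data frame first, then assign names().
people_b <- data.frame(name, age, child)
names(people_b) <- c("Name", "Age", "Child")

print(names(people_a))
print(names(people_b))
```

Both routes give the same column names; naming inside the call keeps the definition in one place.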
When we examine the structure of our people dataset, the output looks suspiciously similar to that of a list. That is because, under the hood, a data frame is actually a list. In this case, it is a list with three elements, one for each column of the data frame. Each list element is a vector of length five, corresponding to the number of observations. One requirement that does not apply to ordinary lists is that the vectors in the list must all have the same length.
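The list nature of a data frame can be checked directly. This sketch assumes the structure is inspected with str() and again uses made-up values:

```r
# Hypothetical values, invented for this illustration.
people <- data.frame(
  name = c("Anne", "Pete", "Frank", "Julia", "Cath"),
  age = c(28, 30, 21, 39, 35),
  child = c(FALSE, TRUE, TRUE, FALSE, TRUE)
)

# str() shows one entry per column, much like it does for a list.
str(people)

print(is.list(people))  # TRUE: a data frame is a list underneath
print(length(people))   # 3: one list element per column

# List-style access with [[ ]] returns a whole column as a vector.
print(people[["age"]])
```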
If we try to create a data frame from three vectors that are not all of the same length, we get an error. Additionally, the name column, which we would expect to be a character vector, is actually a factor, because R stores strings as factors by default. (This was the default behavior of data.frame before R 4.0.0; later versions keep strings as characters.) To suppress this behavior, we can set the stringsAsFactors argument of the data.frame function to FALSE, after which the name column contains characters.
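Both points can be tried out with the sketch below. The values are hypothetical, and tryCatch() is used only so that the error is captured as a message instead of stopping the script:

```r
# Hypothetical values, invented for this illustration.
name <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age_short <- c(28, 30, 21, 39)  # only four values for five names
age <- c(28, 30, 21, 39, 35)

# Vectors of length 5 and 4 cannot form a data frame, so this errors.
result <- tryCatch(
  data.frame(name, age_short),
  error = function(e) conditionMessage(e)
)
print(result)

# stringsAsFactors controls how character vectors are stored.
with_factors <- data.frame(name, age, stringsAsFactors = TRUE)
with_chars <- data.frame(name, age, stringsAsFactors = FALSE)

print(class(with_factors$name))  # "factor"
print(class(with_chars$name))    # "character"
```

Setting the argument explicitly makes the code behave the same way regardless of which default the installed R version uses.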
With this new knowledge, you are ready for some first exercises. The data frame is an extremely useful and powerful data structure. Whenever you experiment with data frames, remember that they are actually lists, which is what gives them the ability to store vectors of different types. On top of that, R has additional functionality built in to easily extend and subset data frames, which I'll discuss in more detail in the next video.
"WEBVTTKind: captionsLanguage: enby now you already learned quite some things in our data structures such as vectors matrices and lists have no more secrets for you however R is a statistical programming language and in statistics will often be working with data sets such data sets are typically comprised of observations or instances all these observations have some variables associated with them you can have for example a data set of five people each person is an instance and the properties about these people such as for example their name their age and whether they have children are the variables how could you store such information are in a matrix maybe not really because the name would be a character and the H would be a numeric but these don't fit in the matrix in the list maybe this could work because you can put practically anything in a list you could create a list of lists where each sub list is a person with a name and age and so on however the structure of such a list is not really useful to work with what if you want to know all the ages for example you'd have to write a lot of our code just to get what you want but what data structure could we use them meet the data frame it's the fundamental data structure to start typical data sets it's pretty similar to a matrix because it also has rows and columns also for data frames the rows correspond to the observations the persons in our example while the columns correspond to the variables are the properties of each of these persons the big difference with matrices is that a data frame can contain elements of different types one column can contain characters another one numerix yet another one logicals that's exactly what we need to start our person's information in the data set right we could have a column for the name which is character one for the H which is numeric and one logical column to denote whether the person has children there still is a restriction on the datatype stone elements in the same 
column should be of the same type but that's not really a problem because in one column the age column for example you'll always want a numeric because an H is always a number regardless of the observation so for the practical part now creating a data frame in most cases you don't create a data frame yourself instead you typically import data from another source this could be a CSV file a relational database but could also come from other software packages like Excel or SPSS of course our provides a way to manually create data frames as well it could use the data frame function for this to create our people data frame it has five observations and three variables we'll have to pass the data frame function three vectors that are all of length five the vectors you pass correspond to the columns let's create these three vectors first name age and child now calling the data frame function is simple the printout of the data frame already shows very clearly that we're dealing with a data set notice how the data frame function invert the names of the columns from the variable names you passed it to specify the names explicitly you can use the same techniques as for factors and lists you can use the names function or use equal signs inside the data frame function to name the data frame columns right away like in matrices it's also possible to name the rows of the data frame but that's generally not a good idea so I won't go into detail on that here before you head over to some exercises let me shortly discuss the structure of data frames or more if you look at the structure there are two things you can see here first the printout looks suspiciously similar to that of a lists that's because under the hood the data frame actually is a list in this case it's a list with three elements corresponding to each of the columns in the data frame each list elements is a vector of length five bonding to the number of observations a requirement that is not present for lists is that the 
length of the vectors you put in the list has to be equal if you try to create a data frame with three vectors that are not all of the same length you'll get an error second the name column which you expect to be a character vector is actually a factor that's because our by default stores the strings as factors to suppress this behavior you can set the strings as factors argument of the data frame function to false now the name column actually contains characters with this new knowledge you're ready for some first exercises and it's extremely useful and powerful data structure whenever you're experimenting with data frames remember that they're actually in lists this gives the data frame the ability to store vectors of different types on top of that are also has some additional functionality built in to easily extend and subset data frames I'll talk more about that in the next video\n"