Welcome to our Course on Wrangling and Plotting Non-Numerical Data
My name is Emily Robinson, and welcome to this course. We'll learn how to effectively wrangle and plot non-numerical, or qualitative, data. We'll start by learning how to identify and inspect these variables in a dataset. Then we'll move on to using the forcats package by Hadley Wickham to manipulate the variables by renaming categories, changing their order, and collapsing multiple groups into one. In the third chapter, we'll see how we can make effective visualizations by combining forcats with other tidyverse packages like dplyr, tidyr, stringr, and ggplot2. In our final chapter, we'll recreate this visualization from the FiveThirtyEight blog using all the tools we have learned.
We've accessed this data from the fivethirtyeight R package, which provides access to the code and datasets published by FiveThirtyEight. This course focuses on two types of qualitative data: categorical (or nominal) data and ordinal data. Categorical data fall into unordered groups, while ordinal data have an inherent order but not a constant distance between them. Both types of data have a fixed and known set of possible values.
One example of categorical data is a person's occupation. You might have a survey that has people pick their occupation from a list of 30, such as doctor, teacher, or engineer, with an extra category for "other." We could think of ways to order this data, such as by median salary or years of education needed, but the occupations don't have an inherent order. On the other hand, if we asked people about their annual income and offered four choices ($0 to $50,000, $50,000 to $150,000, $150,000 to $500,000, and more than $500,000), this would be an ordinal variable because these groups go from smallest to largest. However, there's not a constant distance between each group. If you were asked to construct a mean salary from this data, you couldn't do it. This is what makes it qualitative instead of quantitative data.
R has two ways to represent qualitative variables: as characters and as factors. There are some differences under the hood, but generally you'll use factors for categorical and ordinal variables and characters otherwise. For example, names would best be represented as characters because there's no limit to the possible number of names. On the other hand, a survey question where you can select which programming languages you know among 40 possible answers can be represented as a factor.
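To make the character-versus-factor distinction concrete, here is a minimal base R sketch. All of the values below (the names, the language list, and the income responses) are invented for illustration and do not come from the course data.

```r
# Open-ended values with no fixed set of possibilities: keep as character.
names_chr <- c("Ada", "Grace", "Linus")
class(names_chr)   # "character"

# A fixed, known set of possible answers: store as a factor.
# The levels list every allowed answer, even ones nobody picked.
languages <- factor(
  c("R", "Python", "R", "SQL"),
  levels = c("Python", "R", "SQL", "Julia")
)
levels(languages)
table(languages)   # Julia is listed with a count of 0

# Ordinal data: an ordered factor records that the levels have a ranking.
income <- factor(
  c("$50,000 to $150,000", "$0 to $50,000", "More than $500,000"),
  levels = c("$0 to $50,000", "$50,000 to $150,000",
             "$150,000 to $500,000", "More than $500,000"),
  ordered = TRUE
)
income[1] > income[2]   # TRUE: comparisons follow the level order
```

The ordered factor supports comparisons like `>` but still has no meaningful mean, which matches the point that ordinal groups have an order without a constant distance between them.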
So, how do you know whether your variable is currently stored as a factor or character? There are two main methods. First, you can look at the printed output of your tibble. If you store your dataset as a tibble, a modern data frame, each column type is automatically printed out below or next to the column name. Let's print out the college_all_ages dataset, another one from the fivethirtyeight package. We can see that there aren't any factors: major and major category are both character columns.
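As a rough sketch of this first method, the code below builds a tiny tibble and prints it. It assumes the tibble package is installed, and the column names and values are a made-up stand-in rather than the real dataset.

```r
library(tibble)

# Hypothetical stand-in data; names and numbers are invented for illustration.
majors <- tibble(
  major          = c("Statistics", "Nursing", "Economics"),
  major_category = c("Mathematics", "Health", "Social Science"),
  total          = c(1000L, 2000L, 1500L)
)

# Printing a tibble shows an abbreviated type under each column name,
# such as <chr> for character, <int> for integer, and <fct> for factor.
majors

# Base R alternative that works on any data frame.
sapply(majors, class)
```

If a column were a factor, its header would show `<fct>` instead of `<chr>`.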
The second way is to use the function `is.factor()`. This will be TRUE or FALSE depending on whether, you guessed it, the input is a factor. When we check the major category column, we see, just as above, that it's not a factor.
Now that we've been introduced to qualitative variables in R, let's try working with some examples.
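The `is.factor()` check described above can be tried on a small vector. This is only a sketch: the values are invented, and the variable name `major_category` is borrowed for illustration rather than taken from the real dataset.

```r
# A character vector, as the column is stored before any conversion.
major_category <- c("Health", "Education", "Health")

is.factor(major_category)      # FALSE
is.character(major_category)   # TRUE

# Converting with factor() changes the answer.
major_category_fct <- factor(major_category)
is.factor(major_category_fct)  # TRUE
levels(major_category_fct)     # "Education" "Health"
```

For a column inside a data frame, the same check would be written as `is.factor(df$column)`, where `df` and `column` are placeholders for your own names.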