R Tutorial - Binning encoding - data driven

Using Encoding Procedures for Categorical Data: A Data-Driven Approach

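The encoding procedures we have discussed work well on categorical data with a manageable number of categories. However, one-hot encoding a variable with thousands of categories will create a thousand or more new columns, and the result stays complicated even if you combine similar categories based on contextual information. Let's discuss data-driven approaches to reducing the space of categorical variables and creating meaningful features. As a running example, take the education level variable from the adult_incomes data set: it has sixteen distinct categories that we want to incorporate into a model that predicts income levels above or below $50,000, and we want to reduce these categories in a meaningful way by leveraging the outcomes associated with them.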

Reducing Categorical Variables in Data-Driven Approaches

One approach is to look at the proportions of each category with respect to income, which is the outcome variable in this example. We can achieve this by combining prop.table() with the table() function. The prop.table() function takes a table and divides each cell value by the sum of all the cells. If we pass 1 as a second argument after the table, each cell is instead divided by the sum of the cells in its row. In our example, we want the proportion of income within each grade level, so we divide by the row sums.
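As a minimal sketch of the two calls described above, the snippet below builds a tiny made-up stand-in for the data. The data frame name `adult_incomes`, the column names `education` and `income`, the income labels, and every row are assumptions for illustration only, so the resulting proportions do not match the real figures quoted in this tutorial.

```r
# Toy stand-in for the income data; all rows are invented for illustration.
adult_incomes <- data.frame(
  education = c("10th", "10th", "10th", "HS-grad", "HS-grad",
                "Bachelors", "Bachelors", "Bachelors"),
  income    = c("<=50K", "<=50K", "<=50K", "<=50K", ">50K",
                ">50K", ">50K", "<=50K")
)

# Cross-tabulate education level (rows) against income class (columns).
counts <- table(adult_incomes$education, adult_incomes$income)

# No margin: every cell is divided by the grand total of all cells.
overall_props <- prop.table(counts)

# Margin 1: every cell is divided by its row total, which gives the
# proportion of each income class within each education level.
row_props <- prop.table(counts, 1)

print(row_props)
```

With the margin set to 1, each row of `row_props` sums to 1, which is what lets us compare education levels of very different sizes on the same scale.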

Insights from Proportions

By calculating the proportions of income within each grade level, we can gain insights into possible relationships between the categories and the outcome. For instance, we can deduce that lower grade levels are associated with making less than $50,000 in a calendar year: of the individuals who only completed the 10th grade, approximately 93% make less than $50,000 a year.

Ordering Proportions

To further analyze the relationships between categories and income, we can order the proportions that correspond to making over $50,000 a year using the arrange() function from dplyr. We pass it a table that contains the education span, the income, and the corresponding proportions. We can leverage this ordering to create meaningful categories: for example, we can group categories with similar proportions of making over $50,000 a year into two or three ordered ranges.
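The ordering step might look like the sketch below. It assumes dplyr is installed, reuses a made-up toy data frame, and uses hypothetical column names (`education`, `income`, `prop`); converting the proportion table with as.data.frame() is one possible way to get a table that arrange() can sort, not necessarily the one used in the original lesson.

```r
library(dplyr)

# Toy stand-in for the income data; all rows are invented for illustration.
adult_incomes <- data.frame(
  education = c("10th", "10th", "10th", "HS-grad", "HS-grad",
                "Bachelors", "Bachelors", "Bachelors"),
  income    = c("<=50K", "<=50K", "<=50K", "<=50K", ">50K",
                ">50K", ">50K", "<=50K")
)

# Row-wise proportions, reshaped into one row per (education, income) pair.
props <- as.data.frame(
  prop.table(table(adult_incomes$education, adult_incomes$income), 1)
)
names(props) <- c("education", "income", "prop")  # hypothetical column names

# Keep only the ">50K" rows and sort from lowest to highest proportion.
ordered_props <- props %>%
  filter(income == ">50K") %>%
  arrange(prop)

print(ordered_props)
```

Sorting in ascending order puts education levels with similar shares of high earners next to each other, which makes natural cut points easier to spot.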

Grouping Categories

We can use these ranges to create meaningful education categories. For instance, we can define low education as 0-10%, medium education as 10-30%, and high education as the rest (30-100%), where the percentage is the proportion of individuals at that education level making over $50,000 a year. We can then attach this ad hoc information to our existing income data.
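One way to turn these ranges into labels is base R's cut(), sketched below. The proportions are illustrative values (only the 10th grade figure is derived from the roughly 93% quoted earlier), the column names `prop` and `edu_group` are hypothetical, and the handling of values that fall exactly on a boundary is an assumption, since the text does not say which side 10% or 30% belongs to.

```r
# Hypothetical proportions of ">50K" earners per education level.
# Values are illustrative; only 10th grade (0.07) follows from the text.
props <- data.frame(
  education = c("10th", "HS-grad", "Bachelors"),
  prop      = c(0.07, 0.16, 0.40)
)

# Bin the proportions into the three ranges: 0-10%, 10-30%, 30-100%.
# include.lowest = TRUE keeps a proportion of exactly 0 in the first bin.
# With the default right = TRUE, a boundary value such as 0.10 falls into
# the lower bin (an assumption; the text leaves this unspecified).
props$edu_group <- cut(
  props$prop,
  breaks = c(0, 0.10, 0.30, 1),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE
)

print(props)
```

cut() returns an ordered set of factor levels in the order of the breaks, so the low, medium, high ordering is preserved for later modeling steps.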

Attaching Proportions to Income Data

To create meaningful categories for our education levels, we can use inner_join() to attach the proportions table, which holds the proportion mappings for each grade level category, to the income data. An inner join takes two data frames and keeps only the records whose link value appears in both, discarding records that have no match in the other table. The link is specified using the by argument.

Linking Education with Proportions

In our example, we are linking education from the adult_incomes table with edy_span from the proportions table. This attaches the proportion associated with each grade level to every record, so that we can map it onto our desired low, medium, and high education range categories in a new column of the data frame.
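The join can be sketched as follows, assuming dplyr is installed. Both data frames are made up for illustration; the key names `education` and `edy_span` follow the text as written, while `prop`, `edu_group`, and the "Unknown" and "Doctorate" rows are hypothetical additions that show which records an inner join discards.

```r
library(dplyr)

# Toy income records; "Unknown" has no counterpart in the proportions table.
adult_incomes <- data.frame(
  education = c("10th", "HS-grad", "Bachelors", "Unknown"),
  income    = c("<=50K", "<=50K", ">50K", "<=50K")
)

# Toy proportions table; "Doctorate" has no counterpart in the income records.
# prop and edu_group are hypothetical columns with illustrative values.
props <- data.frame(
  edy_span  = c("10th", "HS-grad", "Bachelors", "Doctorate"),
  prop      = c(0.07, 0.16, 0.40, 0.74),
  edu_group = c("low", "medium", "high", "high")
)

# Link adult_incomes$education to props$edy_span; unmatched rows on either
# side are dropped, and the key keeps the name from the left table.
joined <- inner_join(adult_incomes, props, by = c("education" = "edy_span"))

print(joined)
```

Because the join silently drops unmatched rows, comparing nrow(joined) with nrow(adult_incomes) is a cheap check that every education level in the data has an entry in the proportions table.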

Creating New Columns for Education Categories

After attaching the proportions to our income data, we create a new column containing the new mappings. The low education category contains eight categories, from preschool to twelfth grade. The medium category contains four categories after graduating high school. The high education level contains a bachelor's degree and more.

Now It's Your Turn

The approach described above allows us to reduce the space of categorical variables in a data-driven manner, creating meaningful features that can be used for predictive modeling or other analyses. By leveraging insights from proportions and attaching ad-hoc information to our existing income data, we can create new columns that contain valuable mappings between education levels and income categories.

"WEBVTTKind: captionsLanguage: enthe encoding procedures we have discussed work well on categorical data with a manageable number of categories however using one hot encoding for a variable with thousands of categories will create a thousand or more new columns and will be complicated even if you combine similar categories based on contextual information let's discuss data driven approaches to reducing the space of categorical variables and creating meaningful features let's take a look at the education level variable from the adult underscore incomes data set there are sixteen distinct categories we want to incorporate into our model that predicts income levels above or below fifty thousand dollars we want to reduce these categories in a meaningful way leveraging the outcomes associated with these levels one approach is to look at the proportions of each category with respect to the income which is the outcome variable in this example we can combine prop table with the table function the prop table function takes a table which sells and divides each sell value by the sum of all the cells if you add a 1 after the table you get the value of each cell divided by the sum of the row cells in our example we want the proportion of income within each grade which is the gross sum these proportions give us insights into possible relationships the categories have with the outcome for example we can deduce that lower grade levels are associated with making less than $50,000 in in calendar year for example of the folks that only completed the 10th grade about 93 percent of those individuals make less than 50k a year we order the proportions I correspond to making over $50,000 a year using the arrange function and passing a table that contains the education span income and the corresponding proportions we can leverage this information to create meaningful categories for example we can group categories with similar proportions of making over $50,000 a year in 2/3 order ranges 
with low education from zero to ten percent medium education from ten to thirty percent and high education containing the rest from 30 to 100 percent we can attach this ad hoc information to our existing income data by using inner underscore joint and attaching the proportions table results with the proportion mappings for each grade level category an inner join takes two data frames and only combines records that have the same link discarding records with no links from either table the link is specified using the buy statement in our example we are linking education from the adult income stable with edy underscore span from a proportions table we now have the proportions associated with our education levels to map on our desired low medium and high education range categories we create a new column continued the new mappings where the low education category contains eight categories from preschool to twelfth grade the medium category contains four categories after graduating high school and the high education level contains a bachelor's degree and more now it's your turn let'sthe encoding procedures we have discussed work well on categorical data with a manageable number of categories however using one hot encoding for a variable with thousands of categories will create a thousand or more new columns and will be complicated even if you combine similar categories based on contextual information let's discuss data driven approaches to reducing the space of categorical variables and creating meaningful features let's take a look at the education level variable from the adult underscore incomes data set there are sixteen distinct categories we want to incorporate into our model that predicts income levels above or below fifty thousand dollars we want to reduce these categories in a meaningful way leveraging the outcomes associated with these levels one approach is to look at the proportions of each category with respect to the income which is the outcome variable in 
this example we can combine prop table with the table function the prop table function takes a table which sells and divides each sell value by the sum of all the cells if you add a 1 after the table you get the value of each cell divided by the sum of the row cells in our example we want the proportion of income within each grade which is the gross sum these proportions give us insights into possible relationships the categories have with the outcome for example we can deduce that lower grade levels are associated with making less than $50,000 in in calendar year for example of the folks that only completed the 10th grade about 93 percent of those individuals make less than 50k a year we order the proportions I correspond to making over $50,000 a year using the arrange function and passing a table that contains the education span income and the corresponding proportions we can leverage this information to create meaningful categories for example we can group categories with similar proportions of making over $50,000 a year in 2/3 order ranges with low education from zero to ten percent medium education from ten to thirty percent and high education containing the rest from 30 to 100 percent we can attach this ad hoc information to our existing income data by using inner underscore joint and attaching the proportions table results with the proportion mappings for each grade level category an inner join takes two data frames and only combines records that have the same link discarding records with no links from either table the link is specified using the buy statement in our example we are linking education from the adult income stable with edy underscore span from a proportions table we now have the proportions associated with our education levels to map on our desired low medium and high education range categories we create a new column continued the new mappings where the low education category contains eight categories from preschool to twelfth grade the medium 
category contains four categories after graduating high school and the high education level contains a bachelor's degree and more now it's your turn let's\n"