R tutorial - Cleaning and preprocessing text
Now That You Have a Corpus, It's Time to Clean It Up
Once you have gathered your raw data and created a corpus, the next step is to clean it up with preprocessing functions. Which functions you apply depends on the data and the analysis, because not every dataset benefits from the same cleaning steps. This section covers some common preprocessing functions that can be applied to your corpus.
Before We Actually Apply Them to the Corpus
Before applying the preprocessing functions, let's look at what each one does. You won't always apply the same set of functions in every analysis, so it helps to understand the purpose of each. Base R has a function, tolower(), that converts all characters in a string to lowercase. This is helpful for term aggregation but can be harmful if you're trying to identify proper nouns like cities.
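As a minimal sketch of the lowercase step, the snippet below applies base R's tolower() to a made-up sentence (the sentence is illustrative and not taken from the tutorial's data). The capital letters that mark "Paris" and "New York" as proper nouns are lost.

```r
# Illustrative input; any character vector works.
text <- "Flights from Paris to New York were delayed"

# tolower() is part of base R, so no packages are needed.
lowered <- tolower(text)
print(lowered)
```

After this step "Paris" and "paris" count as the same term, which is the aggregation benefit and the proper-noun cost described above.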
The Remove Punctuation Function
The removePunctuation() function strips punctuation from your text, which can be especially helpful in social media analysis. However, it can be harmful if you're trying to find emoticons made of punctuation marks. For instance, a smiley face :) is removed entirely. Depending on your analysis, you may also want to remove numbers or keep them as part of your data.
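The emoticon problem can be sketched as follows, assuming the function described above is removePunctuation() from the tm package and that tm is installed. The example tweet is invented for illustration.

```r
library(tm)  # assumes the tm package is installed

# Invented example text containing an emoticon and a hashtag.
tweet <- "Great coffee today :) #mornings, right?"

# Called on a character vector, removePunctuation() drops punctuation characters.
cleaned <- removePunctuation(tweet)
print(cleaned)

# The emoticon is gone, so it can no longer be detected.
grepl(":)", cleaned, fixed = TRUE)
```

If emoticons matter to the analysis, one option is to extract or recode them before this step runs.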
Removing Numbers
Numbers appear in many texts, including social media posts and online reviews. Removing them with removeNumbers() can be useful in certain contexts but should be avoided if you're trying to analyze quantities or currency amounts. For instance, if a post mentions prices or like counts, removing numbers discards exactly the values you want to study.
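A short sketch of this trade-off, assuming tm's removeNumbers() is the function meant here. The review text is made up for illustration.

```r
library(tm)  # assumes the tm package is installed

# Invented review text with a price and a quantity.
review <- "Paid 25 dollars for 3 coffees"

# removeNumbers() deletes digit characters from the text.
no_numbers <- removeNumbers(review)
print(no_numbers)
```

The amounts are no longer recoverable from the result, which is why this step is best skipped when quantities are part of the question.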
Removing White Space
The stripWhitespace() function collapses extra whitespace, such as repeated spaces, tabs, and blank lines, into a single space. This is a very useful function because it keeps stray whitespace from interfering with how the text is split into words and counted.
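As an illustration, the sketch below runs tm's stripWhitespace() on an invented string that contains repeated spaces, tabs, and blank lines.

```r
library(tm)  # assumes the tm package is installed

# Invented input with runs of spaces, tabs, and newlines.
messy <- "too    many\t\tspaces\n\nhere"

# Each run of whitespace is collapsed into a single space.
tidy <- stripWhitespace(messy)
print(tidy)
```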
Removing Uninteresting Words
The removeWords() function removes a set of words that you supply, typically common stop words like "the," "and," or "of." These words are not very interesting when analyzing text because they occur everywhere and say little about the content. By removing them, you can focus on more meaningful and relevant terms.
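A minimal sketch of stop word removal, assuming tm's removeWords() together with its built-in English list from stopwords("en"). The sentence is invented for illustration.

```r
library(tm)  # assumes the tm package is installed

# Invented example sentence.
sentence <- "the cat and the dog sat on the mat"

# Remove every word that appears in tm's English stop word list.
kept <- removeWords(sentence, stopwords("en"))

# Removal leaves gaps behind, so tidy the spacing afterwards.
kept <- trimws(stripWhitespace(kept))
print(kept)
```

The second argument is an ordinary character vector, so domain-specific words can be appended to the list with c().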
Applying Transformations
Now that you know what each preprocessing function does, it's time to apply them to your corpus using the tm_map() function from the tm (text mining) package. This function is an interface that transforms your corpus by applying a mapping to its content: tm_map() takes a corpus and a preprocessing function, applies the function to each document, and returns the transformed corpus.
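The mapping step might look like the sketch below. It assumes the tm package and builds a tiny corpus from two invented documents; the names docs and corpus are illustrative only.

```r
library(tm)  # assumes the tm package is installed

# Two invented documents used to build a small corpus.
docs <- c("First document, with punctuation!",
          "Second document: 2 numbers, 10 words?")
corpus <- VCorpus(VectorSource(docs))

# tm_map() applies one preprocessing function to every document.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

# Inspect the cleaned text of the first document.
first_doc <- content(corpus[[1]])
print(first_doc)
```

Each call returns a new corpus, so the steps can be chained in whatever order the analysis calls for.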
The Transforming Function
The transformation function can be passed to tm_map() directly, or wrapped in content_transformer() when it is a function that was not written for corpus objects, such as base R's tolower(). For instance, you might apply stemDocument(), which uses an algorithm to reduce words to their base form.
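The wrapping step can be sketched as follows, assuming tm's content_transformer() is the wrapper referred to above. The two documents are invented for illustration.

```r
library(tm)  # assumes the tm package is installed

# Small invented corpus with mixed case.
corpus <- VCorpus(VectorSource(c("Hello WORLD", "Text Mining In R")))

# tolower() works on plain strings, so wrap it to operate on document content.
corpus <- tm_map(corpus, content_transformer(tolower))

lowered_doc <- content(corpus[[1]])
print(lowered_doc)
```

The same pattern should work for other plain string functions, including ones you write yourself.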
Stemming Words
The stemDocument() function applies a stemming algorithm that reduces words to a common base form, which helps reduce noise and improves the analysis of aggregated terms. Stemming is a widely used technique in natural language processing (NLP) for analyzing and understanding text data.
An Algorithm to Segment Words to Their Base
In this example, several forms of a word are reduced to one shorter stem: variants such as "complicated" and "complication" all end up as the same truncated token, "complic." Mapping every variant to a single stem helps with analysis, because their counts are aggregated under one term and there is less noise in your data.
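A minimal stemming sketch, assuming tm's stemDocument(), which relies on the SnowballC package also being installed. The three input words are illustrative variants of "complicate."

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

# Illustrative variants of the same underlying word.
words <- c("complicated", "complication", "complicatedly")

# Reduce each word to its stem.
stems <- stemDocument(words)
print(stems)

# All variants collapse to a single shared stem.
length(unique(stems))
```

The stem itself is a truncated token rather than a dictionary word, which matters when results are shown to readers.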
Aggregate Terms
The problem with stemming is that it often leaves you with tokens that are not real words, which makes them hard to read and interpret in further analysis. For instance, the stem of "complicated" is a truncated fragment rather than a word you would want to show in a report. To overcome this issue, stem completion maps each stem back to a complete word, so related terms are still aggregated but under a single readable token.
Stem Completion and Its Arguments
The stemCompletion() function takes as arguments the stemmed words and a dictionary of complete words. In this example the dictionary contains only "complicate," but you can see how all three words were unified to "complicate." You can even use a corpus as your completion dictionary. There is another group of preprocessing functions from the qdap package which complement these nicely. In the exercises you will have the opportunity to work with both tm and qdap preprocessing functions.
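The completion step can be sketched as below, assuming the function in question is tm's stemCompletion() and that SnowballC is installed for stemming. The three input words are illustrative, and the one-word dictionary mirrors the example described above.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

# Stem three illustrative variants of the same word.
stems <- stemDocument(c("complicated", "complication", "complicatedly"))

# A dictionary of complete words; here it holds a single entry.
dictionary <- c("complicate")

# Map each stem back to a complete word from the dictionary.
completed <- stemCompletion(stems, dictionary)
print(completed)
```

The result is a named character vector whose names are the stems, so unname() is handy when only the completed words are needed.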
