R tutorial - Cleaning and preprocessing text

Now That You Have a Corpus, It's Time to Clean It Up

Once you have gathered your raw data and created a corpus, you need to take it from its unorganized raw state and clean it up. Not every dataset calls for the same preprocessing techniques, so it pays to choose your steps deliberately. In this section, we will focus on some common preprocessing functions that can be applied to your corpus.

Before We Actually Apply Them to the Corpus

Before we dive into applying the preprocessing functions, let's learn what each one does, because you don't always apply the same set of functions for all your analyses. Base R has a function, tolower(), that converts all characters in a string to lowercase. This is helpful for term aggregation but can be harmful if you're trying to identify proper nouns like cities.
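As a quick illustration (the strings here are made up), tolower() makes differently cased variants of the same term identical, so they aggregate as one term:

```r
# Differently cased variants of the same term
text <- c("Text Mining", "text mining", "TEXT MINING")

# Base R's tolower() makes them identical
tolower(text)
#> "text mining" "text mining" "text mining"
```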

The Remove Punctuation Function

The removePunctuation() function from tm removes punctuation from your text, which can be especially helpful in social media analysis. However, it can also be harmful if you're trying to find emoticons made of punctuation marks: a smiley face :) is destroyed along with the rest of the punctuation.
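For intuition, here is a rough base-R sketch of what removePunctuation() does (the tweet is invented). Note how the emoticon disappears:

```r
tweet <- "Great product!!! Love it :)"

# Drop every punctuation character -- roughly what tm's
# removePunctuation() does to each document
gsub("[[:punct:]]", "", tweet)
#> "Great product Love it "
```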

Removing Numbers

Numbers appear in many texts, including social media posts and online reviews. Depending on your analysis, you may want to remove them with tm's removeNumbers() function. This can be useful in some contexts, but avoid it if you're trying to text mine quantities or currency amounts, since those values disappear along with every other digit.
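A base-R sketch (with a made-up review) shows the trade-off: every digit goes, including the price you might have wanted to analyze:

```r
review <- "Ordered 2 items for $25, arrived in 3 days"

# Strip every digit -- roughly what tm's removeNumbers() does
gsub("[[:digit:]]", "", review)
#> "Ordered  items for $, arrived in  days"
```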

Removing White Space

The stripWhitespace() function collapses extra tabbed whitespace and extra lines in your text into single spaces. This is very useful because stray whitespace can make the same word appear in inconsistent forms; without this function, tokens padded with multiple tabs or spaces might be handled differently during counting.
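In base-R terms (the messy string is invented), the effect is roughly this:

```r
messy <- "too   many\t\tspaces\n\nhere"

# Collapse runs of whitespace into single spaces -- roughly
# what tm's stripWhitespace() does
gsub("[[:space:]]+", " ", messy)
#> "too many spaces here"
```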

Removing Uninteresting Words

A very important function from tm is removeWords(). Common words like "the," "and," or "of" (often called stop words) are not very interesting when it comes to analyzing text because they do not provide much insight into its content. By removing these words, you can focus on more meaningful and relevant terms; tm ships a standard stop word list via stopwords().
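A base-R sketch makes the behavior concrete. The three-word stop list here is a tiny stand-in for tm's stopwords("en"):

```r
sentence   <- "the cat and the dog of the house"
stop_words <- c("the", "and", "of")  # stand-in for tm's stopwords("en")

# Delete each stop word at word boundaries -- roughly what
# tm's removeWords() does
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
gsub(pattern, "", sentence)
#> " cat   dog   house"
```

Note that the surrounding spaces are left behind, which is one reason stripWhitespace() is usually run afterwards.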

Applying Transformations

After understanding what each preprocessing function does, it's time to apply them to your corpus with tm's tm_map() function. This function is an interface that transforms your corpus by applying a mapping to its content: tm_map() takes a corpus and one of the preprocessing functions, such as removeNumbers or removePunctuation, and returns the transformed corpus.

The Transforming Function

A transformation from the tm library can be passed to tm_map() directly. If the transforming function is not from tm, however, it has to be wrapped in content_transformer(), which tells tm_map() to apply the function to the content of each document in the corpus. A tm function such as stemDocument(), which reduces words to their base form, needs no wrapper.
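Putting the pieces together, a minimal sketch (assuming the tm package is installed; the example strings are made up) looks like this:

```r
library(tm)

corpus <- VCorpus(VectorSource(c("Text MINING is fun!!",
                                 "Clean   your text, too")))

# tm's own transformations can be passed to tm_map() directly
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# A function from outside tm, like base R's tolower(), must be
# wrapped in content_transformer() so tm_map() applies it to the
# content of each document
corpus <- tm_map(corpus, content_transformer(tolower))

content(corpus[[1]])
```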

Stemming Words

The stemDocument() function applies an algorithm that reduces words to their base form, which can help cut noise and improve analysis of aggregated terms. This process is called stemming, and it's an essential technique in natural language processing (NLP) for analyzing text data.

An Algorithm to Reduce Words to Their Base

In this example, you see related forms of a word being reduced to a common base: "complicatedly," "complicated," and "complication" all get stemmed to "complic," which definitely helps with aggregating terms. Stemming lets you work with the core meaning of words and reduces noise in your data.
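tm's stemDocument() uses the real Porter algorithm (via the SnowballC package). The toy suffix-stripper below is purely an illustration of the idea, not the actual algorithm:

```r
# Toy stemmer for illustration only -- tm's stemDocument() uses the
# real Porter algorithm. Strip a few hand-picked suffixes:
toy_stem <- function(words) {
  gsub("(atedly|ation|ated)$", "", words)
}

toy_stem(c("complicatedly", "complicated", "complication"))
#> "complic" "complic" "complic"
```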

The Problem with Stemming

The problem with stemming is that it often leaves you with tokens that are not real words. A stem like "complic" never appears in English text, so the stemmed output is hard to read and hard to match against other resources. To overcome this issue, you can complete the stems back into full words using a dictionary.

Completing the Stems

The stemCompletion() function takes stemmed words and a dictionary of complete words as its arguments. In this example, the dictionary contains only "complicate," but you can see how all three stems were unified to "complicate." You can even use a corpus as your completion dictionary. There is also another group of preprocessing functions, from the qdap package, which complement these nicely. In the exercises, you will have the opportunity to work with both tm and qdap preprocessing functions, then apply them to a corpus.
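tm's stemCompletion() matches stems against a dictionary of complete words. The toy base-R version below (prefix matching only, purely illustrative) captures the idea:

```r
# Toy version of tm's stemCompletion(): complete each stem with the
# first dictionary word it is a prefix of (falling back to the stem)
toy_complete <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) > 0) hits[1] else s
  }, character(1))
}

stems      <- c("complic", "complic", "complic")
dictionary <- c("complicate")

unname(toy_complete(stems, dictionary))
#> "complicate" "complicate" "complicate"
```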
