Tokenization: A Fundamental Component of Text Pre-Processing
Tokenization is the act of splitting text into individual tokens. These tokens can be as small as individual characters or as large as the entire text of a document. The most common token types are characters, words, sentences, and documents, but you can also split text into tokens based on a regular expression, for example splitting the text every time a number with three or more digits appears.
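As a quick illustration of regex-based splitting (a hypothetical example, not from the course data), base R's `strsplit()` can cut a string wherever such a pattern matches:

```r
# Split a string into tokens wherever a number with three or
# more digits appears; the numbers themselves are discarded.
strsplit("part one 100 part two 2500 part three", "\\s*[0-9]{3,}\\s*")
#> [[1]]
#> [1] "part one"   "part two"   "part three"
```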
There are numerous ways to tokenize text in R, but for this article, we will be using the tidytext package, which describes itself as text mining using dplyr, ggplot2, and other tidy tools. The tidytext package follows the tidy data format, so taking an introduction to the tidyverse may be helpful if you are new to tidy concepts. Throughout this article, we will work with a couple of different datasets, the first being the ten chapters of George Orwell's book "Animal Farm". Although the data is limited to just the text and the chapter number, it is particularly suitable for our purposes: it has a rich character list, themes that repeat themselves, and simple vocabulary for us to explore.
The tidytext function for tokenization is called `unnest_tokens()`. It takes a tibble, here `animal_farm`, and extracts tokens from the column specified by the `input` argument; we also specify what kind of tokens we want with the `token` argument and what the output column should be labeled with the `output` argument. Tokenization options include `"sentences"`, `"lines"`, `"regex"` for a user-specified regular expression, and many others. By piping the result into `count()`, we can quickly count the top tokens, although this is not the most interesting output yet.
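Here is a minimal sketch of that call, assuming `animal_farm` is a tibble with `chapter` and `text` columns (the column names are an assumption based on the course description):

```r
library(dplyr)
library(tidytext)

# Tokenize the text column into words and count the most frequent ones.
animal_farm %>%
  unnest_tokens(output = word, input = text, token = "words") %>%
  count(word, sort = TRUE)
```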
The most common tokens turn out to be just common English words such as "the", "of", and "to"; we will clean this up later. Another use of `unnest_tokens()` is to see what follows a given word. Boxer is one of the main characters in "Animal Farm", so let's see what chapter one says about him. We filter the data to chapter one and tokenize on any mention of "boxer", regardless of whether it is capitalized. Since the first token starts at the beginning of the text, before the first mention, we use the `slice()` function to skip it. The output is the text that follows every mention of "boxer", such as "who apparently was an enormous beast at nearly eighteen hands high".
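A sketch of this regex-based lookup, again assuming the `animal_farm` tibble from above; the `(?i)` inline flag makes the pattern case-insensitive, and the exact chapter label is an assumption:

```r
# Split chapter one wherever "boxer" (any capitalization) appears,
# so each resulting token is the text following a mention of Boxer.
animal_farm %>%
  filter(chapter == "Chapter 1") %>%
  unnest_tokens(output = following, input = text,
                token = "regex", pattern = "(?i)boxer") %>%
  slice(-1)  # the first token precedes the first mention; skip it
```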
By exploring these different aspects of tokenization, we can gain a better understanding of how to work with text data in R. In the next section, we will delve deeper into tidytext and explore more advanced techniques for preprocessing and analyzing text data.
"WEBVTTKind: captionsLanguage: ennow that we have looked at a basic way to search text let's move to a fundamental component of text pre-processing tokenization tokenization is the act of splitting text into individual tokens tokens can be as small as individual characters or as large as the entire text of document the most common types of tokens are characters words sentences documents and even separating text into tokens based on a regular expression for example splitting text every time you see a three digit or larger number r has an abundance of ways to tokenize text but we will use the tidy text package which describes itself as text mining using d plier ggplot2 and other tidy tools the tiny text package follows the tidy data format taking the introduction to the tidy verse course may be helpful if you are new to the tidy concepts throughout this course we are going to use a couple of different data sets the first being the ten chapters from the book animal farm this is a great data set for our course although our data is limited to just the text and the chapter number it has a rich character list themes that repeat themselves and simple vocabulary for us to explore the tidy text function for tokenization is called unnecessary in foot Tibble called animal farm and extracts tokens from the column specified by the input argument we also specify what kind of tokens we want and what the output column should be labeled our tokenization options include sentences lines reg X for a user specified regular expression and many others we can take this a step further by quickly counting the top tokens by simply adding the count action to the end of our code not the most interesting output yet but we will clean this up later the most common words are just common English words such as the and of and to another use of unnecessary to see what follows it in Animal Farm boxers one of the main characters let's see what chapter one says about him here we have filtered Animal Farm - chapter one and looked for any mention of boxer regardless of boxer being capitalized or not since the first token starts at the beginning of the text I am using the slice function to skip the first token the output is the text that follows every mention of boxer who apparently was an enormous beast at nearly eighteen hands high tokenizingnow that we have looked at a basic way to search text let's move to a fundamental component of text pre-processing tokenization tokenization is the act of splitting text into individual tokens tokens can be as small as individual characters or as large as the entire text of document the most common types of tokens are characters words sentences documents and even separating text into tokens based on a regular expression for example splitting text every time you see a three digit or larger number r has an abundance of ways to tokenize text but we will use the tidy text package which describes itself as text mining using d plier ggplot2 and other tidy tools the tiny text package follows the tidy data format taking the introduction to the tidy verse course may be helpful if you are new to the tidy concepts throughout this course we are going to use a couple of different data sets the first being the ten chapters from the book animal farm this is a great data set for our course although our data is limited to just the text and the chapter number it has a rich character list themes that repeat themselves and simple vocabulary for us to explore the tidy text function for tokenization is called 
unnecessary in foot Tibble called animal farm and extracts tokens from the column specified by the input argument we also specify what kind of tokens we want and what the output column should be labeled our tokenization options include sentences lines reg X for a user specified regular expression and many others we can take this a step further by quickly counting the top tokens by simply adding the count action to the end of our code not the most interesting output yet but we will clean this up later the most common words are just common English words such as the and of and to another use of unnecessary to see what follows it in Animal Farm boxers one of the main characters let's see what chapter one says about him here we have filtered Animal Farm - chapter one and looked for any mention of boxer regardless of boxer being capitalized or not since the first token starts at the beginning of the text I am using the slice function to skip the first token the output is the text that follows every mention of boxer who apparently was an enormous beast at nearly eighteen hands high tokenizing\n"