R Tutorial - How many words do YOU know? Zipf's law & subjectivity lexicon

The Power of Subjectivity Lexicons: Unlocking Sentiment Analysis

In the world of sentiment analysis, many methods rely on the use of subjectivity lexicons. But what exactly is a subjectivity lexicon, and why does it work so well? A subjectivity lexicon is a predefined list of words associated with specific emotions or positive or negative feelings. For instance, words like "bad," "awful," and "terrible" can all be reasonably linked to a negative state, while words such as "perfect" or "ideal" are more closely tied to positivity.

In its simplest form, sentiment analysis is a comparison between an author's text and a predefined subjectivity lexicon. The visual created earlier in this course illustrates the idea by comparing a subjectivity lexicon to some fictitious text. In this course we primarily work with the polarity() function from the qdap package, which uses an academic lexicon from the University of Illinois Chicago, and with the tidytext sentiments table, which is made up of three different lexicons. There is also an R package called "lexicon" that contains a large collection of subjectivity and other word lists.
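As a minimal sketch of that comparison, the base R snippet below scores a sentence against a toy lexicon. The five words, their scores, and the helper `score_text` are assumptions made for illustration; they are not taken from qdap, tidytext, or the lexicon package.

```r
# Toy subjectivity lexicon: hypothetical words and scores, not a real lexicon.
lexicon <- c(bad = -1, awful = -1, terrible = -1, perfect = 1, ideal = 1)

# Hypothetical helper: lowercase the text, split it into words,
# keep the words found in the lexicon, and add up their scores.
score_text <- function(text, lexicon) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  hits <- tokens[tokens %in% names(lexicon)]
  list(matched = hits, score = sum(lexicon[hits]))
}

result <- score_text(
  "The food was awful, but the service was perfect. Just perfect!",
  lexicon
)
print(result$matched)
print(result$score)
```

Only three of the eleven words in the sentence are in the lexicon, yet they are enough to give the passage a net score. Real lexicons and scoring functions are more elaborate, but the lookup idea is the same.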

Standard lexicons are a useful starting point, but you should always consider adjusting them, since an off-the-shelf lexicon is unlikely to be accurate for your exact need. Short word lists may also seem too small to be useful: you may know more than 50,000 words, yet most subjectivity lexicons contain only a few thousand.

So how can searching for so few terms provide an accurate sentiment analysis? The answer rests on two principles: Zipf's law and the principle of least effort. Zipf's law states that, in a given text, the frequency of any word is inversely proportional to its rank in the frequency table. Put simply, if you count word frequencies in a passage, the second most frequent word appears about half as often as the most frequent one (1/2), the third about a third as often (1/3), and so on. The same pattern can be observed outside of language, for instance in city populations and sometimes in industry market shares.

The principle of least effort, which comes from library science, is the second factor. A speaker or writer does not want to exert much effort when communicating, and the audience does not want to spend much energy interpreting. As a result, word choice, or lexical diversity, becomes limited: people may know tens of thousands of words but consistently use only a few to express meaning.

This works in favor of subjectivity lexicons. Because a small set of words does most of the work, the word lists do not have to be long to be useful, which also keeps them manageable.
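To get a feel for how concentrated word usage can be, the sketch below assumes an idealized 50,000-word vocabulary whose frequencies follow 1/rank exactly and asks what share of all word occurrences the top-ranked words account for. The vocabulary size, the exact 1/rank weights, and the helper `coverage` are illustrative assumptions, not measurements from real text.

```r
# Assumption: an idealized vocabulary whose word frequencies follow 1 / rank.
vocab_size <- 50000
weights <- 1 / seq_len(vocab_size)

# Hypothetical helper: share of all word occurrences covered by the
# top_n most frequent words.
coverage <- function(top_n) {
  sum(weights[seq_len(top_n)]) / sum(weights)
}

print(coverage(100))   # top 100 words
print(coverage(2000))  # top 2,000 words
```

Under these assumptions the top 2,000 words cover roughly 70 percent of all occurrences. Note that in real text the very top ranks are mostly function words such as "the", so this illustrates how concentrated usage is rather than what a sentiment lexicon would match.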

In the following section, we will create a visual demonstrating Zipf's law on 3 million tweets and apply qdap's polarity() function to some actual text to see a subjectivity lexicon in action.

Zipf's Law: The Frequency of Words

The first principle that explains why small lexicons work is Zipf's law. Named after the linguist George Kingsley Zipf, it states that, in a given text, the frequency of any word is inversely proportional to its rank in the frequency table: the word at rank r appears roughly 1/r as often as the most frequent word.

For instance, suppose the most frequent word in a passage appears 1,000 times. Zipf's law predicts that the second most frequent word appears about 500 times (1/2), the third about 333 times (1/3), and so on down the list. Real text only approximates these numbers, but the steep drop from the top ranks is typical.
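The 1/rank relationship can be sketched in a few lines of base R. The top count of 1,000 and the helper `zipf_expected` are assumptions chosen for illustration.

```r
# Hypothetical helper: expected count under an idealized Zipf's law,
# i.e. the count of the top word divided by the rank.
zipf_expected <- function(top_count, ranks) {
  top_count / ranks
}

ranks <- 1:5
expected <- data.frame(
  rank = ranks,
  expected_count = zipf_expected(1000, ranks)
)
print(expected)
```

Each row is the top count divided by its rank, so rank 2 gets half of the top count and rank 4 gets a quarter.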

Zipf's law can be observed outside of language as well. In a table of US city populations, for example, population tends to fall off with rank in much the same way that word frequency does. The same pattern sometimes appears in industry market shares.

The principle of least effort plays an equally important role. According to this principle from library science, a speaker or writer does not want to exert much effort when communicating, and the audience does not want to spend much energy interpreting, so word choice, or lexical diversity, becomes limited.

Put bluntly, people are lazy, at least when it comes to expressing themselves. They may know tens of thousands of words but consistently use only a few to express meaning. This works out well for subjectivity lexicons, because the word lists do not have to be long to be useful.

Creating Visualizations: Demonstrating Zipf's Law on 3 Million Tweets

In this section, we will create a visual demonstrating Zipf's law on 3 million tweets. Counting how often each word occurs and ranking the words by frequency shows the inverse relationship between rank and frequency, and with it why a short lexicon can still match a large share of the words people actually use.

To do this, we tokenize the tweets, count the frequency of each word, rank the words from most to least frequent, and then plot frequency against rank.
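The counting and ranking steps can be sketched in base R as follows. The three made-up tweets stand in for the 3 million used in the course, and base graphics are used only to keep the example dependency-free; the course's own plotting code may differ.

```r
# Stand-in corpus: a few made-up tweets (assumption; the course uses
# about 3 million real ones).
tweets <- c(
  "the coffee here is the best",
  "the service was slow but the coffee was great",
  "great coffee and the staff is great"
)

# Tokenize: lowercase and split on anything that is not a letter.
tokens <- unlist(strsplit(tolower(tweets), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Count each word, sort from most to least frequent, and attach ranks.
counts <- sort(table(tokens), decreasing = TRUE)
freq_table <- data.frame(
  word = names(counts),
  freq = as.integer(counts),
  rank = seq_along(counts)
)

# What an idealized Zipf's law would predict from the top count.
freq_table$zipf_expected <- freq_table$freq[1] / freq_table$rank

# On log-log axes, Zipf's law shows up as a roughly straight falling line.
plot(freq_table$rank, freq_table$freq, log = "xy", type = "b",
     xlab = "Rank", ylab = "Frequency")
lines(freq_table$rank, freq_table$zipf_expected, lty = 2)
```

With so little text the fit is rough, but the same steps apply unchanged to a large corpus, where the observed points usually track the dashed reference line much more closely.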

Applying qdap's polarity() Function: A Subjectivity Lexicon in Action

In this section, we will apply qdap's polarity() function to some actual text to see a subjectivity lexicon in action. The function compares the words in the text against its lexicon and returns a polarity score that summarizes the author's sentiment.

We will start by preparing the text for analysis. Next, we will use polarity() to find the positive and negative lexicon words in the text. The function also looks at the words surrounding each match, such as negators and amplifiers, before combining everything into a single score for the passage.
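The sketch below is a deliberately simplified stand-in for this kind of scoring, not qdap's implementation. The mini-lexicon, the negator list, and the helper `simple_polarity` are all hypothetical; the sketch only flips the sign of a lexicon word that directly follows a negator and scales the total by the square root of the word count.

```r
# Hypothetical mini-lexicon and negator list (not qdap's real word lists).
lexicon  <- c(good = 1, great = 1, bad = -1, terrible = -1)
negators <- c("not", "never", "no")

# Hypothetical helper: a simplified polarity score for one passage.
simple_polarity <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  total <- 0
  for (i in seq_along(tokens)) {
    word <- tokens[i]
    if (word %in% names(lexicon)) {
      value <- lexicon[[word]]
      # Flip the sign when the word directly before is a negator.
      if (i > 1 && tokens[i - 1] %in% negators) value <- -value
      total <- total + value
    }
  }
  # Scale by the square root of the word count so long passages
  # do not dominate simply because they contain more words.
  total / sqrt(length(tokens))
}

print(simple_polarity("the movie was not good"))
print(simple_polarity("great cast and a great story"))
```

The first sentence scores below zero because "not" reverses "good", while the second scores above zero. For real work, use the polarity() function itself and consult its documentation for the exact weighting it applies.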

Working through the results shows how a subjectivity lexicon operates in practice and gives a concrete view of the principles outlined earlier.

"WEBVTTKind: captionsLanguage: enboom a little refresher and a polarity visualization pretty good start many sentiment analysis methods use a subjectivity lexicon let's learn what a subjectivity lexicon is and why it works the polarity function you just applied uses a subjectivity lexicon a subjectivity lexicon is a predefined list of words associated with a specific emotion or positive or negative feelings for example the words bad awful and terrible can all reasonably be associated with a negative state in contrast perfect or ideal can be connected with positivity in some cases sentiment analysis is merely the comparison between the author's text and the predefined subjectivity lexicon the visual you made is based on a subjectivity lexicon compared to some fictitious text more on that later for now focus on the subjectivity lexicon and why it works in this course we primarily work with cue taps polarity function which uses in academic lexicon from the University of Illinois Chicago plus we also spend a lot of time with the tidy text sentiments table made of three different lexicons there is even in our package called lexicon that has a bunch of subjectivity and other word lists while we use the standard ones from cue Daffy and tidy text it's worth exploring the many word list from the lexicon library plus you should always think about adjusting your lexicons since it's unlikely to be accurate for your exact need so now you may be saying to yourself little lists of words aren't going to help me with my text because people are so expressive in fact you may know more than fifty thousand words and most subjectivity lack lexicons are only a few thousand how can searching for these terms provide an accurate sentiment analysis the answer is based on two principles first a linguist named George Zipp created zips law second the principle of least effort also helps support using small lexicons zips law states that given some text the frequency of any single word is 
inversely proportional to its rank in the frequency table more simply if you counted up the word frequency in a passage the second word would appear about half as much 50% as the first one over the words place on the list or one over two similarly the third most frequent word could appear 1/3 as much as the first and so on 1 over 3 because it's the third term in the list zips law can be observed outside of language too here is a table of US city populations zips law can be observed in our language the way we settle cities and sometimes even in industry market shares the other principle at play here is called the principle of least effort from library sciences a speaker or writer doesn't want to exert a lot of effort when communicating while the audience doesn't want to spend a lot of energy interpreting so the word choice or lexical diversity becomes limited what does all this mean people are lazy at least when it comes to expressing themselves they may know tens of thousands of words but really only consistently use a few to express meaning this works out well for a subjectivity lexicon because it means the word lists don't have to be so long in this section you will create a visual demonstrating zips law on 3 million tweets and then apply cue taps polarity function to some actual text to see a subjectivity lexicon in actionboom a little refresher and a polarity visualization pretty good start many sentiment analysis methods use a subjectivity lexicon let's learn what a subjectivity lexicon is and why it works the polarity function you just applied uses a subjectivity lexicon a subjectivity lexicon is a predefined list of words associated with a specific emotion or positive or negative feelings for example the words bad awful and terrible can all reasonably be associated with a negative state in contrast perfect or ideal can be connected with positivity in some cases sentiment analysis is merely the comparison between the author's text and the predefined 
subjectivity lexicon the visual you made is based on a subjectivity lexicon compared to some fictitious text more on that later for now focus on the subjectivity lexicon and why it works in this course we primarily work with cue taps polarity function which uses in academic lexicon from the University of Illinois Chicago plus we also spend a lot of time with the tidy text sentiments table made of three different lexicons there is even in our package called lexicon that has a bunch of subjectivity and other word lists while we use the standard ones from cue Daffy and tidy text it's worth exploring the many word list from the lexicon library plus you should always think about adjusting your lexicons since it's unlikely to be accurate for your exact need so now you may be saying to yourself little lists of words aren't going to help me with my text because people are so expressive in fact you may know more than fifty thousand words and most subjectivity lack lexicons are only a few thousand how can searching for these terms provide an accurate sentiment analysis the answer is based on two principles first a linguist named George Zipp created zips law second the principle of least effort also helps support using small lexicons zips law states that given some text the frequency of any single word is inversely proportional to its rank in the frequency table more simply if you counted up the word frequency in a passage the second word would appear about half as much 50% as the first one over the words place on the list or one over two similarly the third most frequent word could appear 1/3 as much as the first and so on 1 over 3 because it's the third term in the list zips law can be observed outside of language too here is a table of US city populations zips law can be observed in our language the way we settle cities and sometimes even in industry market shares the other principle at play here is called the principle of least effort from library sciences a speaker or 
writer doesn't want to exert a lot of effort when communicating while the audience doesn't want to spend a lot of energy interpreting so the word choice or lexical diversity becomes limited what does all this mean people are lazy at least when it comes to expressing themselves they may know tens of thousands of words but really only consistently use a few to express meaning this works out well for a subjectivity lexicon because it means the word lists don't have to be so long in this section you will create a visual demonstrating zips law on 3 million tweets and then apply cue taps polarity function to some actual text to see a subjectivity lexicon in action\n"