Vectoring Words (Word Embeddings) - Computerphile

The Power of Word Embeddings: Uncovering Hidden Relationships in Language

The concept of word embeddings has revolutionized the way we approach natural language processing (NLP). By representing words as vectors in a high-dimensional space, these techniques let machines capture subtle relationships and patterns in language that simpler representations, such as raw characters or dictionary indices, cannot express. In this article, we'll delve into the world of word embeddings, exploring their applications, advantages, and limitations.

A Simple yet Surprising Technique
=================================

One of the most popular word embedding techniques is the Word2Vec algorithm, which has been widely adopted for its simplicity and effectiveness. The basic idea is to represent each word as a vector in a high-dimensional space so that words used in similar contexts end up close together. This is achieved by training a small neural network on a large corpus of text to predict the words that appear near a given word; the network's hidden layer, a few hundred units wide (300 in the model used here), becomes the word's vector. By analyzing the relationships between these vectors, we can extract meaningful insights into the structure of language.
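The demo described later in this article runs in Google Colab on pretrained vectors built from Google News text, with 300 dimensions per word. As a minimal sketch, assuming the `gensim` library and its bundled downloader (the model name below is gensim's identifier for those vectors; the loading step itself isn't shown in the video), loading them looks roughly like this:

```python
# Minimal sketch: load pretrained 300-dimensional Word2Vec vectors (Google News).
# Assumes gensim is installed; the downloader fetches a large file on first use.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # returns a KeyedVectors object

# Every word in the vocabulary maps to a vector of 300 numbers.
print(model["cat"].shape)  # (300,)
```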

Measuring Distance and Similarity
---------------------------------

To validate the embeddings, let's measure the distance between a few word pairs. We'll use the model's `distance` method, which returns the cosine distance between two vectors (one minus the cosine similarity, so a smaller value means the words are more alike). Comparing a couple of pairs in the demo:

* "dog" and "cat": a small distance (roughly 0.23)

* "car" and "cat": a noticeably larger distance

As expected, "dog" and "cat" sit closer together than "car" and "cat" do, showing that the embeddings capture semantic similarity rather than similarity of spelling.
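A minimal sketch of that check, assuming the gensim `KeyedVectors` model loaded earlier (in gensim, `distance` is one minus the cosine similarity):

```python
# Cosine distance: 1 - cosine similarity, so lower means more similar.
print(model.distance("dog", "cat"))    # small distance: similar contexts
print(model.distance("car", "cat"))    # larger distance: less related
print(model.similarity("dog", "cat"))  # the corresponding similarity score
```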

Finding Similar Words
---------------------

Another useful application of Word2Vec is finding the words most similar to a given input word. By ranking every word in the vocabulary by its cosine similarity to the input, we can list its nearest neighbors. For example, querying "cat" in the demo returns:

* Most similar word: "cats"

* Next: "dog", "kitten", and "feline"

* Followed by: "beagle", "puppy", "pup", "pet", and so on

A single lookup against the embedding space therefore surfaces genuinely meaningful relationships between words.
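With gensim this is a single call; a short sketch assuming the same model as above:

```python
# Nearest neighbors of "cat", ranked by cosine similarity.
for word, score in model.most_similar("cat", topn=10):
    print(f"{word}\t{score:.3f}")
# In the video this lists cats, dog, kitten, feline, beagle, puppy, ...
```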

Uncovering Gender Bias in Language
----------------------------------

One of the most interesting applications of Word2Vec is how clearly it surfaces the way gender is encoded in language. If we subtract the vector for "man" from the vector for "king" and add the vector for "woman", the nearest word to the resulting vector is "queen". The same offset maps "boy" to "girl", "father" to "mother", and "shirt" to "blouse", reflecting gendered associations present in the training text; the same technique can be used to expose less benign biases baked into a corpus.
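gensim exposes exactly this vector arithmetic through `most_similar` with `positive` and `negative` word lists; a sketch assuming the same model as above:

```python
# king - man + woman  ->  queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The same gender direction applied to other words (the video got girl and blouse):
print(model.most_similar(positive=["boy", "woman"], negative=["man"], topn=1))
print(model.most_similar(positive=["shirt", "woman"], negative=["man"], topn=1))
```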

Exploring Directional Relationships
-----------------------------------

The same vector arithmetic uncovers other directions in the embedding space. Taking the vector for "London", subtracting "England", and adding "Japan" lands near "Tokyo". Applying that same offset to "USA" gives "New York", and to "Australia" gives "Sydney", so the direction the model has learned is closer to "country to largest city" than "country to capital". Nobody specified this relationship; it emerged, entirely unsupervised, from a huge collection of news articles.
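The same call handles the country-to-city offset; a sketch assuming the same model as above and that the capitalized forms of these proper nouns are in its vocabulary (exact outputs depend on the pretrained vectors):

```python
# London - England + Japan  ->  Tokyo: roughly a "country -> biggest city" direction
print(model.most_similar(positive=["London", "Japan"], negative=["England"], topn=3))

# Applied to other countries (the video got New York for USA and Sydney for
# Australia, i.e. the largest city rather than the capital):
print(model.most_similar(positive=["London", "USA"], negative=["England"], topn=3))
print(model.most_similar(positive=["London", "Australia"], negative=["England"], topn=3))
```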

A Final Example: Foxes and Santa
--------------------------------

To finish with something more playful, we can take the direction from "pig" to "oink" (an animal-to-the-sound-it-makes offset) and apply it to other words:

* "cow" gives "moo"

* "cat" gives "meowing"

* "dog" gives a near miss ("close enough for me", as the video puts it)

* "santa" gives "ho ho"

* "fox" gives "phoebe", with "chittering" as the runner-up, so the model has no confident answer for what the fox says

These offbeat results show both the power and the limits of the technique: the directions it learns reflect whatever patterns happen to be in its training data. A short code sketch of the experiment follows.
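Here is a minimal sketch of that experiment, assuming the same model as above and gensim 4.x (where `key_to_index` holds the vocabulary):

```python
# Apply the pig -> oink direction ("what sound does it make?") to other words.
for word in ["cow", "cat", "dog", "santa", "fox"]:
    if word not in model.key_to_index:  # skip anything missing from the vocabulary
        print(word, "-> not in vocabulary")
        continue
    print(word, "->", model.most_similar(positive=["oink", word], negative=["pig"], topn=2))
```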

Conclusion
----------

Word embeddings have revolutionized the way we approach NLP, enabling machines to capture subtle relationships and patterns in language. By analyzing the vectors for different words, we can extract meaningful insights into the structure of language. While there are limitations to these techniques, they offer a powerful tool for exploring the complexities of human language. Whether you're interested in uncovering hidden biases or simply exploring the vast possibilities of word embeddings, this article has demonstrated just how fascinating and useful these techniques can be.