How Search Engines Treat Data - Computerphile

**The Concept of "Run" and Its Significance in Search Engines**

When you search for "run", you usually also want the documents about "running", so search engines merge these variants down to a single stem, "run", and search for that concept every time, regardless of which form the user types. Stemming and counting alone, however, can lead to a situation where a document is deemed relevant not because it actually contains information related to your query, but simply because it contains the word "run" many times. To address this issue, search engines also evaluate the relevance of documents based on the proximity of keywords, rather than just counting their occurrences.
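As an illustration, stemming can be sketched with a crude suffix-stripper. This is a deliberately tiny example with an invented suffix list; a real engine would use a proper algorithm such as the Porter stemmer:

```python
def crude_stem(word):
    """Collapse inflected forms onto a shared root so that "run",
    "runs" and "running" all map to the same index term. A toy
    sketch, not a real stemming algorithm."""
    word = word.lower()
    for suffix in ("ning", "ing", "s"):
        # Only strip when a plausible root (3+ letters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("running"))  # -> "run"
print(crude_stem("runs"))     # -> "run"
print(crude_stem("horses"))   # -> "horse"
```

With all variants collapsed to one term, the index stores counts for "run" alone, and a query for any form retrieves the same documents.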

For instance, imagine you're searching for "my horse" and you come across a document containing the sentence "my horse is very nice." Because the two query words appear right next to each other, the document is probably about the searcher's own horse; if "my" and "horse" only ever appear far apart, the page might really be about my cat and someone else's horse. Accompanying words provide further evidence: if the same document also mentions terms like "pet" or "show horse," which are closely related to horses, the engine can infer that the document is relevant to the query. Similarly, when searching for information about running, words that often accompany the concept, such as "running shoes" or "running track," help to narrow down the search results.
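The proximity idea can be sketched as a simple bonus-point scheme: each time the two query terms occur within a few words of each other, the document earns an extra point on top of its base score. The window size of 2 here is an arbitrary choice for the example:

```python
def proximity_bonus(tokens, term_a, term_b, window=2):
    """Award one bonus point for each pair of positions where the
    two query terms fall within `window` words of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return sum(1 for a in pos_a for b in pos_b if abs(a - b) <= window)

near = "my horse is a nice horse".split()
far = "my cat likes the neighbour's horse".split()
print(proximity_bonus(near, "my", "horse"))  # -> 1
print(proximity_bonus(far, "my", "horse"))   # -> 0
```

Documents where the query terms cluster together ("my horse") outscore those where they are scattered ("my cat likes the neighbour's horse"), matching the intuition above.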

**Latent Semantic Analysis and Conceptual Similarity**

Another approach that search engines use is latent semantic analysis (LSA), which looks for conceptual structure in the index itself. By examining how frequently words are used together, LSA can identify relationships between terms that are not apparent from a simple keyword match. For example, if "pony" tends to appear alongside the same words as "horse," such as "field," "show," or "pet," the two terms are probably conceptually related and can be considered together when evaluating relevance.
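The intuition can be sketched by comparing the contexts two terms appear in. This is only the co-occurrence idea behind LSA, not LSA itself (which factorises a term-document matrix with singular value decomposition), and the toy documents are invented for the example:

```python
from collections import Counter
from math import sqrt

def context_vector(term, docs):
    """Count the words that share a document with `term` -- a crude
    stand-in for a row of the co-occurrence matrix that latent
    semantic analysis would factorise."""
    vec = Counter()
    for doc in docs:
        words = doc.lower().split()
        if term in words:
            vec.update(w for w in words if w != term)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["my pony in the field", "my horse in the field",
        "show pony with my pet", "show horse with my pet"]
similarity = cosine(context_vector("pony", docs), context_vector("horse", docs))
# "pony" and "horse" occur in identical contexts in this toy corpus,
# so an engine could justifiably merge them into one concept.
```

A high similarity between the two context vectors is the signal for merging the terms into a single "super concept" in the index.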

To implement this, search engines analyze their index for patterns and relationships between words, so that terms which frequently appear together can be weighted as a single concept rather than matched only as separate keywords. This approach has proved effective in improving the accuracy of search results, particularly for ambiguous queries and for queries that rely heavily on contextual cues.

**Probabilistic Models and Language Understanding**

Beyond simple counts, search engines also adopted probabilistic models to evaluate the relevance of documents. These models estimate the probability that a document is about the query terms, rather than just tallying occurrences. This lets them weigh multiple factors beyond exact keyword matches, such as the frequency and context in which words appear.
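One simple probabilistic formulation is the query-likelihood model: score each document by the probability that a language model built from it would generate the query. This minimal sketch uses a unigram model with add-one smoothing and invented example documents; production systems use more refined smoothing and richer models:

```python
from collections import Counter
from math import log

def query_log_likelihood(query, document, vocab_size=1000, alpha=1.0):
    """Log P(query | document) under a unigram language model with
    add-alpha smoothing, so unseen terms get a small non-zero
    probability instead of zeroing out the whole document."""
    counts = Counter(document.lower().split())
    total = sum(counts.values())
    score = 0.0
    for term in query.lower().split():
        p = (counts[term] + alpha) / (total + alpha * vocab_size)
        score += log(p)
    return score

about_horses = "my horse is a lovely show horse"
about_cars = "my car is a fast red car"
# The horse document scores higher for the query "my horse".
```

Because scores are log-probabilities, documents can be ranked by summing per-term contributions, and term frequency and document length both influence the result naturally.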

For instance, if a word like "running" appears with phrases like "running shoes" or "running track," it is likely to be more relevant than the same word appearing in isolation. Similarly, when searching for information about horses, a word like "horse" that appears with words like "pet horse" or "show horse" is more likely to be relevant than one that appears with completely unrelated words.

**The Role of Search Engine Algorithms and Document Design**

While keyword matching and LSA are important factors in search engine algorithms, they are not the only considerations. Modern search engines use a range of techniques and metrics to evaluate document relevance, including probabilistic models and language understanding. By analyzing how web pages are designed and structured, search engines can better understand the context and intent behind queries.

For example, the structure of a page, including its title, headings, and the text of links pointing to it, carries signals about its subject matter. By combining these structural cues with content analysis, search engines can provide more accurate and relevant results for users.

**Egoless Programming and Open-Source Collaboration**

As the web grew in popularity and complexity, shared knowledge and expertise became increasingly important. The concept of "egoless programming," popularized by Gerald Weinberg in the early 1970s, captures the idea of giving up personal ownership and pride in one's work so that others can improve it: "I wrote the code, but if somebody else can do it better, go ahead." It has been described as a precursor of open source.

Egoless programming has become a key aspect of open-source software development, where multiple contributors contribute to a project without seeking individual credit or recognition. By working together towards a common goal, developers can pool their expertise and knowledge to create more robust and effective solutions. Similarly, in the context of search engines, collaborative efforts between researchers, developers, and experts have led to significant advances in algorithmic development and the improvement of search engine performance.

**Conclusion**

In conclusion, handling a word as simple as "run" turns out to be a surprisingly complex problem that requires weighing multiple factors. By analyzing keyword proximity and leveraging latent semantic analysis, probabilistic models, and language understanding, search engines can return more accurate and relevant results. And by incorporating insights from document design and structure, and through collaboration between researchers and developers, they can continue to improve the experience they offer users.

**Transcript**

How does a search engine work? That's a hard question with a long answer. It's hard because Google doesn't really want you to know how it works: whenever they explain it, spammers try to use that knowledge to push their own results higher and higher up the list, so they don't want us to know. And it's a long answer because there have been 60 years of research into information retrieval, how to find documents on computers, and before that there were hundreds of years of research into libraries, into how you index a library, a library being that place with real physical books in it.

What's the link to libraries?

A library is a big place full of books you want to find, and there's no way you should have to search the whole building to find them, so you create an index: you read through the index and go, "okay, I need to go to the fifth floor, it's over there." Then, when we started searching through computers, a computer was a big digital space with lots of files in it, and to know where to look for the file we wanted, again we create an index that says "go over there, into that folder." And when it comes to search engines, the web is a massive digital place, and to know where in the web we should look for the things we want to find, again they create an index. So the history of how things were done in libraries affects how we search computers, and that affects how we search the web.

So Google is a huge Dewey Decimal system, is that right?

It works a bit differently, but there have been Dewey-Decimal-type approaches for different things, like category browsing rather than web searching.

Let's go back to the 60s and 70s, when people were asking how we can quickly search through files on computers. Imagine we have five documents, five files we're searching through to find out which one is most about someone talking about their horse. We've searched for the words "my horse", hoping there are people talking about their horses, and the five documents contain those words different numbers of times. Document one has the word "my" 25 times; document two has "my" five times; neither mentions the word "horse". Document three has "my" and "horse" ten times each; document four has 18 "my"s and five "horse"s; and document five is only about the word "my", not about horses.

The first thing you would do is term frequency, TF: you just measure how many times each word is mentioned. So the documents get scores of 25, 5, 20 and so on, which is fine in principle, and the idea makes sense, but in this set the word "my" is overwhelming everything. In fact document one comes out most relevant, with a score of 25, because it has "my" more times than document three has "my" and "horse" put together. That's a problem: it is not the document most about horses, but it is the one with the highest score.

The next most obvious step, many people would say, is to require a document to contain both words before giving it any score at all. That makes a bit more sense: now only two documents get a good score, and they both mention the word "horse", which is great. But another approach, rather than settling on that bit of logic as the best way of doing it, is to undermine how important the word "my" is. "My" is in every document, so it doesn't really help us with anything. If we divide the score that the word "my" produced for each document by the number of documents it's in, we suddenly have a much lower score for "my"; it has less of an impact. We call this inverse document frequency, and we do it for every search term: "my" divided by five, "horse" divided by two, because it's only in two documents.

This means the document that's really about "my" but not about horses only gets a score of five; the next one goes down to one, because "my" is in all five documents; this one is cut down to seven; this one is cut down as well, ending up at 6.1; and this one goes right down to three, because "my" is not important. This works really well, because now the two documents about "my" and "horse" get bigger scores and the ones that were just about "my" get much lower scores, yet we're still open to the possibility that a really important document mentioning horses only once might come up, rather than cutting things out entirely. If a word is in lots of documents, we simply care about it less, and the score it creates has less influence on which document we choose as most relevant. The other good thing this has done is add slightly more weight to the word "horse": the fact that "horse" is mentioned more in document three now gives it a higher score than document four, which wasn't the case before. So it has prioritised the document that is more about horses, as well as undermining all the "my"s.

So what a search engine has now done is found all the documents, counted how many times each word occurs in each of them, given each document a score, and cut that score down by the number of documents each word is in, so that an unimportant word isn't too influential. This is actually really easy; you can do it in 50 lines of Python. It's just not very fast. So the question is how you make it faster, and the answer is that they build the index we mentioned earlier: a file with all of these numbers pre-computed. It records how many times every word they know about occurs in document one, and so on, all in a table ready to access. Then, every time you search, they don't have to read all the documents again and count the words again, which is the slowest bit; they just look up the pre-calculated numbers. So the index is an important part of keeping track of all the documents and which words are in them, and it holds all the numbers we've calculated so far: how many documents the word "horse" is in, which is two, and how many times it occurs in each of them.

Because this approach is pretty simple, there are a number of things wrong with it, and a number of things we can do to make it better. One major problem, obviously, is really long documents: this one might be really long, which is why it has the word "my" so many times, and long documents will likely get a bigger score than all the rest just because they're longer. So instead you can calculate the proportion of the document that is about the horse, rather than the exact number of mentions, and then you're comparing every document on an equal footing rather than letting big documents skew things.

Another approach is stop words. Stop words say there's no point even searching for a word like "the", because it's in everything; the same goes for "to" and "and". They're going to get a low score whatever you do, so let's not even bother checking for them, because that saves time.

You can also look at stemming. One major challenge is that you want to find documents about running, but you search for the word "run" and you want "running" to come back, or vice versa: if you search for "running", you want all the ones about "run" in there too. So you merge all of those forms down to the word "run", and then you search for that concept every time, regardless of whether the user typed "run" or "running".

You might also want to know how near the words are to each other. It might well be that this document, which scored 10 for "my" and 10 for "horse", is not really about my horse: it's about my cat and someone else's horse. So you give a boost to the scores if the words happen to be near each other. Say every time "my" and "horse" come together you give a bonus point. If they only came together once, "my horse", that's one bonus point, and the score goes up to eight. But if every time "horse" was mentioned it came with "my", you'd get five bonus points, and that score is now 11.1, and suddenly this is the more relevant document, because it said "my horse" together, not just at random times. This means documents that say "my horse" get a bigger score than documents that say "my lovely horse", which get a bigger score than documents that say "my friend looks like a horse", because the words are further and further apart, and so less and less joined together in the way you want.

The final thing some search engines might do is start to search for concepts rather than specific words. A document in which the word "pony" comes up regularly, "my pony, my pony, my ponies", is rather like one that says "my horse, my horse, my horse", and might be more relevant than one that only said "my horse" once. Suppose we had a document six, "my pony", with "my" six times and "pony" six times. When you divide by the number of documents, "my" only gets a small score, so its TF-IDF comes to 4.2, but because it said "my pony" together every time, it gets bonus points, and you end up with a score that actually makes it more relevant than the document that said "my horse" once: it said "my pony" six times, while that one said "my horse" five times. So you're getting scores that promote words which are similar but not exactly the same.

To find words which are similar to each other, to bring "pony" in rather than just "horse", what they need to do is something called latent semantic analysis, or other ways of looking back through the index. What they say is: the index entry for "pony" looks a lot like the index entry for "horse". Every time "pony" comes up it's with "field", and "horse" is always with "field"; "pony" is always with "show", and "horse" is always with "show"; "pony" is always with "pet", and "horse" is always with "pet". These seem to be very similar concepts in our index, so let's just merge them together into a super-concept.

Does that have flaws, pitfalls?

Yes, there are many opportunities for that to go wrong. The way I've explained it is a simple way of describing latent semantic analysis, but there are many more advanced ways of deciding how much similarity you need before two words are considered probably related, and there are other ways to decide they're properly related, for example that they're similar words in a dictionary or some other thesaurus that puts them close together.

Does a human ever get involved in this, to check things at all?

There are also semantic indexes, where people have said "these words are conceptually related", and you can use those to help you understand which concepts go together.

This TF-IDF analysis of documents has been a major stage in the 60-year history of information retrieval, and there's a lot more I could tell you about how it's been improved. People started calculating probabilities that a document is about a word, rather than counts, so there's a whole probabilistic approach, and there are much more complicated models of language, but we could do whole videos about those two topics. What we're really interested in now is how a search engine uses this type of algorithm, whether it's the only thing they use, and what other metrics they use. So where we're going next is analysing how search engines work and how they can benefit from the way web pages are designed.
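The divide-by-document-frequency scoring walked through above can be reproduced in a few lines of Python. This uses the simple divide-by-DF variant described in the transcript, not the usual logarithmic IDF; document five's count of 15 is inferred from its stated score of three (3 × 5 documents):

```python
def tf_idf_scores(query_terms, docs):
    """Score each document as the sum, over the query terms, of
    term frequency divided by document frequency -- the simple
    variant described in the transcript."""
    df = {t: sum(1 for d in docs if d.get(t, 0) > 0) for t in query_terms}
    return [sum(d.get(t, 0) / df[t] for t in query_terms if df[t])
            for d in docs]

# The five documents from the worked example: counts of "my" and "horse".
docs = [{"my": 25},               # only "my"
        {"my": 5},                # only "my"
        {"my": 10, "horse": 10},  # both words
        {"my": 18, "horse": 5},   # both words
        {"my": 15}]               # only "my"
scores = tf_idf_scores(["my", "horse"], docs)
print([round(s, 1) for s in scores])  # -> [5.0, 1.0, 7.0, 6.1, 3.0]
```

The output matches the transcript's numbers: document three now wins with 7, document four follows at 6.1, and the documents that only say "my" drop to 5, 1, and 3.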