The Future of Knowledge Graphs: Amazon's Approach and Scaling Down to Personal and Small Business Levels
In recent years, knowledge graphs have become an essential tool for companies like Amazon to organize the vast amount of information about products and related data. In recommendations, for example, structured knowledge lets the system surface products that are mostly similar to the one you are viewing but differ in specific attributes: a different model, different memory, more accessories, and so on. Both search and recommendation are ways to help people discover products.
For many products at Amazon, when you look at the detail page you can see structured information about particular properties of the product; that is this kind of structured knowledge. You can also see comparison tables, and we are experimenting with how to generate better comparisons using the structured knowledge.

You're thinking about knowledge graphs and the kinds of things you're doing at large scale. How would you scale that down for someone who wanted to start exploring this area, perhaps at the personal level, or at a much smaller business than Amazon?
To scale it down, providing the tooling and generic technologies is key. Before I worked on the product domain, I didn't realize how different it could be from general-purpose knowledge graphs like Google's or Bing's. Similarly, when people talk about medical or biological information, I see their unique challenges. If we are able to build a set of tools that helps people acquire knowledge for particular domains, that will help small businesses. At this moment, the well-known knowledge graphs mostly come from big companies.
One thing I'm hoping we will be able to develop is services or tooling that help small businesses build their own knowledge graphs, and in that domain, as I said, generic techniques will be the key.

Are there tools that come to mind? A lot of what you think of in terms of relationships could start as simply as a relational database: you relate different products or entities, and each has different attributes.
But there is certainly a lot more you might want to do. Is a relational database the natural starting place, or are there knowledge graph tools for someone who wants to build a graph, not necessarily subscribe to someone else's graph?

I would say there are three levels. The first level is the technology to store the knowledge graph. As you said, a relational database actually has exactly the same expressive power as a graph: anything we can describe in a graph we can store in relational databases, and vice versa.
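As a minimal sketch of that equivalence, a knowledge graph can live in a single relational `triples` table, with graph traversal expressed as a self-join. The entity and predicate names below are illustrative, not from any real product catalog:

```python
import sqlite3

# A knowledge graph stored relationally: one row per (subject, predicate, object) edge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("star_wars", "type", "movie"),
        ("star_wars", "directed_by", "george_lucas"),
        ("george_lucas", "type", "person"),
    ],
)

# A two-hop traversal ("who directed star_wars, and what kind of entity is that?")
# becomes a self-join on the triples table.
row = conn.execute(
    """
    SELECT t1.object, t2.object
    FROM triples AS t1
    JOIN triples AS t2 ON t2.subject = t1.object
    WHERE t1.subject = 'star_wars'
      AND t1.predicate = 'directed_by'
      AND t2.predicate = 'type'
    """
).fetchone()
print(row)  # -> ('george_lucas', 'person')
```

Dedicated graph databases mainly make such multi-hop queries more convenient and efficient; the expressive power is the same.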
In addition, at AWS there are graph database tools as well. So that's the first layer: how to store it. The second layer is the tools that help people actually build their graph, for example web extraction tools, or linkage tools: when I have two different databases about movies, how do I know that two records describe the same movie, or that two people are the same person? I see such tools at different levels, and some of them are provided by AWS as well, but we need to integrate them into a tool set that covers all of the different techniques in the knowledge graph pipeline.
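The linkage problem mentioned above can be sketched in a few lines. This is a toy heuristic, not any particular tool's algorithm: fuzzy title similarity plus an exact year match, with an illustrative threshold that real systems would learn from data. The records are invented examples:

```python
from difflib import SequenceMatcher

def same_movie(a, b, threshold=0.85):
    """Guess whether two records from different databases describe the same movie.

    Combines fuzzy title similarity with an exact release-year match.
    The threshold is illustrative; production linkage systems learn it from labeled data.
    """
    title_sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return title_sim >= threshold and a["year"] == b["year"]

# Two databases spelling the same movie slightly differently.
rec_a = {"title": "Star Wars: Episode IV - A New Hope", "year": 1977}
rec_b = {"title": "Star Wars Episode IV: A New Hope", "year": 1977}
print(same_movie(rec_a, rec_b))  # -> True
```

The same idea extends to people, products, and any other entity type; the hard part in practice is choosing features and thresholds per domain, which is exactly where generic tooling helps.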
There is actually a third level. At this moment each company owns its own knowledge graph, yet there is common knowledge that belongs to everyone, and that would be a good set to start with for building small, specialized knowledge graphs. If one day that knowledge could be offered as a service by small business owners and small companies, that would be good as well.
Are you aware of any efforts to standardize some kind of interface between knowledge graphs, federated knowledge graphs, so that Amazon's could hook in, Google's could hook in, my little company's could hook in, and one API would give a broader view of the world?

I'm aware of such efforts from the research community. I think there is one called Open Knowledge; I know a few people involved from schema.org, and it is possibly even getting government funding to try to hook up different knowledge graphs.
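One building block such efforts already rely on is a shared vocabulary: schema.org defines common types so independently built graphs at least speak the same language. As a hedged illustration, with entirely invented values, a small business could publish a product record as schema.org JSON-LD:

```python
import json

# A hypothetical product expressed in the shared schema.org vocabulary, so any
# consumer that understands schema.org can merge it into a larger graph.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Stainless Steel Water Bottle",
    "brand": {"@type": "Brand", "name": "ExampleBrand"},
    "color": "Silver",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
    },
}
print(json.dumps(product, indent=2))
```

A shared vocabulary is necessary but not sufficient for federation; entity resolution across graphs (is my bottle the same as yours?) remains the hard part.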
I don't see that much appetite among the companies, and I can understand the reason: it takes a huge effort to build such a knowledge graph, and the data is valuable; they're not sure they want to share it just yet. In a sense, we shouldn't get the data for free, because otherwise people wouldn't have the motivation to work on data.

Got it. Well, Luna, thanks so much for taking the time to share a bit about what you've been working on and to talk about knowledge graphs.

Thank you very much. I appreciate all of these insightful questions.
All right everyone, I am here with Luna Dong. Luna is a senior principal scientist at Amazon working on product knowledge graphs. Luna, welcome to the TWIML AI Podcast.

Thank you. Nice to meet you, Sam.

It's great to meet you, and I'm looking forward to our chat. Let's get started by having you introduce yourself to our audience. Tell us a little bit about your background and how you came to work in machine learning.

Sure. I'm Luna, and I work for Amazon. How I came to machine learning is an interesting question, and it reminds me of my advisor when I was a PhD student at UW. I often heard him say, "I'm from the AI community, and I came to databases through the back door." As you can imagine, I got my PhD in the database field, where I was active for a long time, and now I'm coming to machine learning through the back door as well, so my advisor and I have made a circle. My PhD topic was data integration: how we can seamlessly collect data from many different sources and integrate it together. Then in 2012 Google launched its Knowledge Graph, and since then knowledge graphs have been a very popular concept; big companies and universities have put a lot of effort into them. A knowledge graph takes data from many different sources and puts it into one graph, so I have been working on knowledge graphs for about nine years now. When you build a knowledge graph you need technology from all different fields: natural language processing, because you need to understand text; image processing, because you also want to get knowledge from images; data mining, to mine the data from text and from graphs; and certainly databases, because you want to integrate the data, clean it up, and end up with high-quality data. To build a great knowledge graph you need all of these technologies, and that's how I came to machine learning, because machine learning is the core of all of these fields.

Interesting. When you describe your PhD work, it makes me think of a challenge we've been chasing for the past ten, twenty, maybe even more years, which I think of as enterprise information integration: creating either a layer on top of all the data to make it more easily accessible, or some centralized thing that sits on top of all the information within an organization. It's interesting to think of a knowledge graph playing that role for many organizations. When you think of knowledge graphs, and in particular product knowledge graphs, what are all of the things that go into making a robust knowledge graph?

Sure, this is a great question. A knowledge graph basically tries to mimic how human beings look at the real world. Before we are able to read and write, we already understand the real world: to a little kid there are Mom, Daddy, the doggy, my house, and that other house next to my house. Before any language, there are entities and the relationships between the entities; that is how human beings understand the real world, and the knowledge graph tries to capture that. So what is a good knowledge graph? It certainly describes all of the entities and relationships, and I would say there are three key features. The first is that, because it is entities and relationships, it is structured data: not just large paragraphs of text, but the entities, the properties of the entities, and the relationships between them. The second is that it is really high-quality data. That means it is rich data, because ideally you want to know everything about the world; it is clean data, with no mistakes, so you can fully trust the knowledge graph as an authority; and it is canonicalized. Take me as an example: my official name is Xin Dong, and people know me as Luna, but that doesn't matter, it's the same person. Over the years how I look may differ slightly, and I have moved between companies for my job, but it's still the same person, and in the knowledge graph there will be one entity representing me, not five different entities. So the second big feature is that it is rich, clean, canonicalized, high-quality data. The third feature is that the data are connected. You connect the data about businesses, movies, music, universities, and products into one single knowledge graph, and then you know that this movie star and that song artist are actually the same person, and that this t-shirt with Darth Vader on it shows a character from the movie Star Wars. Everything is connected, so we can reason about it. These are all important features of a great knowledge graph, and because the bar is so high, we need to explore many different technologies to meet it.

When you say high-quality and clean data, I think of curation, and that often has the connotation of human curation. I imagine this is one of the areas where you might want to apply machine learning, at least at the scale of a product database, or even a smaller one like movies.

Yes, that's a great point. There are so many products, and there are so many entities in the world.
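The three features above, structured, canonicalized, and connected, can be sketched with a handful of triples. The identifiers below are invented for illustration:

```python
# Canonicalization: many surface forms resolve to one canonical entity ID.
alias_to_canonical = {
    "Luna Dong": "person:xin_luna_dong",
    "Xin Dong": "person:xin_luna_dong",
}

# Structure: the graph is a set of (subject, predicate, object) triples.
triples = [
    ("person:xin_luna_dong", "works_on", "topic:product_knowledge_graph"),
    ("product:darth_vader_tshirt", "depicts", "character:darth_vader"),
    ("character:darth_vader", "appears_in", "movie:star_wars"),
]

def neighbors(entity):
    """Follow outgoing edges from an entity -- the 'connected' part of the graph."""
    return [(p, o) for s, p, o in triples if s == entity]

# Connectedness lets us reason across hops: t-shirt -> character -> movie.
character = neighbors("product:darth_vader_tshirt")[0][1]
movie = neighbors(character)[0][1]
print(movie)  # -> movie:star_wars
```

Real knowledge graphs differ in scale, not in kind: billions of such triples, with the canonicalization and edges learned rather than hand-written.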
Actually, just using the product domain as an example: we have billions of products, and every day there are changes to millions of products. Manually curating everything is simply impossible. And that's just individual products; guess how many different product types there are. It's not hundreds, it's not thousands, it is close to a million, depending on the granularity at which we want to model, and each type has its own unique product properties. With so many products, curation by hand is impossible, and that's why machine learning plays a critical role in scaling up.

Is that curation element the only place where machine learning comes into constructing a knowledge graph, or are there other areas where you might want to use ML?

That's a good question. Curation is one place where machine learning plays an important role. Another is that, once we have this huge knowledge graph, we have to put it into real applications. We want to help people easily search for the knowledge, which is where search and natural language processing techniques are critical. We certainly want to answer questions using the knowledge graph. And we want to use the structured, rich knowledge to make recommendations, and to explain why we make those recommendations. That's another place where machine learning plays an important role.

Nice. You've talked a bit about the high-level challenges of a knowledge base, particularly as it applies to products. Can we take a step back and have you share a bit about Amazon's efforts to build out a product knowledge graph? How long has this been going on, and what have been the major steps and milestones along the way? Did the product knowledge graph exist when you started at Amazon, or were you part of the team that created it?

I'm part of the team that initiated this effort. I joined Amazon four-plus years ago, and that's when we started this project. The effort covers multiple areas of application. Amazon is a huge company, and one kind of product knowledge graph we are building is a media knowledge graph: books, music, movies, and possibly even podcasts. Another kind is retail products: electronics, furniture, clothes, the things you put in your kitchen and your bathroom. Those require a different set of techniques. In addition, the web is a huge source of knowledge, so one big pillar of this project is web knowledge extraction: collecting external information, for example from brand websites, to support the product knowledge graph as well as Alexa. Let me say a little more about why I separate media products and retail products. They are all products, but the data behind them are very different. For media, the publishers of music, movies, and books are very well trained and good at providing the meta-information: who is the director of this movie, who is the singer of this song, when was this movie released, what is the language, and so on. There is a lot of decent information from different sources, and our job is to integrate the data from those different publishers and make it seamless. I recall that this is also related to where I
was from: the data integration and database communities. For retail it is different. The retailers are not well trained to generate all of this structured information; instead, everything is in the product title, the product description, and a whole bunch of bullet points. It's not structured. So for retail products we have an extra step, where we need to pull the structured information out of the text and out of the images, and we also need to remove all of the noise that gets provided for various reasons. That's why the retail product graph gives us extra challenges.

OK, and you said what I was thinking: information extraction is a rich domain of research and practice in and of itself, particularly extraction from unstructured sources like blocks of text and images, and an area we've been working on for a long time. Are there synergies when you look at the problem in conjunction with the knowledge graph problem, or do you take off-the-shelf extraction techniques, apply them in a vacuum, and then integrate the resulting bundle of structured information into your knowledge graph?

This is a very nice question. Information extraction at the very beginning, possibly thirty years ago, started with extracting two kinds of information. One is "is-a" relationships, for example that this person is an artist. The other big kind is event information: from news articles, what are the events, and who, where, when, what, and how. It was really when knowledge graphs got popular in industry that we saw a boost for information extraction. In addition to those two kinds of information, people became interested in extracting relationships between entities. That's where people look at articles, and also at what we call semi-structured websites: if you go to Rotten Tomatoes or IMDb, the data is not big blocks of text but data in some format on the web pages. People started extracting information from those kinds of sources, to say: here is the relationship between this movie and this person, or here is this particular property of this product. So the two fields have been helping each other grow.

When you take on a new extraction project, do you also utilize the existing knowledge graph to help with the extraction, or are you extracting the information in a vacuum?

We want to use as much existing knowledge as possible. When we extract knowledge we need to train models, and the training data basically comes from two sources. One source is certainly manual labels, but it is very painful to manually label everything for different relationships, different sources, and different entity types. So naturally the other big source is existing knowledge, which we call seed knowledge. We apply weak learning: we use the seed knowledge to automatically generate the training data that helps us train the model. That is the most scalable way, because manually collecting training data is very hard. On the other hand, this is just like how human beings learn: the more you know, the faster you learn, and the more knowledgeable you will be.

When you talk about building a training data set for this extraction problem, at what level of the problem are we applying machine learning? You've talked about extracting entities from text, which is the well-known problem of named-entity extraction. Likewise, we've talked about
extracting text from images, and I've also seen work on extracting structured data from web pages in a way that's more robust to those pages changing than the usual XPath or HTML parsing. Are you applying machine learning to all of these, or to a subset?

Everything. Let me walk through the whole workflow. We start with knowledge extraction, where we extract knowledge from product descriptions and from the web. For the web, that includes text, semi-structured data, and also web tables, that is, tabular information on a web page; for products, we extract from text and from images. These are all machine learning methods developed by the NLP and computer vision communities. After we extract all of the knowledge, we try to integrate it. We decide, as I mentioned, that this "Luna Dong" and that "Xin Dong" are the same person; we decide that "is director of" and "director" are actually the same relationship, even though different websites and data sources name it differently. For this integration we also use machine learning, with techniques developed by the database, data mining, and NLP communities. After we put everything together, we try to decide whether something is wrong, and we do that by looking for inconsistencies. For example, if most of the values for a color attribute are things like red and blue and suddenly we see "vanilla flavored," we know that is not a color. Another approach is to look at the values for neighboring products: for ice cream the flavors will mainly be chocolate, coffee, mint, and vanilla, and if we suddenly see "spicy," and it's not from India, we will guess this is unlikely to be correct. We also look at inconsistency across data sources. I am always amazed at how different sources give us different information about even very popular movie stars and singers, and we need to decide: was she born on February 28th or March 28th? Which one is correct? That step is called data fusion, and all of this again uses techniques developed by the database, machine learning, and data mining communities, for example anomaly detection, to identify inconsistent information, decide that something is wrong, and remove it from the knowledge graph. One more thing: we learn embeddings, representations for every entity and relationship, and those also help us identify mistakes. In addition, the embeddings help in downstream applications: search, recommendation, and question answering, which draw on the data mining, recommendation, search, and NLP communities. So machine learning plays an important role in every single step of this pipeline. I often joke with my colleagues and students that our goal is to build the most authoritative and richest knowledge graph, and we will apply, adjust, or invent whatever technologies achieve that goal; if mechanical engineering were the way to build this knowledge graph, we would learn it. But it turns out machine learning really is the key, and not one single technique, but machine learning technologies invented by many different fields.

Sure. When you apply that last step, where you're identifying anomalies, or out-of-distribution data however you want to think of it, does that percolate up to a human in the loop, or do you have automated resolution techniques for addressing
that kind of anomaly?

A human in the loop is always needed, and for the data business it is especially important because we really want very high-quality data. With machine learning techniques, if our goal is something like 90 percent precision, meaning that when we make ten predictions nine are correct, that is achievable. But when we want to get to 99 percent, it is almost impossible without a human in the loop. The real question is how to make a smart choice about where people play a role and where machines play a role. Human beings can help in various ways. They provide manual annotations to help create the training data; even when we apply distant supervision and weak learning to collect training data automatically, that data still often originates from human input, and in the product graph domain a lot of data comes originally from the retailers. Later, as we train the model and make predictions, we need humans in the loop to tell us what is correct and what is incorrect. And from time to time we need to check what is correct and what is concerning, where domain knowledge can definitely speed things up. At this moment we are also trying to apply AutoML as much as possible to reduce the human work. Speaking of humans in the loop, the place I was most inspired by it is an Amazon fulfillment center. It's like a big park, and there is a central system that decides: I need to count the number of items in this bin; I need to send these products to those places so we can ship them out; we need to double-check whether something seems wrong here. The central system knows everything it needs to achieve, and it does its best to combine human power and machine power. Similarly, a machine learning system will eventually be a seamless integration of human power and machine power, so we can best leverage machine intelligence and human intelligence to get the best knowledge.

You mentioned pulling in information from the web at the very front end of that process. Is that targeted, in the sense that you have a set of products and URLs where information about those products might exist, and you crawl those specifically? Or do you start with a funnel applied to the big internet and identify pages that may be relevant? Broadly, how do you think about that funneling part of the process for a system at this scale?

This is a huge question, because eventually it is about how we balance the results against the resources we need. I can use products as a simple example, though the same idea may or may not apply to other domains. For products, the brand and manufacturer websites provide a lot of information, so it makes a lot of sense to figure out what the different brands are, especially the major brands, find the websites for those manufacturers and brands, and do targeted crawling. But imagine our goal is to collect all of the knowledge for music and movies: if we only do targeted crawling, we might miss information for torso and especially long-tail music and movies. That's where targeted crawling may not be the best way.

It sounds like there's another part of the process where you crawl the entire web and then relate pages to the existing knowledge graph, to see whether a given page is even relevant to something you care about.

In a sense, yes, and that's also
what Google does, because Google has access to the whole web, and it's natural to extract knowledge from all of the websites.

That suggests that targeted versus not isn't really binary, right? You've got targeted in the sense of URLs associated with a product; a level higher, brands associated with a product; a level higher still, you could just search for the product and crawl what comes back; and at the highest level, crawl everything and try to find relationships.

It's definitely not binary. There are a lot of different factors to consider, and different companies are in different positions: Google, for example, has access to the whole web, and that's a little different for Amazon.

You mentioned that you create embeddings, and I'm not sure I'm being very precise here, for the product information you get back. The example that came to mind was your earlier one about the ice cream flavors, and it prompted this idea of conditional embeddings. Is that a thing? Can you look at an embedding space and identify the distance of some set of flavors conditioned on ice cream?

That's very good insight. Spicy is a valid flavor, but if we look at the products where spicy is a flavor, and at the types of those products, they are unlikely to be ice cream. When we learn the embeddings, they capture all of these subtle relationships in a nice but implicit way. Instead of saying "ice cream should not have these five flavors," the embedding will basically say that the spicy flavor mainly applies to products of certain types, and that helps us identify mistakes, saying: this is a spicy ice cream, something seems wrong, or at least unusual.

I'm trying to think of how you create an embedding space that does that. Usually you think of embedding spaces where chocolate, vanilla, and strawberry cluster here, and American ice cream and Indian ice cream are someplace distant. How do you create a space in which Indian ice cream and spicy are close, and regular American ice cream and vanilla are close?

There are a lot of cool technologies for this, and there is a whole domain of knowledge graph embedding. The idea behind it is simple but useful. We can look at entity-relationship-entity as a triple: subject, predicate, object, or subject, verb, object. We put all of these triples together, look at all of the triples related to the spicy flavor, and learn the embeddings from those triples. Note that the embeddings propagate: when we look at all of the products related to the spicy flavor, the products themselves already carry information about their product types, encoded in their embeddings, and all of that propagates into the embedding of the spicy flavor.

What's an example of a triple in this ice cream analogy?

Some product, has flavor, spicy.

Interesting. The graph metaphor seems more intuitive for the kinds of relationships you're talking about than embeddings, but it's interesting, all the different things you can do with embeddings, including this.

Yes. Recently, I would say in the past five years, graph neural networks have been getting a lot of attention, a lot of research, and a lot of progress, and that's one
of the most effective techniques for knowledge graph because knowledge graph is a graph and then using all of this graph neural net you will be able to learn embedding for every node in the graph and then propagate the information and so the the graphs that you're building are these did these is this a research topic do these power the amazon.com that we go to today um is it you know is it one knowledge graph you know for all of products or is it some aggregation or ensemble even beneath the product level like how how are these actualized within the amazon business today sure so it is um certainly uh there are a lot of research and the science investment going on uh that enables these techniques but it's actually a production system and we generate this uh structured knowledge for a lot of products and we use that to support i would say three major kinds of applications uh the one is search so as a simple example if you search for shampoo and then you might say okay i want those uh product products for mail and then we have this search navigation and you could click a button to say i'm more interested in shampoos for mail and so that's for search it also helps us to understand the customer intent very well from the queries and then try to match it to the essence so that's the first type the second type is recommendations and using the knowledge we can hopefully tell you for example this is a product you are viewing and here are all of the other products that are mostly similar but have maybe slightly different things like the model is different uh the memory is different uh and there are more accessories so on and so forth so recommendation is the second thing both search and recommendation is a way to help people discover products and the next thing is more of a displaying information about the products so now for many many products at amazon when you look at the detail page you can see the structural information about particular properties of the products and 
That is this kind of structured knowledge. You can also see comparison tables, and we are experimenting with how we can generate better comparisons using the structured knowledge.

Great. You're thinking about knowledge graphs and the kinds of things you're doing at large scale. How would you scale that down for someone who wanted to start exploring this area, maybe at the personal level, or at a much, much smaller business than Amazon?

That's a great question. I think the key is providing the tooling and generic technologies. Before I worked on the product domain, I didn't realize how different it could be from the normal knowledge graphs, like Google's or Bing's, and similarly, when people talk about medical or biology information, I see their unique challenges too. If we are able to build a set of tools that help people extract knowledge for particular domains, that will help small businesses. At this moment, the well-known knowledge graphs mostly come from big companies. One thing I'm hoping we will be able to develop is services or tooling that help small businesses build their own knowledge graphs, and in that domain, as I said, generic techniques will be the key.

Are there things that come to mind when you describe a knowledge graph? A lot of what you think of in terms of relationships could start as simply as a relational database: you relate different products or entities, and you've got different attributes and things like that. But there's certainly a lot more you might want to do. Is that the natural starting place, or are there knowledge graph tools for someone who wants to build a graph, not necessarily subscribe to someone else's graph?

I think there are, I would say,
three levels. The first level is how to provide the technology to store the knowledge graph. As you said, a relational database actually has exactly the same expressive power as a graph: anything we can describe in a graph we can store in relational databases, and vice versa. In addition, at AWS there are graph database tools as well. So that's the first layer, how to store it. The second layer is the tools that help people actually build their graph, for example, web extraction tools or linkage tools: when I have two different databases about movies, how do I know these two movies are the same movie? How do I know these two people are the same person? I see such tools at different levels, and some are provided by AWS as well, but we need to integrate them into a big tool set that covers all of the different techniques I mentioned in this knowledge graph pipeline. And I would say there is a third level. At this moment, each company owns its own knowledge graph, but there is some common knowledge that belongs to everyone, and that would be a good set to start with for building small, specialized knowledge graphs. If one day we could have that knowledge as a service, provided by different small business owners and small companies, that would be good as well.

Are you aware of any efforts to standardize some kind of interface between knowledge graphs, federated knowledge graphs, so that Amazon's could hook in, Google's could hook in, my little company's could hook in, and give one API to a broader view of the world? Or not?

I'm aware of such efforts from the research community; I think there is one called Open Knowledge. I know a few people from schema.org, and this is possibly even getting funding from the government, trying to hook up different knowledge graphs. I don't see that much of an
appetite among the companies, and I can definitely understand the reason behind it.

Sure, it's a huge effort to build such a knowledge graph.

Sure, sure. And the data is valuable; they're not sure they want to share just yet what goes into creating it.

Exactly. Data is valuable, and in a sense we shouldn't get the data for free, because otherwise people wouldn't have the motivation to work on data.

Got it. Well, Luna, thanks so much for taking the time to share with us a bit about what you've been working on and to talk about knowledge graphs.

Thank you very much. I appreciate all of these insightful questions. Thank you.
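The point made earlier, that a relational database has the same expressive power as a graph, can be sketched with a plain triples table. This is a toy illustration assuming SQLite; the table layout, product names, and queries are invented, not any production schema, and real graph databases add indexing and query languages on top of the same idea.

```python
import sqlite3

# A knowledge graph stored as one relational table of
# (subject, predicate, object) rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("kulfi_bar", "has_flavor", "spicy"),
        ("kulfi_bar", "is_a", "ice_cream"),
        ("fudge_pop", "has_flavor", "chocolate"),
        ("fudge_pop", "is_a", "ice_cream"),
    ],
)

# Attribute lookup: which products have the flavor "spicy"?
spicy = [row[0] for row in conn.execute(
    "SELECT subject FROM triples"
    " WHERE predicate = 'has_flavor' AND object = 'spicy'"
)]

# Graph traversal becomes a self-join: products sharing a type
# with kulfi_bar (one hop out and one hop back).
similar = [row[0] for row in conn.execute(
    """SELECT b.subject FROM triples a JOIN triples b
       ON a.object = b.object
       AND a.predicate = 'is_a' AND b.predicate = 'is_a'
       WHERE a.subject = 'kulfi_bar' AND b.subject != 'kulfi_bar'"""
)]
```

Each extra hop in the graph is one more self-join on the same table, which is exactly the "same expressive power" claim in practice; dedicated graph stores mainly make deep traversals cheaper to write and execute.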
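The linkage problem mentioned in the tooling discussion, deciding whether two records from different databases refer to the same movie, can also be sketched minimally. Real entity-linkage systems use learned models over many features; this invented heuristic just combines a title-similarity score with a year check, with a threshold chosen for this tiny example.

```python
from difflib import SequenceMatcher

def same_movie(rec_a, rec_b, threshold=0.7):
    """Heuristic match: similar titles and identical release years."""
    title_sim = SequenceMatcher(
        None, rec_a["title"].lower(), rec_b["title"].lower()
    ).ratio()
    return title_sim >= threshold and rec_a["year"] == rec_b["year"]

# Invented records standing in for two different movie databases.
a = {"title": "The Matrix", "year": 1999}
b = {"title": "Matrix", "year": 1999}
c = {"title": "The Matrix Reloaded", "year": 2003}

match_ab = same_movie(a, b)       # same film, differently titled records
match_ac = same_movie(a, c)       # similar title, but a different film
```

The hard part that tooling would have to automate is everything this sketch hand-waves: normalizing titles, weighing multiple attributes, and choosing thresholds per domain.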