Summarizing Scientific Papers with John Bohannon - #136

The Challenges and Opportunities of Summarizing Scientific Papers

As John Bohannon explains in the interview below, summarizing scientific papers is a complex task that requires a deep understanding of the subject matter. When a good human writer is unsure how to paraphrase a technical term, they go back to the source and quote it directly; a summarization system can do the same, blending abstractive generation with the advantages of extractive techniques. In practice, this means identifying the entities in the paper that are most important or least certain and using them as anchor points for the summary, as the sketch below illustrates.
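To make the idea concrete, here is a minimal sketch of entity-anchored extractive summarization: score each sentence by how many salient terms it mentions and keep the top few. The salient-term list and the scoring rule are illustrative assumptions, not Primer's actual method.

```python
from collections import Counter

def extractive_summary(sentences, salient_terms, budget=3):
    """Rank sentences by how many salient terms they mention,
    then return the top `budget` sentences in document order."""
    scores = []
    for i, sent in enumerate(sentences):
        counts = Counter(sent.lower().split())
        scores.append((sum(counts[t] for t in salient_terms), i))
    # Keep the highest-scoring sentences, then restore document order.
    keep = sorted(sorted(scores, reverse=True)[:budget], key=lambda x: x[1])
    return [sentences[i] for _, i in keep]

paper = [
    "Generative adversarial networks are a class of deep learning models.",
    "The weather during the conference was pleasant.",
    "We train generative adversarial networks on image reconstruction tasks.",
]
print(extractive_summary(paper, salient_terms=["generative", "adversarial"], budget=2))
```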

Summarizing scientific papers is a fundamentally hard problem, and one that researchers have been grappling with for some time. One challenge is that scientific papers already come with their own summaries, known as abstracts, which can be helpful but are often so riddled with jargon and technical references that they read as summaries only for the papers' authors. A good summary of a scientific paper therefore needs to provide a clear and concise overview of the research without relying on specialized knowledge.

To overcome these challenges, researchers are exploring the use of ontologies and knowledge bases as a framework for summarizing complex information. One example is MeSH (Medical Subject Headings), an ontology funded by the National Institutes of Health (NIH) and maintained by the National Library of Medicine, which provides a rich set of concepts and terms covering biochemistry and molecular biology. By using an ontology like MeSH, researchers can map technical terms in a scientific paper to equivalent concepts in a broader knowledge base, enabling more effective summarization.
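As a toy illustration of this grounding step, the sketch below maps raw technical terms to broader MeSH-style concepts with a hand-built lookup table. The table entries are invented stand-ins; real MeSH descriptors form a much richer hierarchy.

```python
# Hypothetical fragment of a term -> concept mapping in the spirit of MeSH.
MESH_LIKE = {
    "mapk/erk pathway": "Signal Transduction",
    "crispr-cas9": "Gene Editing",
    "p53": "Tumor Suppressor Proteins",
}

def ground_terms(terms):
    """Replace each jargon term with its broader concept when known,
    leaving unknown terms untouched."""
    return [MESH_LIKE.get(t.lower(), t) for t in terms]

print(ground_terms(["CRISPR-Cas9", "p53", "autophagy"]))
# -> ['Gene Editing', 'Tumor Suppressor Proteins', 'autophagy']
```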

This approach requires significant amounts of data and computational power, but it also has the potential to accelerate artificial intelligence research by providing engineers with better tools for reading and analyzing large volumes of complex information. By building AI systems that can summarize and make sense of scientific papers, researchers hope to speed up the process of advancing knowledge in their field.

In particular, the use of the MeSH ontology as a tool for summarizing scientific papers holds significant promise. By leveraging this existing knowledge base, researchers can create more accurate and effective summaries of complex information, which can then inform decision-making or drive further research. This is particularly important in fields like pharmaceutical research, where understanding complex biological systems is critical to developing new treatments.

Ultimately, the goal of summarizing scientific papers is to provide a clear and concise overview of complex information without relying on specialized knowledge. By using ontologies and knowledge bases like MeSH, researchers can build more effective summarization tools that also provide insight into the research being conducted.

"WEBVTTKind: captionsLanguage: enhello and welcome to another episode of tormal talk the podcast why interview interesting people doing interesting things in machine learning and artificial intelligence I'm your host Sam Carrington I'd like to start out by thanking everyone who joined me last week at the toile AI summit in Las Vegas it was a great event for a summary of the event and my key takeaways from each of the event sessions sign up for my newsletter at tamale Icom slash newsletter I wrote about it right after returning from the event last week and when you sign up you'll automatically get an email telling you how to get access to back issues again that's twill Malaya comm slash newsletter event season continues this week tomorrow I'm keynoting at the prepare AI event here in st. Louis and then making my way out to San Francisco for figure-eight strain AI conference the Train AI agenda looks awesome and I'll be on site all day podcasting so if you're in the Bay Area you should definitely plan to stop by of course if you do use the discount code twimble AI for 30% off of registration be sure to give me a shout if you're planning to be around in this episode I'm joined by John Bohannon director of science at AI start-up primer as you all may know a few weeks ago we released my interview with Google legend Jeff Dean which by the way you should definitely check out if you haven't already anyway and that interview Jeff mentions the recent explosion of machine learning papers on archive which I responded to jokingly by asking whether Google had already developed the AI system to help them summarize and track all of them while Jeff didn't have anything specific to offer a listener reached out and let me know that John was in fact already working on this problem in our conversation John and I discussed his work on primer science a tool that harvests content uploaded to archive sorted into natural topics using unsupervised learning then gives relevant summaries of the activity happening in different innovation areas we spend a good amount of time on the inner workings of primer science including their data pipeline and some of the tools they use how they determine ground truth for training their models and use of heuristics to supplement NLP in their processing all right let's do it all right everyone I am on the line with John Bohannon John is director of science at a startup called primer John welcome to this week in machine learning in AI Hey so this conversation is an interesting one it grew out of a listener response to a comment made in my recent interview with Jeff Dean Jeff commented on the explosion of machine learning papers on archive and I jokingly asked if Google had already developed the deep learning based summarization techniques to help us all keep up and it turns out that one of your colleagues John reached out to let me know that you have been working on this and have built it and I mean just before we got started you showed it to me and it's pretty cool so here we are but before we get into the details of that project you've got an interesting background in molecular biology and data journalism how did you find your way to AI uh it's a long journey but uh I think it started in computer camp when I was 9 years old so that that's the kind of summer camp I went to and yeah as my studies progressed I actually drifted away into biology in a PhD in molecular biology and then before doing my next postdoc I wanted to take a break and do something different so I tried being a 
John: It's a long journey, but I think it started at computer camp when I was nine years old; that's the kind of summer camp I went to. As my studies progressed I drifted into biology and a PhD in molecular biology. Then, before doing my next postdoc, I wanted to take a break and do something different, so I tried being a science journalist and fell in love with it. I jumped off the academic track and eventually became a computational journalist, using data and code to find and tell stories that are impossible to tell otherwise. A friend of mine named Sean Gourley did his PhD with me in England at the same time; we actually lived in the same house, and our fates eventually became intertwined again. I moved to the Bay Area to do a visiting scholar stint at Berkeley, and he's in San Francisco. He says, "Hey John, I've got this startup called Primer, and you really should come by and check out what we're doing. I think you're going to find that the stuff we're working on really matches the stuff you work on." Eventually I had some time, so I popped over there for a week, and sure enough, within one day it was clear they were solving problems that I find so hard, and wanted so badly to solve myself, that, well: if you can't beat 'em, join 'em.

Sam: Nice. Maybe for context you can tell us a little about what the company does and the kinds of problems you're working on.

John: Primer, at its core, is an AI company that's trying to make machines that read and write; that's the fundamental problem that underlies all of this. In terms of a business model, we automate, for example, a lot of the work that a junior analyst would do in, say, a bank or the intelligence community, and also, frankly, a lot of what a journalist does. I feel like I'm reverse-engineering myself every day. It's also somewhat automating what you do, Sam. What all of our jobs have in common is that we have to read a ton of often very technical material, make sense of it, and then tell stories. The story is the fundamental unit of information; that's our data structure. And that is really hard for computers to do.

Sam: It's hard for people to do, too.

John: Exactly.

Sam: It's hard for everyone. You're relatively new to this podcast, but those who have been around from the beginning know that it started out as more of a news-oriented format, as opposed to an interview format. My mission was to summarize the most interesting AI and ML tidbits from the previous week's news, but that is super hard, especially with so much news happening all the time. It would take a ton of time to curate all that information, digest it, and turn it into stories, as you're saying.

John: Exactly. You faced several problems, and what we try to do at Primer is break it down into reasonable problems you can actually attack. One is relevance: what are you telling a story about? It's not enough to say, "I want to tell a story about last week's AI research." What documents are relevant? Even if you could get the papers, where do you get all the conversations about those papers? How do you figure out what those papers were about? If there were a thousand papers published over the past several months and you wanted to tell the story of those thousand papers, I don't know how a human would do that.
Well, actually, I can tell you: humans simply don't do that. What we do is take shortcuts. We fly blind; we grab the zeitgeist, and that's kind of a random process: "I overheard some conversations, and this seems to be a hot topic, so I'm going to amplify it." What you end up with are coherent stories, but they're not necessarily the most important thing that happened; they're some strange sampling of the space of all things that happened. That's the best you can do. But what if you had a machine that could actually read everything and show you, in some sense, everything that happened? That's the goal.

Sam: You showed me a kind of portal into research papers. Is the idea to provide that as a service, or more the platform that allows someone to create that thing?

John: We're in a pretty privileged position, in the sense that we've already got some really big customers: the federal government, Walmart, Singapore's sovereign wealth fund, with several others coming online soon. Those are the relationships that actually pay the bills. So we do things like this: if you have a portfolio manager trying to keep track of a ton of companies, that portfolio manager needs to stay on top of all the relevant developments in the space roughly defined by those companies, all the news about them, maybe SEC filings if you want to assess changes in risk profile. It's an overwhelming task. Primer basically superpowers those analysts by automating the things that are hard, tedious, and time-consuming; it reduces the cost of curiosity. It allows those analysts to not spend half their day reading a million things just to find out what was worth reading. Instead they can see summaries of a hundred papers at once, get a sense of whether it's worth diving deeper, or look at another batch of a hundred papers. It also gives alerts with predefined conditions, so that you don't lose a second if something turns out, in retrospect, to be a situation worth knowing about; you'll get a heads-up. Meanwhile, you can use the same machinery that does reading, writing, and summarization to do things like the thing I sent you: read all of arXiv. We do have a business model for this system; going forward we'll be developing it into products for, for example, the pharmaceutical industry. But for the time being we just have this beautiful laboratory where we get to push the edge of natural language processing.

Sam: Tell us more about this arXiv project you've built.

John: arXiv is a really good illustration of this problem we all face of too much information. If you ever go to the arXiv website, you basically see a firehose of research coming in. arXiv is amazing because it is literally the place where research gets debuted; it's the first place you'll see a paper from Google or Microsoft or MIT on topics that are going to define machine learning progress over the next ten years. In retrospect you can look back and see the timeline of this amazing scientific revolution unfolding, but it's not at all human-readable. Even if you're an expert, even if you have a PhD in machine learning, you just can't make sense of all of arXiv. You might be able to make sense of the papers in your own subdomain, but even there it's tough; you've got to find them. arXiv isn't designed for humans. I mean, it is, but it's just not user-friendly. Primer Science is a stab at making sense of that.
Basically, it's a really hard problem that's well scoped. What it does is harvest all these papers and run unsupervised learning on their content to figure out what topics they naturally fall into. Within machine learning, for example, I'm looking now at some of the latest results: the system has discovered not only that there are image reconstruction papers (there are 58 papers in that bag on the theme), but also a whole bunch of research on traffic and temporal analysis, something on mathematical optimization, and a whole bunch of papers about semantic segmentation. All of this happens without an ontology or knowledge base, and you're going to have to have such a system if you want it to work on any corpus of papers. You could imagine building some super-ontology that captures everything there is to know about science, but it would be out of date next month. I wouldn't want to build that thing; maintaining it would be a nightmare. Instead you need a system that does more or less what humans do on a smaller scale: we eyeball things and say, "these are kind of about this, and these are about that," and you get a natural segmentation of the space. Then, within each of these topics, it does a time-series analysis. It takes all the news and the social media signal, all the tweets about this research as it was published and afterwards, all the commentary, the real-time online critique, the peer review that's happening in real time out in the open, and asks: can I detect events? An event can be more than just the publication of a paper. It could be that, for example, a self-driving car crashes somewhere, and suddenly the world is looking intensely at an issue related to what we do and don't know about these systems, and some of this research gets pulled into that. If you want to detect that real-world event, you need a system that can take all those documents and tweets that are relevant to the same thing and figure out how to segment them in time. So it does that too: it tries to figure out what the big events in this space were, and how human attention in the world was divided in relation to this corpus of papers. Then it does some other cute tricks to make it useful as you dive into all this information. It pulls out all the people and tries to tell you what it knows about them, just based on the corpus. Mind you, we're also developing a version that builds a knowledge base and actually learns about people as it reads the news and as papers are published, but what I sent you this morning is essentially out of the box: "I don't know anything about the world, but I know this group of thousands of papers you sent me, and this is what I can tell you about them; these are all the people, these are all the topics, these are the events that all this information seems to be pointing at out in the real world." And another cute one: if you're finding the jargon really hard to understand, it generates a dictionary for you, a kind of magical dictionary where, if you click on a technical term, it shows you who coined the term, when, how it's defined, and some context about how it's used. It's like an Oxford English Dictionary on steroids.

Sam: Nice.
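To make the event-detection step John describes concrete, here is a minimal burst detector over daily document counts using a trailing-window z-score. The window size, threshold, and toy data are arbitrary assumptions, not Primer's settings.

```python
import numpy as np

def detect_bursts(daily_counts, window=7, z_thresh=3.0):
    """Flag days whose count is far above the trailing-window mean:
    a crude stand-in for real event detection over a document stream."""
    counts = np.asarray(daily_counts, dtype=float)
    bursts = []
    for t in range(window, len(counts)):
        past = counts[t - window:t]
        mu, sigma = past.mean(), past.std() + 1e-9  # avoid divide-by-zero
        if (counts[t] - mu) / sigma > z_thresh:
            bursts.append(t)
    return bursts

# e.g. mentions of "self-driving car" per day; day 10 is a crash-driven spike
series = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3, 40, 22, 9]
print(detect_bursts(series))  # -> [10]
```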
Sam: I'm finding this interview more challenging than most, because as you're speaking I've got the tool open in the background and I keep seeing papers that look really interesting.

John: It's working!

Sam: Super distracting. Maybe you can tell us a little about the technology that's making it all happen. What does the stack look like? What does the pipeline look like? How are you approaching the unsupervised learning piece?

John: It all begins with a gigantic Elasticsearch index. I think if you talk to a lot of the people you've interviewed about what's at the bottom of the whole stack, there's often some massive index of documents. We're ingesting news, blogs, tweets, and scientific papers every day, and that's the starting point of the whole system: a growing corpus. When you kick off a query, as we've done today on "artificial intelligence," for example, the first thing it has to do is retrieve all the relevant information. That kicks off a pipeline where the first step is to use unsupervised learning, plus several other steps, to divide all the information into natural topics. Within each topic, it then tries to detect the events in the real world that any of these documents might be referring to. So if you've got a hundred documents, which might be news articles, scientific papers, and social media signal about all of the above, you run a time-series analysis and try to make an inference: are there real-world events that all this information is pointing at and describing?

Sam: It looks at events basically from the perspective of news articles, is that right?

John: The system you're looking at does, yeah. But you can imagine that any document with a meaningful publication timestamp that includes a description of, or commentary about, something that happened in the real world could in principle be mapped to something called an event. The concept of an event here is bigger than what a human would intuitively call an event. It might be, for example, an explosion of discussion around an issue. The #MeToo movement is not just one event; it's made up of many events, and some of those events might not even be observable in one place at one time. But there is a natural segmentation of all the things happening in the world into what we call events. That's the theory behind it.
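A loose sketch of the front end of such a pipeline: query an Elasticsearch index for relevant documents, then cluster them into natural topics with TF-IDF and k-means. The index name, field names, the 8.x Python client, and the choice of k-means are my assumptions, not Primer's implementation.

```python
from elasticsearch import Elasticsearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

es = Elasticsearch("http://localhost:9200")  # assumed local index

# 1. Retrieve everything relevant to the query (hypothetical index/field names).
resp = es.search(index="papers",
                 query={"match": {"body": "artificial intelligence"}},
                 size=1000)
texts = [hit["_source"]["body"] for hit in resp["hits"]["hits"]]

# 2. Unsupervised topic discovery: TF-IDF vectors plus k-means clusters.
vecs = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(texts)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(vecs)

for topic in range(10):
    print(topic, sum(1 for l in labels if l == topic), "docs")
```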
Then, if you click over to Overview (sorry to distract you again), it tries to tell you a story. We've got many versions of this; what you're looking at is one of the earliest. But basically, if you asked a machine to go read thousands of things and gave it a budget of one page to tell you what it learned, this is starting to get at what you'd expect to come back. It's kind of like a technical report: these are the things I learned, these are the big events, these are the big papers, this is what's getting the most attention. Oh, and by the way, my topic analysis has revealed that there are some changes afoot in artificial intelligence, and these are the things that seem to be trending upward and are really interesting. And by the way, I discovered this weird paper that seems to fall in this topic but is deeply connected to this other topic, and that's statistically strange, so I need to tell you about it. And here are some people who seem to be getting a ton of attention, and here's another person who collaborated with them on a high-profile paper though they've never worked together before; that's interesting. You can see what's going on here: the system has a model of what humans find interesting, and of course we humans at Primer built that in. There's a story logic that underlies this. You don't want a system that tells you everything it learned; that's just going to be another firehose, and you've made no progress. A one-to-one map of the world is not a useful map. You need something that will compress the information and try to tell you a story. That's what the system does.

Sam: I think I interrupted you as you were about to start talking about the pipeline you're sending some of this through. Going back to the beginning, with arXiv: are you ingesting all the arXiv papers, or crawling the site? How does that work?

John: Paul Ginsparg, who founded and still runs arXiv, is a friend of mine from a good while back, and he's a Primer Science user as well; I think he's actually the very first one I made a user account for. He's really helped out over the past year making sure that we have direct access, so we don't have to scrape the site; we basically just pull down the entire day's new papers in one go. We do the same with news, except it arrives more or less in real time, with maybe a ten-minute delay, and we've got a real-time stream of all the tweets that are relevant to the space.

Sam: And are those via commercial APIs of some sort?

John: We get them directly from Twitter; we have a data deal with them. For news we actually have several sources. One of the most convenient is LexisNexis: they have a service called Moreover where you can purchase a firehose of news. They do a really good job, actually.

Sam: So you pull all that into your Elasticsearch index. Maybe talk a little about some of the underlying NLP that's enabling all this.

John: When you kick off a query, what's happening is you're making a lot of reading happen. For example, if you take a look at the topics that have been generated, like "text and word embeddings" or "quantum," those topic labels were discovered and chosen from the content of the articles themselves. The first step in any NLP task on documents is to tokenize the entire document: you discover all the words and punctuation, and you run an analysis that gets you the parts of speech. It's kind of like the sentence diagrams you made in grade school to make sense of all the different parts of what someone says. Then a whole bunch of things happen in parallel. Some things are useful if you work with a bag of words: you can take an entire scientific paper, or even a thousand scientific papers, and turn them into bags of words, and with that kind of analysis you can discover the groups of words, the n-grams, that best describe the space, and generate a label. If you go into any of those topics, the system has given the topic a name based on the language in the documents themselves.
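For instance, a minimal spaCy version of that tokenize-then-label step: tokenize, pull noun phrases, and name the topic after the most frequent ones. Labeling by raw noun-chunk frequency is a simplification invented here, not whatever Primer actually does.

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline, assumed installed

def topic_label(docs, top_n=2):
    """Name a topic after its most frequent noun phrases."""
    phrases = Counter()
    for doc in nlp.pipe(docs):
        for chunk in doc.noun_chunks:  # candidate n-grams
            phrases[chunk.text.lower()] += 1
    return ", ".join(p for p, _ in phrases.most_common(top_n))

docs = [
    "Semantic segmentation of street scenes remains challenging.",
    "We propose a new loss for semantic segmentation.",
]
print(topic_label(docs))  # e.g. "semantic segmentation, ..."
```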
I'm still amazed that it works, frankly. NLP is kind of magical: when something makes sense to a human even though the machine didn't really understand it the way you did, it's kind of magical.

Sam: Are you using off-the-shelf NLP toolkits, NLTK and so forth, or rolling your own?

John: We started off that way. We've been using the tool spaCy from the very beginning. It's free, it's open source, and it's really powerful. What really shocks me is that there are just two people at the heart of the project, Matthew Honnibal and Ines Montani, who live in Berlin, not far from where I lived for a few years; I've gotten to know them a little just recently. It does the nuts-and-bolts NLP that you need: it will tokenize, and it will also discover named entities, so it helps you find the people, organizations, and so forth. But you need to train it. That's something we've discovered, probably like everyone else: it gets you started, but then you need to solve your own problems; it's only a starting point. For example, with the people, and all the information we can extract about them so we can tell you a story based on the people in a space: spaCy is one of the things we use early in the pipeline, but there's a ton of custom code we had to build to get the kinds of information spaCy can't get, to clean up the things spaCy gets wrong, and to link it with all the other information we're extracting by other means. It's a mixture of machine learning and good old-fashioned regular expressions. What I find so fun about being at an AI startup is that the goal here is not to generate research papers; the goal is to solve problems really well by whatever means you can.
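The ML-plus-regex mixture John describes might look something like this minimal sketch: spaCy's stock NER proposes person entities, then simple hand-written rules throw out obvious false positives. The filter rules are invented for illustration.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_people(text):
    """spaCy NER proposes PERSON spans; regex heuristics clean them up."""
    people = []
    for ent in nlp(text).ents:
        if ent.label_ != "PERSON":
            continue
        name = ent.text.strip()
        # Heuristic cleanup: strip titles, then require capitalized tokens.
        name = re.sub(r"^(Dr|Mr|Ms|Prof)\.?\s+", "", name)
        if re.fullmatch(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+", name):
            people.append(name)
    return people

print(extract_people("Dr. Jane Smith of MIT spoke with reporters on Tuesday."))
# -> ['Jane Smith'] (model permitting)
```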
Sam: Which I think is the right motivation to have.

John: Right. If you're just motivated to publish cutting-edge papers, you don't care if it works. I went to this conference called NIPS, which is essentially where all this cutting-edge research is being debuted, and something that really struck me is that half the stuff people are bragging about doesn't really work in practice, or works only under such narrow conditions that it won't transfer to your problem; it's computationally intractable, or whatever. And that's fine; the whole point is to debut tomorrow's technology. But it's frustrating when you're trying to build something, you get excited about some new idea, and you chase it down only to discover it never could have worked.

Sam: Yeah.

John: I've had that experience. I found a paper using Primer Science, of course; it's a pretty weird situation to have AI eating itself: we basically have an AI system that reads AI papers, which we then use to try to improve the AI system itself. We came across a really exciting paper and fully replicated it, and it just doesn't work. And that's okay; that's how it goes in this space when you're right at the edge of knowledge. It's not all going to work. So we have this principle at Primer of always trying to find the practical solution as quickly as possible. Don't get seduced by ideas that are sexy to talk about but aren't actually solving your problem.

Sam: I should throw in a plug for my newsletter; I've recently written on this topic of reproducibility in both science and AI, drawing on a recent interview I did with Clare Gollnick on the same topic. But I really appreciate you owning up to that broader pipeline. One of the questions I get a lot when talking with folks about their products or projects is: granted, you've applied some great cutting-edge machine learning, but what else is required to make it work? How much heuristics are in and around these tools to actually make them work? So to hear you note that good old regular expressions are used liberally to make this all work: I think it's important to realize that.

John: Oh yeah, absolutely. I guarantee you, go into some of the biggest, most cutting-edge groups at giant tech companies. You'd think they're doing some kind of pristine AI where you just press a button and it understands things; I guarantee you, look under the hood and there's just a ton of regular expressions. Now, that's not to say that machine learning isn't the way forward; it totally is. But making these things work on actual problems is still a labor of love.

Sam: You're doing a lot with spaCy, which I'm assuming is a more traditional NLP technology approach. Are you also doing things with word2vec and deep-learning-based approaches?

John: Yeah. In particular, as we've expanded into languages beyond English, spaCy is just not going to cut it when you want to make something that understands Russian and Chinese, so we've actually had to build a bunch of tools pretty much from scratch. It relies on word vectors and word embeddings. Where things get complicated is when you try to pull this all together. If you use deep learning to extract, for example, some pattern in a corpus of 10,000 documents, the harder thing, once you've extracted it, is knowing whether you're right and whether it's worth saying. I can find a bunch of patterns in text pretty easily; the harder thing is assessing how confident I am that I've found something real and haven't just mis-extracted a spurious pattern. And even harder than that: is it worth telling you? How do I square this with my model of what humans are interested in?
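To ground the word-embeddings point, here is a tiny gensim word2vec sketch (assuming gensim 4.x parameter names). A real system would train on millions of sentences, and multilingual support would need language-specific tokenization; the toy corpus makes the neighbors noisy.

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens.
corpus = [
    ["neural", "networks", "learn", "representations"],
    ["deep", "neural", "networks", "need", "data"],
    ["word", "embeddings", "capture", "meaning"],
    ["embeddings", "map", "words", "to", "vectors"],
]

model = Word2Vec(sentences=corpus, vector_size=32, window=3,
                 min_count=1, epochs=50, seed=0)
# Nearest words in the learned vector space (noisy on a corpus this small).
print(model.wv.most_similar("embeddings", topn=2))
```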
Where we're headed with this is basically a model of stories, which ultimately is a model of humans. Humans are storytellers; we've evolved to do this thing, and we take it for granted. What we're doing right now, this conversation, is incredibly high-tech: you and I, in real time, are gliding through a narrative that took millions of years of evolution. It's amazing. I think this is actually the next frontier of AI: decoding what story is.

Sam: And so what does that mean practically? How are you approaching that?

John: Here's a bite-sized example. If you make something that reads scientific papers and tries to tell you what you need to know about AI research last week, for example, it's not enough to just give you a dashboard: here's the most-shared paper, here's the paper that got the most news, here's the paper that currently has the most citations. That's not doing much heavy lifting for you. If you were to hire a thousand human analysts to just work for you, imagine you had that luxury: what would you ask them to do? That's the better guiding question. What sort of story would they tell you? What would the format be? I guarantee the humans wouldn't come back and give you a dashboard. They would say: "Okay, the big deal last week is that a self-driving car crashed, and it's kicked off a huge discussion about quality control, about where system errors are going to creep in, and about how you can make machine learning systems understandable from an engineering point of view; how are we going to deal with this emerging problem? The people weighing in on this are the following deep learning researchers, plus some other people who are very knowledgeable but in an adjacent domain, and we think that's really worth knowing. Meanwhile, by the way, we discovered a paper published by a couple of researchers you've probably never heard of, but it's getting a lot of traction and seems to be on an emerging topic you're probably going to care about; it has to do with voice recognition, and we know that's an interesting topic. The more interesting thing is that one of these researchers is really well known in a totally different field and is just diving into this, and that's unusual, so check it out. I'm going to go out on a limb here and say you really should read this paper. And by the way, here's basically a new concept that is creeping into the space that we haven't seen before. It might be a fluke, but I think it's actually worth knowing about. So here are five papers you should read; I'm working within your budget here." That's what the humans would do. It's basically like the one-to-two-page presidential intelligence briefing: a ton of research has gone into boiling things down to a very tight story, and that's all you need to know.

Sam: And so the idea, then, is that you've got some kind of generative model for creating these briefings?

John: It has at least two steps. One is: what information can I find that's truly relevant, the raw ingredients of a story? And the next step is: how can I synthesize this into an actual story? I have to do text generation, document planning. You give me a budget, a page, a paragraph, maybe you just want a bullet point, and I'll work with it; I'll be able to express this as a story given that constraint.

Sam: And, kind of going back to our earlier exchange about good old-fashioned heuristics: I haven't compared one of these briefing pages against another, but how much is generation and how much is more like templates and things like that?

John: The philosophy we've followed is: always start fast and doable. Put another way, you always want to start with a model that you can fully understand yourself and implement quickly, so that you have some baseline. We've always started with: first, can you do it yourself as a human, maybe with no computer involved? If you were to read ten papers and try to say something intelligent about them, for example: if I gave you a pile of papers and asked you to classify events, what kinds of tags would you attach to them? If you were looking for a particular kind of event, could you divide the papers into yes and no?
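In that spirit, here is a do-it-yourself baseline you could write before reaching for machine learning: a yes/no event classifier made of nothing but regular expressions. The event type and patterns are illustrative, not the ones from the race John describes next.

```python
import re

# Hand-written rules for one hypothetical event type: "product recall".
RECALL_PATTERNS = [
    re.compile(r"\brecall(s|ed|ing)?\b", re.I),
    re.compile(r"\bpull(ed|ing)? (off|from) (the )?shelves\b", re.I),
]

def is_recall_event(text):
    """Yes/no event classification with pure heuristics, as a baseline."""
    return any(p.search(text) for p in RECALL_PATTERNS)

print(is_recall_event("Acme recalled 10,000 vehicles on Friday."))  # True
print(is_recall_event("Acme announced record profits."))            # False
```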
Always start with yourself. Can you, the engineer, do it yourself? Because if you can't, you're probably going to have a hard time teaching a computer to do it. Then get some other humans, probably the person just two chairs away from you, to do the same task independently; if you get the same, or ideally a similar, answer, okay, now you're in good shape. Only then do you start building a computational system to try to do this automatically, and your first stab should be something straightforward: a set of regular expressions, heuristics. Can you actually find this yourself using rules that you yourself devised? Then the only way to get beyond that, to really tackle increasing complexity, is to have something that will learn on its own. You'll never do that with regular expressions; you have to use machine learning to have a system find patterns itself in a changing world.

Sam: So I think you're saying, then, that you're somewhere on the spectrum between templates and machine learning.

John: Oh yeah, always. And in fact I think the best things out there are always somewhere in the middle. Essentially it becomes a race: can we build something that learns faster and outputs better, smarter content than the system we have? We actually had a little race recently to build an event classifier. A brilliant engineer named Leonard Appleton took a stab at it using just regular expressions, no machine learning, and another brilliant engineer named Josh took on the task of solving the same problem using a really complicated machine-learning graphical model. And sometimes John Henry wins the race: frankly, Josh could not build a system, at least last I checked, that did better than Leonard's massive, complicated regular-expression heuristic engine. Eventually machine learning will win; we all know that. But that's the beauty of a practical approach. When you're really driven by practical principles, you're willing to say, "we've got a better solution that's actually simpler and easier to understand; let's use that for now and keep trying." But it's never long before a machine-learning-based system does better; it's just an incredibly powerful tool.

Sam: When you're using machine learning for tasks like summarization, where, as you said, first you do it, then you get someone else to do it, and you compare: your summary of a given paper or a given paragraph is likely to be very different from mine. How do you find ground truth so that you can train learning models?

John: You've really put your finger on the hardest problem. Stories, by their nature, can be told in infinite ways. There are some automated techniques that have been around for a decade; they have French color names, I don't know how that came about, called ROUGE and BLEU. What they do is treat the output as a bag-of-words problem and try to find out how much information overlap there is between a human summary and a computer summary. As you can imagine, that's great if you're trying to measure whether you got it terribly wrong: if two summaries have nothing to do with each other, they're probably not talking about the same thing.
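For reference, the bag-of-words overlap idea behind ROUGE boils down to something like this minimal ROUGE-1 recall sketch; real implementations add stemming, multiple references, and longer n-gram variants.

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams recovered by the candidate summary."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

ref = "the model summarizes scientific papers using a pointer network"
print(rouge1_recall(ref, "a pointer network summarizes papers"))  # ~0.56
```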
Sam: Though, that's like summarizing fiction, right? We could be summarizing on two totally different levels and both be right.

John: That's true. That's absolutely true.

Sam: I think the same holds true for news. I'll let you continue, but that seems like a very rudimentary metric.

John: Well, you'd be surprised, then, to learn that the latest, greatest papers in this field are still using those metrics, because they're easy; it's a one-click measurement. But it really doesn't help when you want to assess a subtle output, a story that could be sliced and diced in sort of infinite ways. Unfortunately, it becomes a CAPTCHA: you need some human to read it and go, "oh yeah, that makes sense," or "that's crazy." But there are some techniques you can use. One is that you can crowdsource the assessment of narrative: you can give human annotators and scorers a rigorous system, so you can measure the coherence, measure the sophistication, measure whether or not you've really summarized the space well in various ways.

Sam: Those sound like they would require fairly sophisticated crowdsourcing.

John: That's right. The more technical and sophisticated this task becomes, the less you can rely on Mechanical Turk; in fact, eventually you've got your own engineers doing this, so it's definitely not scalable. But there are some tricks you can use. For example, if I generate a bunch of summaries on a topic that has already been summarized, say there's a Wikipedia article about it, I can at least find out whether the most important entities in the narrative have been represented. I can also turn the system around and do extraction on the summary. You can even, I will suggest, make a generative adversarial network that generates stories and critiques them; you can see where this is going. Eventually you have a system that tries to check off all the boxes of what counts as a good story: you've talked about the most important entities, you've expressed their relationships, you've come in under budget in terms of space. But ultimately you're going to need a human to assess whether it's a well-written story, until we can crack the code of text style transfer, where you can actually say, "tell me this story in the style of a New York Times reporter," or "tell me the story in the style of a terse military briefing."

Sam: Summarize my texts in Hemingway style.

John: Exactly. Until we can actually have networks that can both detect and reproduce narrative style, I think we're stuck, for the time being, in a world where it's really hard to assess how well our systems are doing. Ultimately you want to hook this up to your users and either passively or actively harvest their feedback. The simplest version of this, of course, is A/B testing: if you write many versions of a summary and you expose a large number of humans to A versus B, you can find out what they think of it by, for example, whether they click through and read it. You can also make it active: you can let users say, "yeah, that was good," or "that was bad."

Sam: Or, going back to my Hemingway text summaries: Google Inbox presenting you three choices for the appropriate response to an email.

John: Yep.
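And here is a sketch of the "turn the system around" check John mentions: run entity extraction on the generated summary and measure how many of the reference entities (say, from a Wikipedia article on the topic) it preserved. Using spaCy's stock NER for both sides is my assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_coverage(reference_text, summary_text):
    """Share of named entities in the reference that survive in the summary."""
    ref_ents = {e.text.lower() for e in nlp(reference_text).ents}
    sum_ents = {e.text.lower() for e in nlp(summary_text).ents}
    if not ref_ents:
        return 1.0
    return len(ref_ents & sum_ents) / len(ref_ents)

wiki = "Primer was founded in San Francisco. Sean Gourley is its CEO."
summary = "Sean Gourley leads Primer, a San Francisco startup."
print(entity_coverage(wiki, summary))  # model-dependent, ideally close to 1.0
```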
John (cont.): We've played with that as well; we generate alternative summaries of events, for example. It's a really powerful way of doing real-time, effortless quality checking. You don't want to have to pause your whole engineering operation all the time just to assess how well you're doing; you really want it to be continual. You want to always be beating the output of your own computational systems. We call it dogfooding: you've got to be dogfooding in real time. And the nice thing about Primer Science, this thing I'm building, is that we use it to discover the research that is going to help us make it better. So if you keep on using the thing, you are your own quality assessor, and that really helps.

Sam: Right, but hard to scale. I wish I could clone myself in some way to assess at a thousand docs. Now, one thing I didn't see in what you've built: it seems like it does a really good job at this kind of meta-characterization of arXiv and what's happening in different categories, but I didn't see it attempting to summarize individual papers, which is the thing Jeff Dean and I were originally talking about. Is it trying to do that somewhere?

John: Not in what you're looking at, but we are actually working on that summarization problem. We've taken two strategies, and they're kind of running in parallel. One is extractive summarization, where the system is allowed to pull words and even whole sentences directly from the text and then pull them together into a summary. That works extremely well when you have a large number of docs: if you have a hundred documents all about the same thing, extractive summarization is really powerful and really efficient. The alternative is abstractive summarization, where the system writes its own words, often character by character, out of thin air. It has a language model: it reads all these things and basically makes a prediction about what it should say next as it generates a summary. A really nice bit of progress in this field that we've been using is abstractive summarization with pointers. The idea here is that you also have a sense of your confidence about whether the word or phrase you're putting into the summary at any given time is going to be a good choice, and if you're not so confident, you point back to the text and grab the thing itself. For example, take the sentence "one of the most exciting areas of artificial intelligence these days is generative adversarial networks." If "generative adversarial networks" is a phrase you haven't encountered, or your model basically says, "I'm not sure I can paraphrase that," then what you want to do is what a good human writer would do: you just go back and grab the thing itself. So you can summarize while also having some of the advantages of extractive.

Sam: So, summarizing basically around the entities that you aren't too sure about.

John: Exactly. It basically becomes a sliding scale between abstractive and extractive. The more confident it gets, the more abstractive and flexible it gets, which will allow you to summarize a single scientific paper, for example, in a couple of sentences. If it's not so sure, it slides over to extractive, and it will just pull out the sentences and phrases that it deems most central and informative.

Sam: Interesting.

John: It's a hard problem, though. It's a really hard problem.
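The copy-or-generate decision John describes is, at its core, a learned mixture of two distributions, often called a pointer-generator in the literature. Here is a stripped-down numpy sketch of that mixture; everything below, including the toy vocabulary and probabilities, is schematic rather than an actual trained model.

```python
import numpy as np

def next_word_dist(vocab_dist, copy_dist, p_gen):
    """Pointer-generator mixture: with probability p_gen, generate from the
    vocabulary; with probability 1 - p_gen, copy a word from the source."""
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist

vocab = ["the", "network", "learns", "GAN"]      # "GAN" is a rare source term
vocab_dist = np.array([0.5, 0.3, 0.15, 0.05])    # the model's own word choices
copy_dist = np.array([0.0, 0.1, 0.0, 0.9])       # attention over the source,
                                                 # pointing at the rare term

# Low confidence (small p_gen) pushes the model toward copying.
low_confidence = next_word_dist(vocab_dist, copy_dist, p_gen=0.2)
print(vocab[int(np.argmax(low_confidence))])     # -> 'GAN', copied verbatim
```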
John (cont.): Another thing that makes it hard when it comes to scientific papers is that they already have their own summaries; they're called abstracts. You'd think, "oh great, job done," but as you know, abstracts themselves can be so riddled with jargon and references to arcane things that they're hardly summaries at all. An abstract is really only a summary for the authors of the paper.

Sam: Right, so you really need a summary of the summary.

John: Right, and that's what we're working on. We're finding that you really do need to power this with an ontology and a knowledge base, though.

Sam: Elaborate on that.

John: Okay, let's take, for example, a problem that I'm just starting to work on: how do you summarize and make sense of pharmaceutical research papers? There is an ontology available to everyone, which the NIH basically paid for, called MeSH. Every jargon term in biochemistry and molecular biology, gene names and gene types, all of that is captured in this very rich ontology that was hand-built, no doubt, by unthanked graduate students. Something that's really nice about MeSH is that it's actually a subset of Wikidata, and Wikidata is the database that stands behind Wikipedia. Now, I say that in an idealistic way, because that's the way it was dreamed up: Wikidata was going to be the database that powers Wikipedia, but in fact it's not there yet. Humans vastly prefer to update Wikipedia with content, and Wikidata basically plays catch-up. Nonetheless, it is a huge, powerful, open knowledge base, and the MeSH ontology is a subset of it. So if you want to summarize a scientific paper, even just a single scientific paper, the first thing you need to do is make sense of it. You need to map all of those words, which to the computer could be random numbers for all it cares (it has no idea what they mean), to concepts, and that's what systems like MeSH were designed to help us do.

Sam: So the idea being: instead of what you're doing in Primer Science, in a totally unsupervised manner, here you're using the additional information you get from the pre-existing ontology to help the machine make sense of the various documents and paraphrase them.

John: Right. A good summary is something that doesn't just say less; it also says just as much, but in a compressed way. If I just tell you the beginning of a story, I haven't really compressed that story for you; I need to give you the sense of the beginning, middle, and end, and compress all of that down into three sentences. You're not going to be able to do that using standard NLP techniques on a scientific paper alone. You're just not going to be able to do it, no way. You have to map it out to an ontology and say, "this long sentence describing a genetic pathway, I can boil that down to a single sentence that says 'the genetic pathway X.'" You need a lot of tacit knowledge to be able to do that. So that's what we're working on.
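Since MeSH descriptors are cross-referenced in Wikidata, one can, at least in principle, bridge the two programmatically. Below is a sketch against the public Wikidata SPARQL endpoint; P486 is, to my knowledge, the MeSH descriptor ID property, and the label-matching details (exact, case-sensitive English labels) are assumptions to treat with caution.

```python
import requests

SPARQL = "https://query.wikidata.org/sparql"

def mesh_id_for(label):
    """Look up a concept by English label; return its MeSH descriptor ID."""
    query = """
    SELECT ?mesh WHERE {
      ?item rdfs:label "%s"@en ;
            wdt:P486 ?mesh .    # P486: MeSH descriptor ID (assumed)
    } LIMIT 1
    """ % label
    r = requests.get(SPARQL, params={"query": query, "format": "json"},
                     headers={"User-Agent": "mesh-demo/0.1"}, timeout=30)
    rows = r.json()["results"]["bindings"]
    return rows[0]["mesh"]["value"] if rows else None

print(mesh_id_for("signal transduction"))  # e.g. a D-number like 'D015398'
```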
Sam: Awesome. Well, John, this has been super interesting. I really appreciate you taking the time. Anything else you'd like to share with the audience?

John: Oh, just that I'd like to make a prediction.

Sam: Go ahead.

John: I predict that the kind of stuff we're working on is going to accelerate artificial intelligence research more than anything else. I think building AI that can read the latest research on AI, and help the engineers who build it build it faster, is going to vastly accelerate the whole process.

Sam: Awesome. Well, we will put your prediction on the blockchain, and just to make sure we get all the jargon in, we'll do an ICO. All right, awesome. Thanks so much, John.

John: Thanks, Sam.

Sam: All right everyone, that's our show for today. For more information on John or any of the topics covered in this episode, head on over to twimlai.com/talk/136. Thanks so much for listening, and catch you next time.
actually drifted away into biology in a PhD in molecular biology and then before doing my next postdoc I wanted to take a break and do something different so I tried being a journalist science journalist and fell in love with it and basically jumped off the academic track and became eventually a computational journalist basically using data and code to find and tell stories that are impossible to tell otherwise and a friend of mine named Sean Gourley who did his PhD with me in England at the same time I actually lived in the same house our fate eventually became intertwined again he I moved to the Bay Area to do a visiting scholar stint at Berkeley and he's in San Francisco he says hey John I've got this startup called primer and you really should come by and check out what we're doing I think you're gonna find that the stuff we're working on really really matches with the stuff you work on and so eventually I have some time and I was like okay I'll pop over there for a week and sure enough within one day it was clear that they were solving problems that I just find so hard and I wanted so badly to solve myself that basically if you can't beat him join them nice nice so maybe for context you can tell us a little bit about what the company does and the kinds of problems that that they're working on it you're working on yeah so primer at its core is an AI company that's trying to make machines that read and write that's the fundamental problem that underlies all this in terms of a business model we for example automate a lot of the work that a junior analyst would do in say a bank or the intelligence community also frankly what a journalist does I feel like I'm reverse engineering myself every day because a lot of what you have to do actually it's it's also somewhat automating a lot of what you do Sam like all of our jobs what we have in common is that we have to read a ton of stuff often very technical stuff and make sense of it and then tell stories we like that is the fundamental unit of information that's our data structure right I stole and that that is that is really hard for computers to do so it'll let people to do exactly I think yeah it's one of those things that's both it's hard for everyone so I think you're relatively new to this podcast but those that have been around for a while from the beginning know that it started out as more of a news oriented format as opposed to an interview format and basically the my mission was to kind of summarize the most interesting AI and m/l tidbits for from the previous week's news but that is super super hard especially with so much news happening all the time it would take a ton of time to curate all of that information and digest it and you know turn it into stories as you're as you're saying exactly and so like you faced several problems and what we tried to do it at primer is is break it down into reasonable problems that you can actually attack so one is is for example what's relevant like you know what are you telling a story about it's it's not enough to just say I want to tell a story about last week's AI research okay well where you like what documents are relevant even if you could get the papers then it's like well where do you get all the conversations about those papers how do you figure out what those papers were about you know if there were a thousand papers published over the past several months and you wanted to tell a story of a thousand papers I don't know how a human would do that I well actually I can tell you humans 
simply don't do that what we do is we take shortcuts we sort of we sort of fly blind we grab you know the zeitgeist and and that's that's kind of a random process it's like well I you know I I overheard some conversations and this seems to be a hot topic I'm gonna decide and so I'm going to amplify it and what you end up with our coherent stories but they're not necessarily what actually was the most important thing that happened it's just some some strange sampling of the space of all things that happened that's the best you can do but what if you had a machine that could actually read everything and show you in some sense everything that happened that's the goal so you showed me a kind of a portal into research papers yeah is the idea to provide that as a service or more the platform that allows someone to create that thing so we're in a pretty privileged position were privileged in the sense that we've already got some really big customers so the federal government Walmart Singapore's sovereign trust with several others coming online soon does those are the those are the relationships that actually pay the bills and so we do things like if you have a portfolio manager who's trying to keep track of a ton of companies that portfolio manager needs to stay on top of all the relevant developments in the space roughly defined by all those companies all the news about them maybe SEC filings if you want to assess changes in risk profile it's it's sort of an overwhelming task and so primer basically superpowers those analysts by automating all the things that are really hard and tedious and time-consuming and it basically reduces the cost of curiosity it allows those analysts to not spend half their day reading a million things just to find out what was worth reading instead they can they can see summaries of you know one hundred papers at once get a sense of whether it's worth diving deeper or look at another batch of a hundred papers it you know it also gives alerts with predefined conditions so that you don't lose a second if if something that you know in retrospect is going to be a situation worth knowing about you'll get a heads-up so meanwhile though you can use the same machine machinery that does reading and writing and summarization to do things like the thing I sent you like read all of archive so we do we do have a business model for this system going forward we're going to be developing it into products for for example the pharmaceutical industry but for the time being we just have this this beautiful laboratory where we get to really push the edge of natural language processing tell us more about this archive project that you've built yeah archive is a really good illustration of this problem that we all face of too much information if you ever go to the archive website you basically see a firehose of research coming in archive is amazing because it is literally the place where research gets debuted it's it's the first place you'll see a paper coming out from Google or Microsoft or MIT on topics that are basically going to define the machine learning progress over the next 10 years in retrospect you can look back and you can you can see the timeline of this amazing scientific revolution unfolding but it's not at all human readable like even if you are an expert even if you have a PhD in machine learning you just can't make sense of all of archive you might be able to make sense of the papers in your own subdomain but even there it's tough you got to find them archive isn't designed 
Primer Science is a stab at making sense of that. It's a really hard problem that's well scoped. What it does is harvest all these papers and run unsupervised learning on their content to try to figure out what topics they naturally fall into. Within machine learning, for example, I'm just looking now at some of the latest: the system has discovered that there are not only image reconstruction papers, there are actually 58 papers in this bag on that theme, but also a whole bunch of research on traffic and temporal analysis, something on mathematical optimization, and a whole bunch of papers about semantic segmentation. All of this is happening without an ontology or knowledge base, and you have to have such a system if you want it to work on any corpus of papers. You could imagine building some super-ontology that captures everything there is to know about science, but then it's going to be out of date next month. I wouldn't want to build that thing; maintaining it would be a nightmare. Instead you need a system that does more or less what humans do on a smaller scale. What we do is look at things and sort of eyeball it and say, oh, these are kind of about this and these are about that, and so you get a natural segmentation of the space.
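To make that concrete, here is a minimal sketch of that kind of unsupervised topic discovery: TF-IDF features plus k-means, with each cluster labeled by its highest-weight terms so the names come from the documents themselves. The feature choices, the algorithm, and the cluster count are illustrative assumptions; this is not Primer's actual pipeline.

```python
# A minimal sketch of unsupervised topic discovery over paper abstracts.
# All choices here (TF-IDF, k-means, k=2) are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [
    "We propose a convolutional network for image reconstruction ...",
    "Semantic segmentation of street scenes using deep features ...",
    "Temporal analysis of traffic flow with recurrent models ...",
    # ... in practice, thousands more harvested from arXiv
]

# Bag-of-words features: unigrams and bigrams, dropping very common terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english",
                             max_df=0.8, min_df=1)
X = vectorizer.fit_transform(abstracts)

# Cluster the papers into "natural" topics; the cluster count is a guess.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Label each topic with its highest-weight terms, so the topic name
# comes from the language of the documents themselves.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = [terms[j] for j in center.argsort()[::-1][:3]]
    print(f"topic {i}: {', '.join(top)}")
```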
Then, within each of these topics, it does a time-series analysis. It takes all the news and the social media signal, all the tweets about this research as it was published and afterwards, all the commentary, all the real-time online critique (the peer review that's happening in real time, out in the open), and it tries to detect events. An event can be more than just the publication of a paper. It could be that, for example, a self-driving car crashes somewhere and suddenly the world is looking intensely at an issue related to what we do and don't know about these systems, and some of this research may get pulled into that. If you want to detect that real-world event, you need a system that can actually take all those documents, all those tweets, all those things relevant to the same thing, and figure out how to segment them in time. So it does that too: it tries to figure out what the big events in this space were, how human attention in the world was divided in relation to this corpus of papers. Then it does some other cute tricks to make it useful as you dive into all of this information. It pulls out all the people and tries to tell you what it knows about them, just based on the corpus. Mind you, we're also developing a version of this that builds a knowledge base and actually learns about people as it reads the news and as papers are published, but what I sent you this morning is essentially the out-of-the-box version: I don't know anything about the world, but I know this group of thousands of papers you sent me, and this is what I can tell you about them. These are all the people, these are all the topics, these are the events that all of this information seems to be pointing at out in the real world. And another cute one: if you're finding the jargon really hard to understand, it generates a dictionary for you, a kind of magical dictionary where, if you click on a technical term, it actually shows you who coined that term, when, how it's defined, and some context about how it's used. It's kind of like an Oxford English Dictionary on steroids.

Nice. I'm finding this interview more challenging than most, because as you're speaking I've got the tool in the background and I keep seeing papers that look really interesting.

It's working!

Super distracting. So maybe you can tell us a little bit about the technology that's making it all happen. What does the stack look like? What does the pipeline look like? How are you approaching the unsupervised learning piece?

It all begins with a gigantic Elasticsearch index. I think if you talk to a lot of the people you've interviewed about what's at the bottom of the whole stack, there's often some massive index of documents. We're ingesting news and blogs and tweets and scientific papers every day, and that's the starting point of the whole system; it has this growing corpus. So if you query, as we've done today, on artificial intelligence, for example, the first thing it has to do is retrieve all the information that is relevant. Then it kicks off this pipeline where the first step is to try, with unsupervised learning plus several other steps, to divide all the information up into natural topics. Within each topic it then tries to detect the events in the real world that any of these documents might be referring to. So if you've got a hundred documents, which might be news documents and scientific papers and social media signal about all of the above, you do a time-series analysis on it and you try to figure out, and it's making an inference here, whether there are real-world events that all of this information is pointing at and describing.

It looks at events basically from the perspective of news articles, is that right?

The system you're looking at does, yeah, but you can imagine that any document that has a meaningful publication timestamp and includes a description of or commentary about something that happened in the real world could in principle be mapped to something called an event. The concept of an event here is bigger than what a human would intuitively call an event. It might actually be, for example, an explosion of discussion around an issue. The #MeToo movement is not just one event, right? It's made up of many events, and some of those events might not even be something that could have been observed in one place at one time. But there is a natural segmentation of all the things happening in the world into something that we call events. So that's the theory behind this.
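A toy version of that time-series step might look like the following: given timestamps of documents about one topic, flag unusually busy days as candidate events. The median-based threshold is an assumed rule of thumb for illustration only, not Primer's method.

```python
# Toy burst detection: treat any day with far more documents than usual
# as a candidate real-world "event". The 3x-median rule is an assumption.
from collections import Counter
from datetime import date
import statistics

# Publication/mention timestamps for documents about one topic.
timestamps = (
    [date(2018, 5, 1)] * 3
    + [date(2018, 5, 2)] * 2
    + [date(2018, 5, 3)] * 20   # a burst of attention
    + [date(2018, 5, 4)] * 4
)

counts = Counter(timestamps)
days = sorted(counts)
volumes = [counts[d] for d in days]

# Crude rule of thumb: flag any day with more than three times the
# median daily volume as a candidate event.
median = statistics.median(volumes)
events = [d for d, v in zip(days, volumes) if v > 3 * median]
print(events)  # -> [datetime.date(2018, 5, 3)]
```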
Then, if you click over to Overview, sorry to distract you again, it tries to tell you a story. We've got many versions of this; what you're looking at is basically one of the earliest. If you asked a machine to go and read thousands of things and you gave it a budget of one page to tell you what it learned, this is starting to get at what you'd expect to come back. It's kind of like a technical report: these are the things I learned, these are the big events, these are the big papers, this is what's getting the most attention. Oh, and by the way, my topic analysis has revealed that there are some changes afoot in artificial intelligence, and these are the things that seem to be trending upwards and are really interesting. And by the way, I discovered there's this weird paper that seems to fall in this topic but is deeply connected to this other topic, and that's statistically strange, so I need to tell you about it. And by the way, here are some people who seem to be getting a ton of attention, and here's another person who has collaborated with them on a high-profile paper, and they've never worked together before; that's interesting. So you can see what's going on here: the system has a model of what humans find interesting, and of course we humans at Primer built that in. There's a story logic that underlies this. You don't want a system to tell you everything it learned; that's just going to be another firehose, and you've made no progress. A one-to-one map of the world is not a useful map. You need something that will compress the information and try to tell you a story. That's what the system does.

I think I interrupted you as you were about to start talking about the pipeline you're sending some of this stuff through. Just going back to the beginning, with arXiv, are you ingesting all the arXiv papers, or crawling the site? How does that work?

Paul Ginsparg, who founded and still runs arXiv, is a friend of mine from a good while back, and he uses Primer Science as well. I think he's actually the very first person I made a user account for. He's really helped out over the past year, making sure we have direct access, so we don't have to scrape the site; we basically just pull down the entire day's new papers all in one go. We do the same with news, except it arrives more or less in real time, so we have a more or less real-time stream of the news, with maybe a ten-minute delay, and we've got a real-time stream of all the tweets that are relevant to the space.

And are those via commercial APIs of some sort?

We get the tweets directly from Twitter; we have a data deal with them. For news we actually have several sources. One of the most convenient is LexisNexis: they have a service called Moreover where you can actually purchase a firehose of news, and they do a really good job.

OK, so you pull all that into your Elasticsearch index. Maybe talk a little bit about some of the underlying NLP bits that are enabling all of this.

Yeah. When you kick off a query, what's happening is you're making a lot of reading happen. For example, if you take a look at the topics that have been generated, "text", "word embeddings", "quantum", all of those topic labels were generated by the system; it discovered and chose them from the content of the articles themselves. The first step in any NLP task on documents is to tokenize the entire document. Are you familiar with tokenizing?

Mm-hmm.

You basically discover all the words and punctuation, and you run an analysis that gets you the parts of speech. It's kind of like what you did in grade school when you made sentence diagrams to try to make sense of all the different parts of what someone says. Then a whole bunch of things happen in parallel. There are some things that are useful if you give them a bag of words: you can take an entire scientific paper, or even a thousand scientific papers, and turn them into bags of words, and with that kind of analysis you can, for example, discover the groups of words, the n-grams, that best describe the space, and you can generate a label.
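As a small illustration of that tokenization step, here is a sketch with spaCy, the library John mentions a moment later: tokens, part-of-speech tags, and noun chunks that could seed a human-readable topic label. It assumes the small English model is installed (python -m spacy download en_core_web_sm), and the exact output depends on the model version.

```python
# Tokenize, tag parts of speech, and pull candidate noun phrases that
# could seed a topic label. Output varies with the spaCy model used.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Generative adversarial networks improve image reconstruction quality.")

for token in doc:
    print(token.text, token.pos_)   # e.g. "networks NOUN"

# Noun chunks are a cheap source of human-readable topic labels.
print([chunk.text for chunk in doc.noun_chunks])
# e.g. ['Generative adversarial networks', 'image reconstruction quality']
```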
So if you go into any of those topics, it has decided to give that topic a name based on the language within the topic's own documents. I'm still amazed that it works, frankly. NLP is kind of magical when something makes sense to a human even though the machine didn't really understand it the way you did.

Right, it's kind of magical. Are you using off-the-shelf NLP toolkits, NLTK and that sort of stuff, or rolling your own?

We started off that way. We've been using this tool spaCy from the very beginning. It's free, it's open source, and it's really powerful. What really shocks me is that there are just two people at the heart of the project, a fellow named Matthew Honnibal and a gal named Ines, who live in Berlin, not far from where I lived for a few years; I've gotten to know them a little bit recently. It does the nuts-and-bolts NLP that you need: it will tokenize, but it'll also discover named entities, it'll help you find the people and organizations and so forth. But you need to train it. That's something we have discovered, probably like everyone else: it'll get you started, but then you need to solve your own problems; it's only a starting point. For example, with the people, and all the information we can extract about them so we can tell you a story based on the people in a space, spaCy is one of the things we use early in the pipeline, but then there's a ton of custom code we had to build to get the kind of information spaCy can't get, to clean up the stuff spaCy gets wrong, and to link it with all the other information we're extracting by other means. And it's a mixture of machine learning and good old-fashioned regular expressions. What I find so fun about being at an AI startup is that the goal here is not to generate research papers; the goal is to solve problems really well by whatever means you can.

Which I think is the right motivation to have.

Right. If you're just motivated to publish cutting-edge papers, you don't care if it works. I went to this conference called NIPS, which is essentially where all this cutting-edge research is being debuted, and something that really struck me is that half the stuff people are bragging about doesn't really work practically, or works only in such a narrowly constrained setting that it doesn't apply to the problem you have, or it's computationally intractable, or whatever. And that's fine; the whole point is to debut tomorrow's technology. But it's frustrating when you're trying to build something and you get excited about some new idea and chase it down, only to discover, oh, this actually never could have worked. And I've had that experience. I found a paper using Primer Science, of course; it's a pretty weird situation to have AI eating itself, we basically have an AI system that reads AI papers, which we then use to try to improve the AI itself. We came across a really exciting paper and fully replicated it, and it just doesn't work. And that's OK; that's how it goes in this space. When you're right at the edge of knowledge, it's not all going to work. So we have this principle at Primer of always trying to find the practical solution as quickly as possible. Don't get seduced by ideas that are sexy to talk about but aren't actually solving your problem.
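The mix John describes, a trained NER model plus good old-fashioned regular expressions to catch what it misses, might look something like this sketch. The cleanup pattern for arXiv IDs is a made-up example, and what the out-of-the-box model actually tags will vary.

```python
# spaCy's stock NER gets you started; plain regex catches what models
# tend to miss (like arXiv identifiers). Illustrative only.
import re
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Jeff Dean of Google discussed the paper arXiv:1706.03762 at NIPS."
doc = nlp(text)

people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]

# Good old-fashioned regex for strings NER models won't label.
arxiv_ids = re.findall(r"arXiv:\d{4}\.\d{4,5}", text)

print(people, orgs, arxiv_ids)  # exact tags depend on the model version
```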
I should throw in a plug for my newsletter: I've recently written on this topic of reproducibility in both science and AI, drawing off a recent interview I did with Clare Gollnick on the same topic. But I really appreciate you owning up to that broader pipeline. One of the questions I get a lot when talking with folks about their products or projects is, OK, granted you've applied some great cutting-edge machine learning and AI stuff, but what else is required to make it work? How many heuristics are in and around these tools to actually make them work? So to hear you note that good old regular expressions are used liberally to make sure this all works, I think it's important for people to realize that.

Oh yeah, absolutely. I guarantee you, go into some of the biggest, most cutting-edge groups at giant tech companies, where you'd think they're doing some kind of pristine AI where you just press a button and it understands things, and I guarantee you, look under the hood and there's a ton of regular expressions. Now, that's not to say machine learning isn't the way forward; it totally is. But to make these things work on actual problems, it's still a labor of love.

So you're doing a lot with spaCy and NER, which I'm assuming is a more traditional NLP technology approach. Are you also doing things with word2vec and deep learning based approaches?

Yeah, in particular as we've expanded into other languages beyond English. spaCy is just not going to cut it when you want to make something that understands Russian and Chinese, so we've pretty much had to make a bunch of tools from scratch, and it relies on word vectors and word embeddings. Where things get complicated is actually where you try to pull this all together. If you use deep learning to extract, for example, some pattern in a corpus of 10,000 documents, the harder thing, once you've extracted it, is knowing whether you're right and whether it's worth saying. I can find a bunch of patterns in text pretty easily, but the harder thing is assessing how confident I am that I've found something, that I haven't just mis-extracted it, that it's not just a spurious pattern. And even harder than that: is it worth telling you? How do I square this with my model of what humans are interested in?
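For reference, here is a minimal word-embedding sketch with gensim's word2vec, the kind of representation John says their multilingual tools rely on. The toy corpus and hyperparameters are assumptions, and a corpus this small yields meaningless neighbors; real use needs millions of tokens.

```python
# Train toy word embeddings and query nearest neighbors. The corpus and
# hyperparameters are illustrative assumptions (gensim 4.x API).
from gensim.models import Word2Vec

corpus = [
    ["adversarial", "networks", "generate", "realistic", "images"],
    ["convolutional", "networks", "classify", "images"],
    ["recurrent", "networks", "model", "sequences"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, epochs=50, seed=0)

# Nearby words in the embedding space hint at related concepts;
# with a corpus this small, the neighbors are essentially noise.
print(model.wv.most_similar("images", topn=2))
```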
Where we're headed with this is basically a model of stories, which ultimately is a model of humans. Humans are storytellers; we've evolved to do this thing, and we just take it for granted. What we're doing right now, this conversation, is incredibly high-tech: you and I, in real time, are gliding through a narrative. That's millions of years of evolved technology. It's amazing. So I think this is actually the next frontier of AI: decoding what a story is.

And so what does that mean practically? How are you approaching that?

Here's a bite-sized example. If you make something that reads scientific papers and tries to tell you what you need to know about AI research last week, for example, it's not enough to just give you a dashboard of here's the most shared paper, here's the paper that got the most news, here's the paper that currently has the most citations. That's not doing much heavy lifting for you. If you were to hire a thousand human analysts to just work for you, imagine you had that luxury, what would you ask them to do? That's kind of the better guiding question. What sort of story would they tell you? What would the format be? I guarantee the humans wouldn't come back and give you a dashboard. They would say: OK, the big deal last week is that a self-driving car crashed, and it's kicked off a huge discussion about quality control, about where systemic errors are going to creep in, about how you can make machine learning systems understandable from an engineering point of view, and how we're going to deal with this emerging problem. The people weighing in on this are the following researchers in deep learning, but here are some other people who are very knowledgeable but in an adjacent domain, and we think this is really worth knowing. Meanwhile, by the way, we discovered a paper published by a couple of researchers you've rarely heard of, but it's getting a lot of traction and it seems to be on an emerging topic you're probably going to care about. It has to do with voice recognition, and we know that's an interesting topic, but the more interesting thing is that this researcher is really well known in a totally different field and is just diving into this, and that's unusual, so check it out; here's the paper. I'm just going to go out on a limb here and say you really should read this paper. And by the way, here's a new concept that is creeping into the space that we haven't seen before. This might be a fluke, but I think this is actually something worth knowing about, and so here are five papers you should read. You know, I'm working within your budget here. That's what the humans would do. It's basically like the one-to-two-page presidential intelligence briefing; ideally that's what it would look like. A ton of research has gone into boiling things down to a very tight story, and that's all you need to know.

So the idea, then, is that you've got some kind of generative model for creating these briefings?

Yeah, and it has two steps, at least two steps. One is: what information can I find that's truly relevant, the raw ingredients of a story? And the next step is: how can I synthesize this into an actual story? I have to do text generation, document planning. You give me a budget, a page, a paragraph, maybe you just want a bullet point, and I'll work with it; I'll be able to express this as a story given that constraint.

Kind of going back to our earlier exchange about good old-fashioned heuristics, and I haven't compared one of these briefing pages against another, but how much is generation, and how much is more like templates and things like that?

The philosophy we've followed is to always start fast and doable. Put another way, you always want to start with a model that you can fully understand yourself and implement quickly, so that you have some baseline.
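A deliberately simple baseline in that spirit, a fully understandable template "briefing" filled from extracted facts, might be as plain as the following; all field names and values are hypothetical.

```python
# A template-based "briefing" baseline: fully understandable, quick to
# implement, and a reference point before any learned text generation.
# Every field name and value here is hypothetical.
briefing_template = (
    "This week in {topic}: {n_papers} new papers. "
    "Most discussed: \"{top_paper}\" ({mentions} news mentions). "
    "Emerging term: {new_term}."
)

facts = {
    "topic": "machine learning",
    "n_papers": 412,
    "top_paper": "Attention Is All You Need",
    "mentions": 57,
    "new_term": "self-attention",
}

print(briefing_template.format(**facts))
```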
We've always started with: first, can you do it yourself, as a human, maybe even with no computer involved? If you were to read ten papers and try to say something intelligent about them, for example, or if you were to classify events and I gave you a pile of papers, how would you classify them? What kind of tags would you attach to them? Or, if you were looking for a particular kind of event, could you divide papers into yes and no? Always start with yourself: can you, the engineer, do it yourself? Because if you can't, you're probably going to have a hard time teaching a computer to do it. Then, if you can get some other humans, probably the person just two chairs away from you, to do the same task independently and get the same, or ideally a similar, answer, OK, now you're in good shape. Only then do you start building a computational system to try to do this automatically, and your first stab at that should be something straightforward: a set of regular expressions, heuristics. Can you actually find this yourself using rules that you yourself devised? Then, the only way really to get beyond that, to really tackle increasing complexity, is to have something that will learn on its own. You'll never do that with regular expressions; you have to use machine learning to have a system find patterns itself in a changing world.

So I think you're saying, then, that you're somewhere on the spectrum between templates and machine learning.

Oh yeah, always. In fact, I think the best things out there are always somewhere in the middle. And essentially it becomes a race: can we build something that can learn faster and output better, smarter content than the system we already have? We had a little race recently to try to build an event classifier. A brilliant engineer named Leonard Appleton took a stab at it using just regular expressions, no machine learning, and another brilliant engineer named Josh took on the task of solving the same problem using a really complicated machine learning graphical model. And sometimes John Henry wins the race: frankly, Josh could not build a system, at least last I checked, that could do better than Leonard's massive, complicated regular-expression heuristic engine. But eventually machine learning will win; we all know that. That's the beauty of a practical approach, though. When you're really driven by practical principles, you're willing to say, well, we've got a better solution that's actually simpler and easier to understand, so let's use that for now and keep trying. But it's never long before a machine-learning-based system does better; it's just an incredibly powerful tool.

When you're using machine learning for tasks like summarization, where, as you referenced earlier, first you do it, then you get someone else to do it, and you compare: your summary of a given paper or a given paragraph is likely to be very different from mine. How do you find ground truth so that you can train learning models?

You've really put your finger on the hardest problem. Stories by their nature can be told infinite ways. There are some automated techniques; they've been around for a decade, and they have French color names, I don't know how that came about, but there's something called ROUGE and something called BLEU. What they do is treat the output as a bag-of-words problem, and they try to find out how much information overlap there is between a human summary and a computer summary. As you can imagine, that's great if you're trying to measure whether you got it terribly wrong: if we make two summaries and they have nothing to do with each other, then they're probably not talking about the same thing.
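The core intuition behind those metrics can be written in a few lines: unigram recall of a candidate summary against a human reference. Real ROUGE adds n-grams, stemming, and other refinements; this is just the bag-of-words overlap idea.

```python
# Bare-bones unigram recall, the core intuition behind ROUGE-style
# overlap metrics. Real implementations add n-grams, stemming, etc.
from collections import Counter

def unigram_recall(reference: str, candidate: str) -> float:
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / max(1, sum(ref.values()))

human = "the model summarizes papers using extracted entities"
machine = "the system summarizes papers by extracting entities"
print(round(unigram_recall(human, machine), 2))  # -> 0.57
```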
Though that's not right for summarizing fiction, right? We could be summarizing on two totally different levels and both be right.

That's true, that's absolutely true.

And I think the same holds true for news. I'll let you continue, but that seems like a very rudimentary metric.

Well, you'd be surprised, then, to learn that the latest, greatest papers in this field are still using those metrics, because they're easy; it's a one-click measurement. But it really doesn't help when you want to assess a subtle output, a story that could be sliced and diced in sort of infinite ways. Unfortunately, it becomes a CAPTCHA: you need some human to read it and go, oh yeah, that makes sense, or, that's crazy. But there are some techniques you can use. One is that you can actually crowdsource the assessment of narrative: you can give human annotators and scorers a rigorous system, so you can measure the coherence, the sophistication, whether or not you've really summarized the space well, in various ways.

Those sound like they would require a fairly sophisticated crowdsource.

That's right. The more technical and sophisticated this task becomes, the less you can rely on Mechanical Turk; in fact, eventually you've got your own engineers doing this, so it's definitely not scalable. But there are some tricks you can use. For example, if I generate a bunch of summaries on a topic that has already been summarized, for example if I have a Wikipedia article about it, I can at least find out whether the most important entities in the narrative have been represented. I can also turn the system around and do extraction on the summary itself. You could even, I would suggest, make a generative adversarial network that generates stories and critiques them; you can see where this is going. Eventually you have a system that tries to check off all the boxes of what counts as a good story: you've talked about the most important entities, you've expressed their relationships, you've come in under budget in terms of space. But ultimately you're going to need a human to assess whether it's a well-written story, at least until we can crack the code of text style transfer, where you can actually say, tell me this story in the style of a New York Times reporter, or tell me the story in the style of a terse military briefing.

Summarize my text in Hemingway style.

Exactly. Until we can actually have networks that can both detect and reproduce narrative style, I think we're stuck, for the time being, in a world where it's really hard to assess how well our systems are doing. Ultimately you want to hook this up to your users and either passively or actively harvest their feedback. The simplest version of this, of course, is A/B testing: if you write many versions of a summary and you expose a large number of humans to A versus B, you can just find out what they think of it by, for example, whether they click through and read it. You can also make it active: you can let users say, yeah, that was good, or that was bad.

Or, going back to my Hemingway text summaries, Google Inbox presenting you three choices for the appropriate response to an email.

Yep, and we've played with that as well; we generate alternative summaries to events, for example. It's a really powerful way of doing real-time, effortless quality checking.
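The entity-coverage trick John mentioned a moment ago might be sketched like this: extract entities from a machine summary and check them against a trusted reference text. spaCy's NER stands in here as an assumed extractor, and its exact output will vary by model.

```python
# Sanity-check a machine summary: did the key entities from a trusted
# reference survive? spaCy's NER is an assumed stand-in extractor.
import spacy

nlp = spacy.load("en_core_web_sm")

reference = "Google researchers in London released a dataset with DeepMind."
summary = "A new dataset was released by Google."

ref_ents = {ent.text for ent in nlp(reference).ents}
sum_ents = {ent.text for ent in nlp(summary).ents}

missing = ref_ents - sum_ents
coverage = 1 - len(missing) / max(1, len(ref_ents))
print(f"entity coverage: {coverage:.0%}, missing: {missing}")
```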
You don't want to have to pause your whole engineering operation all the time just to assess how well you're doing; you really want it to be continual. You want to always be eating the output of your own computational systems; we call it dogfooding, and you've got to be real-time dogfooding. The nice thing about Primer Science is that we use the thing I'm building to discover the research that is going to help us make it better. If you keep on using the thing, you are your own quality assessor, and that really helps.

Right, though hard to scale. I wish I could clone myself in some way to assess at a thousand x. Now, one thing that I didn't see in what you've built: it seems like it does a really good job at this kind of meta-characterization of arXiv and what's happening in different categories, but I didn't see it attempting to summarize individual papers, which is the thing Jeff Dean and I were originally talking about. Is it trying to do that somewhere?

Not in what you're looking at, but we are actually working on that summarization problem. We've taken two strategies, and they're kind of running in parallel. One is extractive summarization, where the system is allowed to pull words and even whole sentences directly from the text and then pull them together into a summary. That works extremely well when you have a large number of docs: if you have a hundred documents all about the same thing, extractive summarization is really powerful and really efficient. The alternative is abstractive summarization, where the system writes its own words, often character by character, out of thin air. It has a language model, so it reads all these things and it basically makes a prediction about what it should say next as it generates a summary. A really nice bit of progress in this field that we've been using is abstractive summarization with pointers. The idea here is that you also have a sense of your confidence about whether the word or phrase you're putting into the summary at any given moment is going to be a good choice, and if you're not so confident, you point back to the text and you grab the thing itself. For example, say you had a sentence that read, "one of the most exciting areas of artificial intelligence these days is generative adversarial networks". If "generative adversarial networks" is a phrase you haven't encountered, or your model basically says, I'm not sure I can actually paraphrase that, then you do what a good human writer would do: you just go back and you grab that thing. So you can summarize while also having some of the advantages of extractive summarization.

So, summarizing basically around the entities that you aren't too sure about.

Exactly. It basically becomes a sliding scale between abstractive and extractive. The more confident it gets, the more abstractive and flexible it gets, which will allow you to summarize a single scientific paper, for example, in a couple of sentences. If it's not so sure, it slides over to extractive, and it will just pull out the sentences and phrases it deems the most central and informative.

Interesting.

It's a hard problem, though. It's a really hard problem.
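A bare-bones extractive summarizer captures the first of those two strategies: score each sentence by the frequency of its content words and pull out the top scorers verbatim. The scoring rule and stopword list below are simplifying assumptions, not the pointer-based model John describes.

```python
# Minimal extractive summarization: rank sentences by average frequency
# of their content words, then return the top sentences verbatim.
from collections import Counter
import re

STOP = {"the", "a", "an", "of", "on", "was", "are", "is", "and", "with"}

def extractive_summary(text: str, n_sentences: int = 1) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    freq = Counter(tokens)

    def score(sentence: str) -> float:
        toks = [w for w in re.findall(r"[a-z]+", sentence.lower())
                if w not in STOP]
        return sum(freq[t] for t in toks) / max(1, len(toks))

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the original sentence order so the output reads naturally.
    return " ".join(s for s in sentences if s in ranked)

doc = ("Adversarial networks are widely studied. "
       "A generator network competes with a discriminator network. "
       "The workshop lunch was nice.")
print(extractive_summary(doc))
# -> "A generator network competes with a discriminator network."
```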
Another thing that makes it hard, when it comes to scientific papers, is that they already have their own summaries; they're called abstracts. You'd think, oh great, job done. But as you know, abstracts themselves can be so riddled with jargon and references to arcane things that they're hardly summaries at all; an abstract is really only a summary for the authors of the paper.

Right, so you really need a summary of the summary.

Right, and that's what we're working on. We're finding that you really do need to power this with an ontology and a knowledge base, though.

Elaborate on that.

OK, so let's take, for example, a problem that I'm just starting to work on: how do you summarize and make sense of pharmaceutical research papers? There is an ontology available to everyone, which the NIH basically paid for, called MeSH. It's kind of like every jargon term in biochemistry and molecular biology, gene names and gene types, all of that, captured in a very rich ontology that was hand-built, no doubt, by unthanked graduate students. Something that's really nice about MeSH is that it's actually a subset of Wikidata, and Wikidata is the database that stands behind Wikipedia. Now, I say that in an idealistic way, because that's how it was dreamed up: Wikidata was basically going to be the database that powers Wikipedia. In reality it's not there yet; humans vastly prefer to update Wikipedia with content, and Wikidata basically plays catch-up. Nonetheless, it is a huge, powerful, open-source knowledge base, and the MeSH ontology is a subset of it. So if you want to summarize a scientific paper, just a single scientific paper, the first thing you need to do is make sense of it. You need to map all of those words, which to the computer could be random numbers for all it cares, it has no idea what they mean, to concepts, and that's what systems like MeSH were designed to help us do.

So the idea being, instead of doing this in a totally unsupervised manner, as in Primer Science, here you're using the additional information you're getting from the pre-existing ontology to help the machine make sense of the various documents and paraphrase them.

Right. A good summary is something that doesn't just say less; it also says just as much, but in a compressed way. If I just tell you the beginning of a story, I haven't really compressed that story for you; I need to give you the sense of the beginning, middle, and end, and compress that all down into three sentences. And you're not going to be able to do that just using standard NLP techniques on a scientific paper. You're just not going to be able to do it, no way. You have to map it out to an ontology and say, oh, this long sentence describing this genetic pathway, I can boil that down to a single phrase that says "the genetic pathway X". You need a lot of tacit knowledge to be able to do that. So that's what we're working on.
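A toy illustration of that grounding step: map jargon found in a sentence to concepts in a MeSH-like vocabulary. The mini-ontology below is fabricated for the example; real MeSH has tens of thousands of hand-curated descriptors, and real matching needs far more than substring lookup.

```python
# Map jargon terms in a sentence to concepts in a MeSH-like vocabulary.
# The mini-ontology is fabricated; substring matching is a crude stand-in
# for real concept recognition and disambiguation.
mesh_like = {
    "mapk pathway": "Signal Transduction",
    "erk": "Mitogen-Activated Protein Kinases",
    "gene expression": "Gene Expression Regulation",
}

sentence = ("We show that ERK activation in the MAPK pathway "
            "alters gene expression.")

found = {term: concept for term, concept in mesh_like.items()
         if term in sentence.lower()}
print(found)
# -> all three toy terms map to their ontology concepts
```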
Awesome. Well, John, this has been super interesting. I really appreciate you taking the time. Anything else you'd like to share with the audience?

Just that I'd like to make a prediction.

Go ahead.

I predict that the kind of stuff we're working on is going to accelerate artificial intelligence research more than anything else. I think building AI that can read the latest research on AI, and help the engineers who build it build it faster, is going to vastly accelerate the whole process.

Awesome. Well, we will put your prediction on the blockchain, just to make sure we get all the jargon in. Then we'll do an ICO.

Alright, awesome. Thanks so much, John.

Thanks, Sam.

All right everyone, that's our show for today. For more information on John or any of the topics covered in this episode, head on over to twimlai.com/talk/136. Thanks so much for listening, and catch you next time.