The Power of Post-Processing in Speech Recognition and Translation
Translation involves more than conveying meaning from one language to another. Translation systems expect well-formed input: proper capitalization, punctuation, sentence breaks, and other formatting cues. Because raw speech recognition output rarely provides these, significant effort has gone into bridging the gap between speech recognition and translation by post-processing the recognizer's output before it is translated.
One of the key challenges in this post-processing is accurately identifying sentence boundaries. When a break is inserted in the middle of a sentence, a human reader can ignore the stray period and move on, but a translation system cannot: it treats the two fragments as independent sentences and translates each without the context of the other, which often produces poor results. To overcome this, researchers and developers are improving the algorithms used for sentence boundary detection and punctuation restoration. Getting these basic elements right lets translation systems deliver far more accurate and reliable output.
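As an illustration only, and not the production pipeline described here, the sketch below shows the general shape of such a post-processing step: a boundary scorer re-segments raw, unpunctuated speech recognition output into sentences before each sentence is passed to translation. The boundary_score interface and the toy naive_boundary_score heuristic are assumptions made for the example; in practice the scorer would be a trained model.

```python
from typing import Callable, List

def resegment(asr_tokens: List[str],
              boundary_score: Callable[[List[str], int], float],
              threshold: float = 0.5) -> List[str]:
    """Group raw ASR tokens into sentences using a boundary scorer.

    boundary_score(tokens, i) is assumed to return the probability that a
    sentence ends after token i (in practice, a trained model would do this).
    """
    sentences, current = [], []
    for i, token in enumerate(asr_tokens):
        current.append(token)
        if boundary_score(asr_tokens, i) >= threshold:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing fragment
        sentences.append(" ".join(current))
    return sentences

def naive_boundary_score(tokens: List[str], i: int) -> float:
    # Toy stand-in: end a sentence after a few common sentence-final words.
    return 1.0 if tokens[i].lower() in {"today", "thanks", "please"} else 0.0

if __name__ == "__main__":
    raw = "thank you for joining us today please open your manuals".split()
    for sentence in resegment(raw, naive_boundary_score):
        print(sentence)  # each sentence would then be punctuated and translated
```

The point of the design is that segmentation happens once, up front, so the downstream translator always sees complete sentences rather than arbitrary fragments.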
The impact of accurate sentence boundaries on translation is hard to overstate. When a sentence is correctly identified, it can be translated with its full context, preserving the flow of the conversation or text. This attention to detail is critical in fields such as law, medicine, and finance, where translation errors can have serious consequences.
The European Parliament has recognized the importance of accurate speech translation and is currently evaluating systems from several vendors, including Microsoft, for live transcription and translation of its sessions in all 24 official languages. The project is a significant challenge but also an exciting opportunity: translating speech accurately in real time pushes the boundaries of what is possible with language technology.
The pace of innovation in natural language processing (NLP) has been rapid in recent years. Advances in transformer models and related architectures have enabled significant improvements in translation accuracy and efficiency. Larger models have proven particularly effective because they can learn more complex and nuanced patterns from the data. However, simply scaling up an existing model is expensive, so researchers are also exploring architectures such as mixture-of-experts models, which increase the number of parameters without a proportional increase in compute by letting different parts of the model specialize in different parts of the problem.
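To make the mixture-of-experts idea concrete, here is a toy NumPy sketch of top-k routing: a gate scores each token, and only the selected expert layers do any work for that token. The dimensions, the gating scheme, and the random weights are all illustrative assumptions, not a description of any production architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 1

# Each "expert" is a small feed-forward layer; the gate decides which expert(s)
# process each token, so compute stays sparse even as the expert count grows.
experts = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_experts)]
gate = rng.normal(size=(d_model, n_experts)) * 0.1

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token (row of x) to its top-k experts and mix their outputs."""
    logits = x @ gate                                   # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]       # indices of chosen experts
    weights = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for e in range(n_experts):
        routed = (top == e).any(axis=-1)                # tokens assigned to expert e
        if routed.any():
            out[routed] += weights[routed, e:e + 1] * (x[routed] @ experts[e])
    return out

tokens = rng.normal(size=(5, d_model))                  # 5 token embeddings
print(moe_layer(tokens).shape)                          # (5, 16): same shape, sparse compute
```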
Another area of ongoing research is pre-trained multilingual models. These models are trained jointly on large datasets in many languages, allowing them to learn shared representations of common patterns and relationships across languages. The approach has shown significant promise, particularly for low-resource languages, which benefit from transfer learning off related high-resource languages trained in the same model. By specializing these models for individual language pairs or groups of languages, researchers hope to create systems that can handle a wide range of linguistic challenges.
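For a concrete sense of what using a pretrained multilingual translation model looks like, the sketch below runs the publicly released M2M100 checkpoint through the Hugging Face transformers library to translate Hindi into English. It is an illustration of the general approach, not the production system discussed above, and it assumes the transformers package is installed and the model weights can be downloaded.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"  # one multilingual model covering 100 languages
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Translate Hindi -> English; the same weights handle any supported language pair.
tokenizer.src_lang = "hi"
encoded = tokenizer("जीवन एक चॉकलेट बॉक्स की तरह है।", return_tensors="pt")
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Fine-tuning such a checkpoint on a single language pair or domain is the usual way to specialize it, as described above.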
Finally, there is growing recognition of the importance of handling edge cases and long-tail problems in NLP. Common words and phrases are usually translated accurately; it is the rarer ones, such as names and numbers, that pose the greatest challenge. Researchers are working on systems that handle these long-tail cases more reliably, because even a small error rate on names or numbers is unacceptable in applications such as financial transactions and medical records, where mistakes can have serious consequences.
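As a simple illustration of how a pipeline might guard against one of these long-tail failure modes, the sketch below checks that every number in a source sentence also appears in its translation. The regular expression, the normalization, and the translate stub are assumptions for the example rather than part of any real system.

```python
import re
from typing import List

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def numbers_in(text: str) -> List[str]:
    """Extract numeric tokens, normalising separators so 1,000 matches 1.000."""
    return sorted(re.sub(r"[.,]", "", n) for n in NUMBER_RE.findall(text))

def check_number_fidelity(source: str, translation: str) -> bool:
    """Return True if the translation preserves every number in the source."""
    return numbers_in(source) == numbers_in(translation)

def translate(text: str) -> str:
    # Hypothetical stand-in for a real machine translation call.
    return "Le paiement de 4500 euros est dû le 3 mars."

src = "The payment of 4,500 euros is due on March 3."
hyp = translate(src)
if not check_number_fidelity(src, hyp):
    print("warning: numbers differ between source and translation")
else:
    print("numbers preserved:", numbers_in(src))
```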
Neural hallucination is another phenomenon being studied in NLP. It refers to a model producing output that is fluent and coherent but unrelated to, or unsupported by, the input. This is especially dangerous in translation: a reader who does not understand the source language has no way to tell that a fluent translation says something entirely different from the original. Researchers are therefore developing techniques to detect and mitigate hallucinations so that systems provide reliable and accurate results.
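One simple, admittedly crude family of detection heuristics flags translations whose length or lexical overlap with the source falls outside expected bounds. The sketch below illustrates the idea; the thresholds and example sentences are illustrative assumptions, and real systems would use stronger signals such as model confidence or cross-lingual similarity scores.

```python
def length_ratio_suspicious(source: str, translation: str,
                            low: float = 0.4, high: float = 2.5) -> bool:
    """Flag translations that are far shorter or longer than the source."""
    ratio = max(len(translation.split()), 1) / max(len(source.split()), 1)
    return not (low <= ratio <= high)

def overlap_suspicious(source: str, translation: str,
                       min_overlap: float = 0.05) -> bool:
    """Flag translations sharing almost no tokens with the source.

    Very rough: digits, names, and cognates usually survive translation,
    so near-zero overlap can indicate the output ignored the input.
    """
    src = set(source.lower().split())
    hyp = set(translation.lower().split())
    return len(src & hyp) / max(len(src), 1) < min_overlap

def maybe_hallucinated(source: str, translation: str) -> bool:
    return (length_ratio_suspicious(source, translation)
            or overlap_suspicious(source, translation))

print(maybe_hallucinated("Invoice 1234 is due on 3 March 2024.",
                         "The weather has been lovely lately."))      # True: no shared tokens
print(maybe_hallucinated("Invoice 1234 is due on 3 March 2024.",
                         "La facture 1234 est due le 3 mars 2024."))  # False
```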
In conclusion, the field of NLP is advancing rapidly, with significant progress in speech recognition, translation, and related areas. By improving sentence boundary detection, exploiting pre-trained multilingual models, and addressing long-tail edge cases and neural hallucination, researchers aim to build more accurate, efficient, and reliable language-processing systems. These advances have the potential to transform a wide range of applications, from medicine and finance to education and entertainment.
"WEBVTTKind: captionsLanguage: enall right everyone i am here with arul menaces meneses arul is a distinguished engineer at microsoft a rule welcome to the tuomo ai podcast thank you sam i'm delighted to be here i'm really looking forward to our chat which will focus on some of the work you're doing in the machine translation space uh to get us started i'd love to have you introduce yourself and share a little bit about your background how did you come to work in nlp and translation and tell us a little bit about your story yeah so i've actually been at microsoft uh 30 years at this point uh i yeah i know oh god it's a long time uh i um i was actually in a phd program i came here for the summer loved it so much i never went back so i worked in microsoft in the various engineering teams for a while and then eventually i drifted back into research and i joined the natural language processing team and microsoft research and i started the machine translation project and i've been doing that ever since i've been doing machine translation for like 20 years now and it's been it's been a great ride because it's just a fascinating field so many interesting challenges and we've made so much progress from like when we started um you know and we've gone through so many evolutions of technology it's been it's been a great ride yeah yeah there are some pretty famous examples of um you know how the introduction of deep learning uh has changed uh machine translation uh i'm assuming that your experience there is no different can you share a little bit about how the the evolution that you've seen over the years sure sure i mean historically you know machine translation is something people tried to do you know in the 50s uh it was one of the first things they uh wanted to do with computers you know along with simulating sort of nuclear uh sort of bombs but for the longest time it was very very hard to make progress so all the way through i would say the late 90s early 2000s we were still in sort of rule-based and knowledge sort of engineered approaches but then the first real breakthrough that came in the late 90s well actually starting a little earlier in terms of some papers published at ibm but but really taking off in the late 90s and early 2000s was statistical machine translation where for the first time you know we were able to take advantage of like large amounts of previously translated data right so you take um documents and web pages and things that that have previously be translated by people and you get these parallel texts which is uh let's say english and and french and and you align um documents and sentences uh and then eventually words and phrases so you can learn these translations and so uh with statistical machine translation we would we were learning from data for the very first time instead of having people hand code it and it worked actually astonishingly well compared to what we were doing before but eventually we ran into the limits of the technology because while we had the data we didn't have the techniques to do a good job of learning what that data was telling us because you know the machine learning techniques that we had back then uh just weren't good enough at they were good at memorizing right they could they could ex if if you said something exactly the way they had seen in the data they would do a good job of translating it but they were terrible at generalizing from what they saw in the data and that's where neural models come in like neural models are amazing at 
generalizing you know people always talk about how uh some of the latest models you know you can probe them to uh figure out what was in their training data and and get them to reproduce what was their training data but what we forget is it takes work to actually make them do that because most of the time they're generalizing they're paraphrasing they're not just replicating their training data and that's something we were not able to do before um so so if you look at the evolution over the last 20 years of machine translation we had a statistical machine translation which did really well for a while but then eventually plateaued then you know we had sort of the advent of neural networks and the first thing that people tried to do was you know we did feed forward neural networks we tried to shoehorn it shoehorn them into the framework we already had and combine sort of feed forward networks and statistical techniques and that was okay you got a few incremental improvements but it wasn't till we had uh the sort of pure neural lstm models that we um for the first time were really capturing the power of neural models right so what an lstm model would do would be you know you have this encoder that you feed the the the source language sentence in and it basically embeds the the meaning of that entire sentence in the lstm state and then you feed that to a decoder that is now generating a fluent sentence uh sort of based on this very abstracted embedded understanding of what the source language said and so that's very different from the way we were doing it just sort of copying words and phrases that we'd memorized um so that was the first revolution and and it gave us amazing results actually uh compared to what we were doing before and then of course a long after that came transformers which uh so sort of take that whole encoder decoder architecture but take it to the next level instead of having the meaning of the entire source sentence be encoded into a single lstm state which may work well for short sentences but gets you know worse as you get longer sentences in a transformer you know we have the self-attention that's basically looking at every word uh in the source and every word in the target and um so you have like full context available to the model at uh at any point in time and so um so that's where we stand today as you know transformers are the state of the art but of course there's lots of really cool interesting variations and things we're doing which i think we're going to talk about at some point and when you talk about transformers being the state of the art is that what is powering the current kind of production azure machine translation service or is that the state of the art in research and you know there's some combination of the various techniques you mentioned that is powering the live service so the live service is very much powered by transformers um we have you know all 180 uh language pairs or something that we support uh powered by uh transformers running in production now one thing we do do is that um we take advantage of what's called knowledge distillation right to um uh take the the the knowledge that's embedded in these very large transformers that we train offline and then condense that or distill that into a smaller still transformers but smaller shallower and and narrower models that we use in production right so um we typically go through multiple stages of these teacher models before we get to the student so our our pipeline is actually fairly complex we 
take uh the parallel data which i mentioned earlier which is sort of the lifeblood of machine translation this is the previously translated human text and we train like a first teacher based on that data then we typically do what's called back translation which is um a technique in machine translation to take advantage of monolingual data so data that's not parallel so it's not translated source and target it's just in one language typically the target language and what we do there is we want to take advantage of this monolingual data to teach the model um more about the syntax of uh and the you know semantics off the target language so it gets more fluid and the way we incorporate that data into a machine translation model is to something called back translation where we take um the the the target language data we translate it back to the source using one of our models and then we use it to train the the model in the other direction so this is a little complicated so basically if you if you're training an english to french model in addition to the parallel english french data you also take some french monolingual data you translate it back to english using your other direction translation system the french to english system and then you put that synthetic data back into training your english french system okay so um yeah so that's essentially a data augmentation technique it is yeah yeah it's a it's a data augmentation technique and it works like incredibly well actually adds several points to our metric the metric we use in machine translation is called the blue score um i mean there are other metrics and i mean i can talk about that at some point if we want to get into it but uh but you know we get several points of blue score out of the back translation and then so that's our final sort of teacher model which is typically huge and then what we do is we we use that model to teach the student model and the way we do that is essentially we run like a huge amount of text through this teacher model and then we take the the data generated by the teacher and we train the student on it and the reason that works is because unlike sort of natural data that we've that we train the teacher on which can be confusing contradictory uh diverse the the data generated by the teacher is very uniform and it's very standardized and so you can you can use a much uh simpler student model to learn all of that knowledge from the teacher because it's a simpler learning problem and having done that that model runs like super fast and we can host in production and translate like trillions of words you know so yeah and so the the student teacher um part of the process is interesting to explore a little bit further are you essentially trying to do something the task that you're trying to or goal that you're trying to achieve with that is model compression right very different approach to it than like pruning or you know some of the other ways you might approach completely right yeah so we do like uh we do a lot of different things for model compression right so one of the things we do is we we do quantization for example we run all our models in eight bits we've experimented with less than eight bits it's not quite as effective um uh but you know uh we do that uh we do some other like pruning techniques as well but the biggest one is the is the knowledge distillation and what you're trying to do there is get a smaller model to basically mimic the behavior of the of the big teacher model just running a lot cheaper 
and by combining all the techniques we published a paper last year on this um at uh at a workshop and uh from our big teacher with all of the knowledge distillation the compression the uh the quantization and so on we're running something like 250 times faster on the student than the teacher with i mean there is a small loss in quality right but we lose maybe half a blue point not not too much um and in some cases not even any we can like actually maintain the quality uh as is so my my next question for you is it the way you describe the process um and in particular the idea that the teacher is outputting more consistent examples than what is in the training data my my next question was or the intuition that i had was that that would cause the student to be far less effective at generalizing and would make it perform worse but it sounds like that's not the the case in practice so the key to that is to make sure that the data that you feed through the teacher to teach the student is diverse enough to cover uh all the situations that you may encounter right so the students are a little weird i mean and i think you're sort of hinting at that we do for example over fit the student to the training data which is something that you typically wouldn't do in your teacher model because you in fact are trying to make the teacher match the student as much as possible so some of the things that you do to um uh you know to make the to the teachers better at generalization you don't do in the student uh and in fact if you look at the student distributions they're much sharper um than the teacher distributions because they've they have overfit to the data that they've seen um but um you know there's a little evidence that you can get into some corner cases that are brittle like um you know there's this problem of neural hallucination that all of the neural models are subject to where you know occasionally they'll just output something that is completely off the wall uh unrelated to anything that they've seen and there's some evidence that there's a little bit of amplification of that going on like if it's you know the teachers are also subject to illustration but maybe at a very very low frequency and that maybe that's being amplified a little bit in the student um so we're you know we're working on managing that but but yeah so there's there's you know it's a trade-off like the students have lower capacity but that's what enables us to run them and we you know we run them on cpu we don't we don't use gpus in production uh inference we use of course all models are trained and and all the knowledge decisions are done on gpus but but uh in production we're just using cpus and is is that primarily uh based on the cost benefit analysis or is it based on the latency uh envelope that you have to work with and not needing not wanting to kind of batch uh right inference requests yeah that's exactly right i mean um you know latency is a big concern our api is a real-time api and so uh you know latency is the biggest driving factor and uh honestly if you do inference on gpus it's uh you know you get some latency benefit but the big benefit is on large batches and so unless you have a matched batch translation api you can't really take advantage of your the full capacity of your of your gpu so you know in a real-time api yeah and are both the the teacher and the student models transformers yeah they are they are yeah the students are you know transform or large or a little bit larger and then this uh sorry that's 
the teachers and then the students are they're very highly optimized um transformer i mean we start with transformer base but then we do a lot of really strange stuff um i would refer you to the paper actually okay um yeah when you were describing the uh the data augmentation technique that you use right uh it kind of called to mind ideas about incorporating a gan type of approach where you're you're doing the pass back translation and then you know maybe there's some gan that is trying to figure out if the the result going backwards right is uh like a human translation is there a role for that kind of technique do you read is that something that comes up in the research yeah so we've we've looked at gans there's been there were some exciting results but uh but in the end i mean i think we have some okay research results we haven't uh seen much benefit but uh more broadly in terms of um data augmentation um we're using it uh all over the place right so it's we have the back translation but there are a lot of phenomenon that uh we want to address in machine translation that is maybe not well represented in the data and so we use data augmentation pretty heavily to cover those cases right to give you a simple example when you translate a sentence and you get a particular translation and then you go in and let's say you remove the period at the end of the sentence sometimes it changes the translation entirely they may both be perfectly good translations right but they're different and so one way to look at it as well they've got both good translations but but people don't like that so if you look at our customers and we're very sensitive to what our users you know uh the feedback we get from our users uh so so one of the feedback they got we got was that uh you know we want a little more stability in our translation so you know just because i i lost the period at the end of the sentence i shouldn't get a drastically different translation and so uh you know it's very easy to augment the data and say well you know stochastically i'm going to like delete the period on my sentences and so then the model learns to basically be robust whether there's a period or not now of course you know that's different with a question mark you definitely want to leave the question mark in because that changes the meaning of the whole sentence but you know things like that punctuation the period commas things like that uh maybe you know capitalization uh for example one of the the other examples would be like an all cap sentence you know you take the whole sentence and you change it to all caps uh you get a totally different translation right so so we again generate some synthetic all caps data so that the model learns to do a good job of translating that as well uh and then there's you know there's all these like i i would call them you know long tail phenomenon that uh and you know we feel that data augmentation is a good way to address some of these yeah your examples are really interesting to me uh because i'm refer i'm comparing them to like your textbook nlp types of examples where the first thing you're doing is making everything lowercase and getting rid of all of your punctuation yeah sounds like that does not work for translation no because there's a lot of information in in casing and punctuation right like i mean if you want to handle names for example you need to pay attention to the to the case uh of the input um and uh you know yeah so there's like everything in the input has information and so 
actually even the punctuation right like sometimes if you take the period off at the end of the sentence it should change things because it may be a noun phrase rather than an actual sentence right so so it's not so much about pre-processing the data and trying to be clever it's about exposing the model to different variations so that the model can figure things out for itself yeah um one of the the questions this prompts is like the the unit of you know work or the unit of um thing that you're trying to translate you know translating a word being different from translating a sentence being different from translating an entire document right sounds like most of what we've been talking about is kind of phrase by phrase now relative to the word by word that you know we were doing 20 years ago um but are you also looking at the entire document are you able to get information from a broader context to impact the translations yeah so that's that's a very good question sam yeah so the context matters a lot right so um one of the reasons why um neural models are so great at translating now is because they they're looking at the whole sentence context and they're translating the entire context the sentence uh and every they they're basically sort of figuring out the meaning of every word and phrase in the context of the whole sentence which is something we couldn't do with statistical machine translation before uh so now uh the next step is to expand that context to beyond the sentence right so there are a lot of phenomenon that it's impossible to translate well without context beyond the sentence right in many languages unless you have a you know a document level context or paragraph level context you can't generate the right pronouns because you don't actually know the sentence doesn't have enough clues to let you know what is the gender of the the subject of the object or the person you're talking about in that sentence beyond just the pronouns it's also like um you know the senses of words and uh you know disambiguating those so we're we're actually moving towards uh translating at the whole document level context um uh or at least you know very large multi-sentence fragments and then uh there we'll be able to use um you know the the context of the entire document to translate each individual sentence and we actually have uh some really great research results um based on uh translating at document level uh uh you know and uh yeah so we're pretty excited about that that model is not in production yet but it's something that we're working on um we did ship a document level api i think it's in public preview right now which uh addresses sort of the other half of the problem which is you know people have documents uh they've got format formatting you know it's in pdf it's in word it's in powerpoint whatever and uh html and uh it's a hassle getting all the text out of that getting it translated and then we're still trying to reassemble the document and reconstruct the formatting of that document on the translated thing so so we've made that easy we just shipped this api you just give us your pdf we'll tear it apart we'll do the translation we'll put it back together and we'll preserve the format and you know especially for pdf that is actually really hard doing the format preservation is uh is is tricky um but we're pretty excited about that api and so then that's the place where our document level neural model would fit right in right because now we have the users giving us a whole document we can 
not only handle all this stuff about the formatting and all that we can go one better we can actually use the whole document context to give you better quality translations can you give us a an overview of some of the techniques that go into looking at the entire document when building the the model yeah so there's i mean right now as i said we haven't actually shipped this so we're looking at a bunch of variations um there's you know there's several things that people have looked at like what you know there are hierarchical models where you um where you do the you run transformers at the sentence level and then you're on a second level to sort of like collect the sentence level information into like a document level uh context vector and then you feed that back into translating each sentence um i mean you know we were finding that actually if you just make it like super simple and and uh treat the whole thing as as if it were a giant sentence in effect um you get really good results you do have to deal with the performance issues right because transformers are n squared in the size of the input and the output so instead of you know handling uh you know a 25 word sentence if we're now uh translating a thousand word per you know document or paragraph then um you know that's the you know you've got like an n squared problem in terms of the performance right it's going to be uh that much more expensive um so but we have we have things that we're looking at to make that faster as well um so so yeah so so we're pretty optimistic we can do that um uh and and i i think we can do that with just letting the transformer figure it out for itself rather than trying to be very clever about all this hierarchical stuff nice yeah um let's talk a little bit about the role of different languages so you um you know we we've already talked about how you can use back translation to help um you know augment the performance of your uh the translation of a language in in one direction or the translation between a couple of language pairs oh uh is are there ways to take advantage of the other 130 or languages that you support when you're building the n plus one model for a given language abs absolutely absolutely that's been one of the most exciting things i would say that came out of sort of transformers and and neural models in general is uh the ability to do this sort of transfer learning between languages right and and the reason we can do that is because um you know transformers or neural models in general are representing the meanings of words and sentences and phrases uh as embeddings in you know this this uh space and there's by training on multiple languages together you can actually get the representations of these languages to merge and have the similar concepts be represented through uh uh you know uh some like basically uh related points in space in that space right um so as a practical matter we've basically found that if we group languages by family right and take so for example we took all our indian languages and we put them together and we train one joint model across all of the languages and now we're talking you know languages where you have a very different amount of data you have hindi where we have quite a lot of data and then we have like assamese which is i think the last one that we shipped that has you know probably like two orders of magnitude less data and the the wonderful thing is that by training them jointly the assamese model learns from the huge amount of data that we have for 
hindi and does like dramatically better than if we had uh just trained on assamese by itself in fact we've done those experiments and you know for the smaller languages we can get like five ten blue points which is like a crazy level of improvement um just from the transfer learning uh and multilingual um we also do that with like arabic uh all of our middle eastern languages so we're just like grouping more and more language families together and getting huge benefits out of this and the when you're grouping the the language families have you ever do you experiment with going across language families and seeing if there's some improvement uh yeah but there yeah so we you know we've trained models that have like 50 or 100 languages in them what you run into is um you know as you uh add languages you have to increase the size of your vocabulary to accommodate all of these languages and you have to increase the size of the model because at some point you run into model capacity limits so you can have a model that uh does a really nice job of learning from 50 or 100 languages but it gets to be a really huge model um and so in terms of cost effectiveness we found that like you get like almost all of the benefit of the transfer learning at like a much reduced cost by just grouping um you know 10 15 languages at a time and if they're related it's better but actually even if they're unrelated it still works it's quite amazing how well it works even if the languages are not related yeah you may think of it as like a computational test of chomsky's universal grammar and you know these ideas that suggest that all languages have these common elements yeah if you are able to train these models across languages and improve them that would seem to support those kinds of theories i mean definitely the models do a really good job of bringing uh related concepts together in the in the embedding space right yeah and so is the would you consider this you you reference this is like multilingual transfer learning would you also think of it as a type of multi-task uh learning as well or is is that not um is it not technically what you're doing in this task so so we're also doing in addition to multilingual just machine translation we're also doing uh multilingual multi-task learning and what we're doing there is we are uh combining um the sort of um so there's let me back up a bit there's been this whole strain line of research based on on models like bird right pre-trained language models where um if you look at bert it's actually the encoder half of a machine translation model but it's trained on a monolingual data strain on uh on a single language data on this objective that's a reconstruction objective uh where you know you're given a sentence where you have a couple of uh words or phrases blanked out you need to predict that right and then you have multilingual bot where you take multiple separate monolingual corpora right so it's like a bunch of english text a bunch of french texts and all that and you train them jointly in the same model and it does a pretty good job of actually pulling the representations of those things together so that's one line of research that's sort of really driven um a revolution in like the whole natural language understanding field right so for example today if you want to train a named entity tagger you wouldn't start from scratch on your on your named entity data you would start with a pre-trained model so one of the things that we're very excited about is we have this 
project that we call z code where we're bringing the um the machine translation work and this sort of pre-trained language model bird style work together right and we train we're training this this multi-task multilingual model that's architecturally it's just a machine translation model right but in addition to training it on the parallel data let's say the english french data and the english german data and conversely the german english data and the french english data and you know 10 or 15 or 50 or 100 other languages in addition we have a second task where we have the birth tasks we take monolingual data and we we have it reconstruct uh you know the missing words and we also have what's called a denoising autoencoder task which is where you give it a scrambled sentence and then it has to output the unscrambled sentence through the decoder and then now you have these three tasks and we train them in rotation on the same model so they're sharing parameters so the model has to figure out how to use the same parameters to do a good job of the birth task to do a good job of the denoising autoencoder task as well as to do a good job of the machine translation task and this we find leads to like much better representations that work for better natural language understanding quality but also better machine translation quality nice and the bert task in this example is within the same language as opposed to right across the the the target to the target language yeah there's actually like a whole family of tasks right i mean people have come up with i mean we've experimented with like 20 25 tasks right so you can do a monolingual mask language model task which is the birth task but you can do a cross-lingual mask language task as well and you can do the denoising autoencoder task monolingually where you have to reconstruct the same language but you can also do that cross-lingually where you have to reconstruct a sort of a scrambled foreign language task so there's like a real like sort of stone soup approach here where people are just throwing in all kinds of tasks and um and they all help a little bit uh but you know it's this we need to figure out like what what's the minimal set that you need because you know it's it's it's work it's computational expense to train these huge models on all these tasks so if we can find the minimal set that works uh that would be ideal and so far what we're working with is like a denoising auto encoder a mass language model and a machine translation task very very cool very cool one of the things that i think as users of machine translation we run into is that it works great in kind of these general contexts right um kind of everyday language but when you start to get into specific domains right harder for the systems to keep up is there any work that you're doing in that area so i think one of the things that um you know often users of these kinds of machine translation services experiences that uh you know they work great in the general case but when you start to try to apply them to specific domains it's a lot more challenging you know and kind of the technical you know technical uh conversations or translating you know medical conversations or you know construction or what have you um is there anything that you're doing to make the domain specific performance better for these kinds of systems yeah definitely uh you know domain performance in specialized domains is a real challenge and we're doing we're doing several things to to get better there right so the 
the first thing is that uh the quality is really determined by the availability of data right so in the domains like let's say news or web pages where we have a ton of data you know we're doing really really well and then if you go into a more specialized domain like let's say medical or legal uh where we don't have as much data we're maybe not doing quite as well and so one of the things we're doing is we're now taking the same neutral models that are good at translation and we're using them to identify uh parallel data in these domains that we can find on the web that we maybe weren't finding before and we can do that because uh these models uh you know because the representations are shared in the multilingual models uh they're actually very good at identifying uh training data that potential training data that that is uh translations of each other so that's one thing we're doing the other thing we're doing of course is the same kind of transfer learning approach that we're using cross-lingually applies within domains as well right so if you have a small amount of medical domain data you don't want to like just train a model that's based just on that that you know small data what what we're doing instead is we're taking you know our huge model is strained on a ton of like general data across a bunch of domains and then you find unit for the specific domains that uh you're interested in and we actually have a product called custom translator that our that we have like you know thousands of customers using where they are using this approach to customize the machine translation to their company or their application needs right so let's say you're a car company or something and you have a bunch of data that's about like automotive manuals right so you come to our website you you log in you create account etc you upload this data and then what we do is we take your small amount of domain specific data we take our large uh model and then we fine tune it to that data and now you have uh like a model that does like you know sometimes dramatically again 10 15 20 blue points better than the baseline because you know we've learned the vocabulary and the the specifics of your domain but we're still uh leveraging we're standing on this platform of like the broad general domain quality so that's been extremely popular and valuable actually we just shipped a new version of that based on transformers a couple of months ago and in that case the user is presumably bringing translated documents uh so that that you're able to train or fine-tune all with both source and target translations yeah that's exactly right i mean a lot of the companies that we work with have some data right like let's say they had a previous version of their uh vehicle or you know uh whatever and they had manuals that were translated in microsoft's case for example you know we have let's say the manual for microsoft word going back you know a couple of decades and this is the kind of data you can use to customize it so that anything uh any new content that you want to translate can have like a very consistent like vocabulary and and tone and so on yeah and then in that first example or the first technique that you mentioned that sounds really interesting so you've got this um you know this index of the web and bing you know for example or maybe you have a separate one but you gotta have this mechanism to kind of crawl the web and it sounds like the idea is that you can use the model to identify hey i've got these two documents they 
look really similar but there's a high percentage of words that i don't know that occupy similar positions in the same documents yeah and then you have someone translate the oh well actually then once you know that you can just uh align them so to speak and you've got a more domain specific document to add to your training set is that the general idea yeah i mean it's it's like you're trying to find two very similar looking needles in a very very very large haystack right so so you have to have a magnet that finds exactly those two needles and rejects everything else so um the the the cross-lingual embedding space is pretty key here right so you basically uh in principle if you embedded every single sentence or document on the web and then were able to look at every single document and find all of its very similarly close embeddings uh you'd be done but you know that's that's easier said than done easier said than done right so so that's the that's the kind of thing that we're trying to do at scale right is like you got these you know trillions of documents and uh you know we want to find the matching ones you need to do it efficiently and so there's a lot of like you know clever engineering that goes into like indexing this stuff and and like computing the the uh embeddings efficiently and of course also you know we're not really trying to match every page on the web to every other page in the web because you have um you know a lot of clues that says well you know if i have a document here you know is it likely to have a translated document somewhere it's going to be either in the same like top level domain or you know related sites things like that so there are ways to constrain that search yeah our conversation thus far is focused primarily on uh text translation are you also involved in voice translation yeah so we actually uh have been doing speech translation for a while um we several years ago we shipped a feature for speech translation uh in skype called skype translator it was you know really well received uh super exciting you know a lot of people use use it uh even today right um especially you know people talking to their relatives in another country and um yeah so uh you know there's a lot of interesting challenges in speech translation because it's not that you just take the output of a speech recognition system and then just pass it to machine translation right there's a there's a real mismatch in what comes out of speech recognition and what what is needed to do a good job of translation because of course translation is expecting like you know well-formatted text capitalization punctuation sentence breaks things like that so we put up we put a lot of effort into bridging that gap um you know post-processing the output of speech recognition so that we have uh you know really accurate sentence boundaries so that that matters a lot when you break a sentence in the middle and you try to translate like if you break a sentence in the in the middle the speech recognition itself is okay because as a human reading it you know there's a period in there you just ignore it and move on but the but the machine doesn't know that and so when you're trying to translate it you've got these two separate um these two separate um uh sentences and it does a terrible job of it so doing getting the sentence breaks right getting punctuation right uh and and so on uh is really important and so so that's what we've been doing we actually have a project going on now with the european parliament uh 
where they are going to uh be using uh our technology well it's it there's three uh contestants or three bidders in this project and so there's a evaluation that will happen in a few months but we're hoping that they'll adopt our technology for live transcription and translation of the european parliament sessions in all 24 languages of the european parliament which is super exciting oh wow yeah so when you think about uh kind of where we are with you know transformers and some of the innovations that we've talked about and you know relative to your 20 30 years into space and i'm curious what you're most excited about and where you see it going yeah i mean the pace of innovation has just been amazing um there's so many things that are happening um that like you know would be a a really dramatic impact right so so one is just much larger models right like as we scale up the model we see continual improvements and so as the hardware and the and the you know our ability to serve up these larger models keeps growing the quality will also keep growing right the the architecture of these large models also matters right like it's not just a matter of taking the smaller model and scaling it up exactly as is so there are things like mixture of experts models that for example uh allow you to scale the number of parameters uh without the cost scaling as linearly right because each you have parts of the model that specialize in different parts of the problem um and then you know multilingual is definitely the future pre-trained models is definitely the future right so so like you put that all together like pre-trained multi-lingual multi-task trained uh maybe with mixture of experts huge models and then we would you know specialize them for uh individual language pairs or groups of languages and then distill them down to something we can ship so um you know that's that's one area that there's a lot of innovation happening the other thing is that you know 10 years ago uh people were just amazed that translation worked at all right uh and now we're doing a really good job and the expectations have risen so you get to the point where a lot of sort of smaller let's call them long tail problems um start to matter a lot right so if you look at translation of names we probably get them 99 right right but you know a few years ago it would have been fine to say hey we're 97 accurate on names but maybe now that's not good enough right like one percent screwing up one percent of the names is not acceptable so you know how do we get that last one percent of names uh and you know i'm just making up the it may be 99.9 percent you're still going to have upset customers if you get you know 0.1 percent of your names or your numbers numbers are even worse right like if you if you uh misstate a number even like point one percent of the time you could have catastrophic consequences right so that's uh that's an important area um i mentioned neural hallucination before that's something we see where again it may happen only 0.1 percent of the time but if you get like a completely unrelated sentence that has nothing to do with your input but it's really fluent uh it's pretty deceptive right like because especially if i'm just putting my faith in this translation that and i don't understand the source language at all you'd be like oh sounds okay and move on but maybe it says something completely different from what the source uh said right and so that's that's a challenge um yeah i mean there's lots of really cool things 
happening in this space awesome awesome well arul thanks so much for taking some time to share a bit about what you're up to very cool stuff thank you you're welcome sam thank you happy to be on on the show take care bye thank you you\n"