How Alexa Works (Probably!) - Computerphile

The Art of Conversation: A Simplified Exploration of Voice Assistants and Natural Language Processing

In this exploration, we will delve into the world of voice assistants and natural language processing (NLP), focusing on the key components that enable these devices to understand and respond to human input. Our journey begins with a simplified breakdown of how a typical conversation takes place.

The Initial Parsing Stage
-------------------------

At the outset, when you interact with a voice assistant such as Alexa or Google Assistant, your words are picked up by a device, often a smart speaker such as an Amazon Echo. The device's microphone captures your audio, and the captured signal is then passed on for analysis. Here, specialized algorithms and software take over, parsing your sentence into its constituent parts.

This parsing stage involves identifying specific keywords, understanding context, and disambiguating ambiguous words or phrases. It's a critical step in determining the meaning behind your spoken words. The parser breaks down your sentence into meaningful components, such as the command and its object. In our example, "Could you tell me what's on my shopping list?" would be parsed into its individual elements: the command ("tell me") and the object ("my shopping list", as opposed to someone else's list), while politeness markers such as "could you" are likely discarded as redundant from the system's point of view.
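As a toy illustration of this stage, the sketch below maps a transcribed sentence to an intent and slots. The intent names, patterns, and the whole keyword-matching approach are invented for the example; real assistants use trained statistical models:

```python
import re

def parse_utterance(text: str) -> dict:
    """Toy NLU: map a transcribed sentence to an intent and slots.

    Illustrative only -- it shows the idea of discarding politeness
    words and keeping the parts meaningful to the system.
    """
    text = text.lower()
    # Politeness markers like "could you" carry no intent, so drop them.
    text = re.sub(r"\b(please|could you|would you)\b", "", text)

    if "tell me" in text and "shopping list" in text:
        # "my" vs. someone else's name decides whose list is meant.
        owner = "self" if "my" in text.split() else "other"
        return {"intent": "read_shopping_list", "slots": {"owner": owner}}
    return {"intent": "unknown", "slots": {}}

result = parse_utterance("Could you tell me what's on my shopping list?")
print(result)
# {'intent': 'read_shopping_list', 'slots': {'owner': 'self'}}
```

The point is not the matching technique but the output shape: a structured intent plus slots is what the next stage, the dialogue manager, actually consumes.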

The Role of the Dialogue Manager
--------------------------------

With the parsing complete, the dialogue manager takes over. This component is responsible for understanding the context of the conversation, determining the user's intent, and generating a response. The dialogue manager integrates the parsed information with additional resources, such as data from Amazon services or web resources, to provide a more informed response.

In our scenario, when you ask "Could you tell me what's on my shopping list?", the dialogue manager must consider the context of your question. It's not just about identifying the keywords; it also needs to work out where you are in the assumed conversational flow and adjust its response accordingly. The dialogue manager draws on these resources to determine the next step in the conversation, including any clarification questions it may need to ask or additional information to be gathered.
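A minimal sketch of this step might look like the following. The function name, the in-memory data store, and the response wording are all invented for illustration; Amazon's actual architecture is not public:

```python
# Hypothetical backend: in reality this would be a cloud service
# holding the user's shopping list data.
SHOPPING_LISTS = {"self": []}

def dialogue_manager(intent: str, slots: dict) -> str:
    """Toy dialogue manager: combine a parsed intent with backend
    data to produce a text response for the TTS stage."""
    if intent == "read_shopping_list":
        items = SHOPPING_LISTS.get(slots.get("owner", "self"), [])
        if not items:
            return "You have no items on your shopping list."
        return "Your shopping list has: " + ", ".join(items)
    return "Sorry, I didn't understand that."

print(dialogue_manager("read_shopping_list", {"owner": "self"}))
# You have no items on your shopping list.
```

Note that the dialogue manager's output is still plain text; turning it into audible speech is a separate, later stage.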

The State of Conversational Flow
--------------------------------

As the dialogue manager generates its response, it must also consider the state of the conversational flow. This involves understanding the current context and anticipating what might come next in the conversation. The device may need to ask follow-up questions to clarify your intentions or gather more information.

In our example, if the dialogue manager determines that the shopping list is empty, it might ask a follow-up question to confirm, for instance: "Are you sure there are no items on your shopping list?" This approach keeps the conversation relevant and productive.

Text-to-Speech Generation
-------------------------

Once the dialogue manager has generated a response, the text is converted into speech using text-to-speech (TTS) technology. This process involves complex algorithms and models that aim to accurately reproduce human speech patterns. The TTS system must generate speech that is not only coherent but also natural-sounding.

The development of TTS technology is an active area of research, with significant advances in recent years. Its complexity may also explain why synthesis runs as a cloud service rather than on the device: providers want to keep updating and improving the voice, and pushing those updates out to every device would be a problem, so it may well be the synthesised audio itself that is shipped back to the speaker.

Speech Recognition (ASR)
------------------------

Of course, before any of this can happen, the voice assistant must first recognize the spoken words using automatic speech recognition (ASR) technology. The very first step is detecting the wake word ("Alexa"), which happens locally on the device itself; the rest of the utterance is shipped off to the cloud, where sophisticated deep-learning models transcribe the audio into the most likely phrase or sentence, each hypothesis carrying an associated confidence.

The quality of ASR depends on various factors, including the device's microphone quality, the environment in which the conversation takes place, and the availability of training data for the model. While significant progress has been made in this area, there is still room for improvement, particularly when it comes to understanding nuances of language or handling complex accents.
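Since every ASR hypothesis carries a confidence score, picking a transcription amounts to choosing among scored alternatives. The hypotheses and confidence values below are entirely made up for illustration:

```python
def pick_transcription(hypotheses):
    """Choose the most likely transcription from an ASR n-best list.

    Real ASR systems score hypotheses with acoustic and language
    models; here the (text, confidence) pairs are simply given.
    """
    return max(hypotheses, key=lambda h: h[1])

# The "computer file" vs. "Computerphile" ambiguity from the video:
# competing readings of the same audio, with invented confidences.
n_best = [
    ("what is computer file", 0.62),
    ("what is computerphile", 0.31),
    ("what is computer files", 0.07),
]
best, confidence = pick_transcription(n_best)
print(best, confidence)  # what is computer file 0.62
```

This is also where the "computer file" misunderstanding in the video comes from: the acoustically likelier two-word reading wins even though the speaker meant the one-word proper noun.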

Conclusion
----------

In conclusion, the process of interacting with a voice assistant involves multiple stages, from parsing and dialogue management to text-to-speech generation and speech recognition. Each component plays a critical role in ensuring that your conversation flows smoothly and naturally. While we've simplified this explanation for clarity, it's essential to appreciate the complexity and sophistication involved in these technologies.

As research continues to advance, we can expect voice assistants to become even more accurate and responsive. The development of more sophisticated models, better microphone designs, and improved TTS technology will help bridge the gap between human language and machine comprehension. For now, however, the art of conversation remains a fascinating area of study, with much to learn from these innovative technologies.

"WEBVTTKind: captionsLanguage: enalexa how do i add something to my shopping list according to wikihow to make a shopping list first identify a few items you need to that's not what i meant to be things you've recently that's not what i meant like dish soap or shampoo or items you have this is actually a very useful thing if you didn't know how to make a shopping list but it's not to do with my shopping list a piece of paper or on a note on your phone alexa stop today i think we're talking about um well voice interfaces i think and the uh amazon echo which uses the alexa service which i'm sure a lot of viewers will have uh will have used or have one or something like that and we've got it on mute in a minute haven't we yeah just because you know it would be quite irritating if it wasn't on meat especially talking about alexa without actually addressing alexa oh you're going to be annoying a lot of people at home now in our studies actually that the people who are using the devices have to work out ways to talk about it without addressing it so actually that becomes something you've got to do in essence we say something like a question ask for information perhaps you want to play a game or something like that and the device hears it and then it responds and you know gives us the information we want perhaps add something to our shopping list or whatever it might be you know gives you directions for you know going to some place on the map or whatever and that's essentially what happens so why don't you ask something to alexa alexa what is computer file the definition of computer file is computer science a file maintained in computer readable form did that answer your question no well that's a tricky one because there's ambiguity there right because i've asked about what i'm talking about computer file is one word and it's taking it as being two words right yeah yeah so so we think we're in one situation but the uh alexa thinks we're in another situation so and now 
it's just triggered again thanks a lot alexa stop okay i'll mute it in the description i'm going to go through kind of some of the basic things that happened but obviously it's much more complex than this if we give the example of kind of thinking about a shopping list so with a with alexa you can maintain a shopping list you can add stuff remove stuff whatever it might be so what we're going to do is we're going to say something like you know alexa could you tell me what's on my shopping list so this is us saying something uh so you know this is our smiley face speech bubble see if i can draw a speech bubble you know this is a waveform effectively when it's picked up by the device so you know sound the sound wave what the device is going to do it's going to take this stuff and it's going to run it through automatic speech recognition asr and this is detecting what was said essentially the first thing the asr is doing but it's local to the device is picking up the wake word alexa the wake word is the first thing that's being detected so there is some kind of on-board speech recognition going on on the device itself to work out when alexa's being said now the rest of this stuff the kind of the could you tell me what's on my shopping list that's being shipped off to the cloud for speech recognition being done on the rest of this sentence that we're saying to the device and that's kind of passing through these very sophisticated complex deep learning models you know this is i think one of the major innovations of these devices is actually having asr that works you know pretty well i'm not saying it works for everyone but it works uh well for a lot of people at least compared to kind of how things used to be so it's shipping all this stuff off into the cloud you know could you tell me what's on my shopping list it's transcribing those into a bit of text essentially so we've got this bit of text that says could you tell me what's on my shopping list and now we need to 
do something with that text we need to make sense of it in some way and one of the first stages that it goes through is something called natural language processing nlp or natural language understanding and this is taking this text and breaking it up into things that are meaningful from the point of view of the uh the system essentially or from the point of view of the alexa service and that's not going to be everything so some stuff is going to get chucked away so i would guess and this is just my guess that the things that's happening in the in the natural language processing and natural language understanding um elements of the of the kind of the cycle if you like it's things like shopping lists that's something that it knows about perhaps you know could you not really necessarily that useful so probably being chucked away that's almost almost like a politeness yeah yeah yeah yeah exactly um so it's kind of redundant from the point of view of the system although you know if we talk about actual conversation and talk it's certainly not redundant in actual talk it's meaningful it's probably something like tell me is is a significant phrase again that's being passed out and noticed by the parser and maybe some of the other bits about you know my shopping list rather than you know sean's shopping list or something like that so it's parsing these things out the sentence is starting to be decomposed into things that are meaningful from the system point of view after this stage what we're now looking at is you know we've got to do something with these bits of sentence that we've passed and this is obviously a simplification and i'm sure there are lots of other architectures around there and that are different to this one um but i'm just going to go with kind of what i know then there's something called a dialog manager and this thing is taking all those bits the parsed bits we know we know something about you know subject and the object and whatever it might be in this 
case you know the meaningful thing might be shopping list and the kind of command to tell me and the fact that it's my shopping list not someone else's and the dialogue manager is taking all these bits and pieces and it's got to come up with the next response sorry a response to it in this case it might be you know there's nothing currently on your shopping list or whatever alexa actually says as a result of that that command so you know it's got to generate something but in the course of doing so the dialogue manager's got to do all sorts of other stuff it's got to kind of put you perhaps in some kind of conversational flow or like a state it's going to be looking at kind of what stage you're at in the you know current assumed from the system's point of view assumed assumed conversational um you know uh state might be and then it's got to draw on other resources as well with a shopping list it's kind of something about amazon services come into play so you know where it's storing this information i'm no idea where it's storing it in the cloud somewhere but it's retrieving that information about what's actually on the shopping list this is like data it might be other stuff that we're looking up like perhaps web resources you know if you asked about information about a particular topic it's got to scrape that stuff off the web or grab it somehow and it's got to feed that into the dialogue manager which is doing this kind of generating next responses so in my kind of simplified version there's something about the kind of state we're in in the conversational flow perhaps there's more questions that the device is going to ask after this or whatever it might be to clarify things or whatever and there's some other resources that it might be drawing on to feed data into the response and then it's generating response which might be as i said you know um you know you have no items on your shopping list right which is gonna you know that's kind of what it's coming out with 
but then that's just a kind of you know text output it's got to say this so the next stage is to do text to speech so it's got to generate speech based on what this text is your shopping list is empty so there's a whole load of complex stuff around speech generation you know there's a whole whole area of research um about how you actually go about doing that which is again very sophisticated and complex and then it comes out of the device the echo as the response and so we hear it the really heavy stuff is this stuff here like the asr stuff here the speech recognition stuff that's where you need a whole ton of data and quite you know significant models that you've learned which then you know you put in input which is these bits of audio and you get output and that's you know which is these bits of text which are which are uh which map you know there's none there's obviously a confidence associated with that um that's a whole you know massive complex area so that's the stuff that really relies on on you know this kind of um significant computing power um also you know that the parsing they're going to be wanting to update at what time so that has to be a service but yeah in terms of the text-to-speech because they want to kind of update it and change it and pushing out updates the device be a problem so that might be why you you would actually be shipping audio but something we could we could find out or be told by uh by commenters i'm sure the reason for doing this rather artificial example is to say oh dear does this matter we have got a sentence that makes perfect sense to us and one picture of sean right so maybe rob miles gets put over here near me which is not so good but we'll get to that and then you're put over here like thisalexa how do i add something to my shopping list according to wikihow to make a shopping list first identify a few items you need to that's not what i meant to be things you've recently that's not what i meant like dish soap or shampoo 
or items you have this is actually a very useful thing if you didn't know how to make a shopping list but it's not to do with my shopping list a piece of paper or on a note on your phone alexa stop today i think we're talking about um well voice interfaces i think and the uh amazon echo which uses the alexa service which i'm sure a lot of viewers will have uh will have used or have one or something like that and we've got it on mute in a minute haven't we yeah just because you know it would be quite irritating if it wasn't on meat especially talking about alexa without actually addressing alexa oh you're going to be annoying a lot of people at home now in our studies actually that the people who are using the devices have to work out ways to talk about it without addressing it so actually that becomes something you've got to do in essence we say something like a question ask for information perhaps you want to play a game or something like that and the device hears it and then it responds and you know gives us the information we want perhaps add something to our shopping list or whatever it might be you know gives you directions for you know going to some place on the map or whatever and that's essentially what happens so why don't you ask something to alexa alexa what is computer file the definition of computer file is computer science a file maintained in computer readable form did that answer your question no well that's a tricky one because there's ambiguity there right because i've asked about what i'm talking about computer file is one word and it's taking it as being two words right yeah yeah so so we think we're in one situation but the uh alexa thinks we're in another situation so and now it's just triggered again thanks a lot alexa stop okay i'll mute it in the description i'm going to go through kind of some of the basic things that happened but obviously it's much more complex than this if we give the example of kind of thinking about a shopping list so 
with a with alexa you can maintain a shopping list you can add stuff remove stuff whatever it might be so what we're going to do is we're going to say something like you know alexa could you tell me what's on my shopping list so this is us saying something uh so you know this is our smiley face speech bubble see if i can draw a speech bubble you know this is a waveform effectively when it's picked up by the device so you know sound the sound wave what the device is going to do it's going to take this stuff and it's going to run it through automatic speech recognition asr and this is detecting what was said essentially the first thing the asr is doing but it's local to the device is picking up the wake word alexa the wake word is the first thing that's being detected so there is some kind of on-board speech recognition going on on the device itself to work out when alexa's being said now the rest of this stuff the kind of the could you tell me what's on my shopping list that's being shipped off to the cloud for speech recognition being done on the rest of this sentence that we're saying to the device and that's kind of passing through these very sophisticated complex deep learning models you know this is i think one of the major innovations of these devices is actually having asr that works you know pretty well i'm not saying it works for everyone but it works uh well for a lot of people at least compared to kind of how things used to be so it's shipping all this stuff off into the cloud you know could you tell me what's on my shopping list it's transcribing those into a bit of text essentially so we've got this bit of text that says could you tell me what's on my shopping list and now we need to do something with that text we need to make sense of it in some way and one of the first stages that it goes through is something called natural language processing nlp or natural language understanding and this is taking this text and breaking it up into things that are 
meaningful from the point of view of the uh the system essentially or from the point of view of the alexa service and that's not going to be everything so some stuff is going to get chucked away so i would guess and this is just my guess that the things that's happening in the in the natural language processing and natural language understanding um elements of the of the kind of the cycle if you like it's things like shopping lists that's something that it knows about perhaps you know could you not really necessarily that useful so probably being chucked away that's almost almost like a politeness yeah yeah yeah yeah exactly um so it's kind of redundant from the point of view of the system although you know if we talk about actual conversation and talk it's certainly not redundant in actual talk it's meaningful it's probably something like tell me is is a significant phrase again that's being passed out and noticed by the parser and maybe some of the other bits about you know my shopping list rather than you know sean's shopping list or something like that so it's parsing these things out the sentence is starting to be decomposed into things that are meaningful from the system point of view after this stage what we're now looking at is you know we've got to do something with these bits of sentence that we've passed and this is obviously a simplification and i'm sure there are lots of other architectures around there and that are different to this one um but i'm just going to go with kind of what i know then there's something called a dialog manager and this thing is taking all those bits the parsed bits we know we know something about you know subject and the object and whatever it might be in this case you know the meaningful thing might be shopping list and the kind of command to tell me and the fact that it's my shopping list not someone else's and the dialogue manager is taking all these bits and pieces and it's got to come up with the next response sorry a 
response to it in this case it might be you know there's nothing currently on your shopping list or whatever alexa actually says as a result of that that command so you know it's got to generate something but in the course of doing so the dialogue manager's got to do all sorts of other stuff it's got to kind of put you perhaps in some kind of conversational flow or like a state it's going to be looking at kind of what stage you're at in the you know current assumed from the system's point of view assumed assumed conversational um you know uh state might be and then it's got to draw on other resources as well with a shopping list it's kind of something about amazon services come into play so you know where it's storing this information i'm no idea where it's storing it in the cloud somewhere but it's retrieving that information about what's actually on the shopping list this is like data it might be other stuff that we're looking up like perhaps web resources you know if you asked about information about a particular topic it's got to scrape that stuff off the web or grab it somehow and it's got to feed that into the dialogue manager which is doing this kind of generating next responses so in my kind of simplified version there's something about the kind of state we're in in the conversational flow perhaps there's more questions that the device is going to ask after this or whatever it might be to clarify things or whatever and there's some other resources that it might be drawing on to feed data into the response and then it's generating response which might be as i said you know um you know you have no items on your shopping list right which is gonna you know that's kind of what it's coming out with but then that's just a kind of you know text output it's got to say this so the next stage is to do text to speech so it's got to generate speech based on what this text is your shopping list is empty so there's a whole load of complex stuff around speech generation 
you know there's a whole whole area of research um about how you actually go about doing that which is again very sophisticated and complex and then it comes out of the device the echo as the response and so we hear it the really heavy stuff is this stuff here like the asr stuff here the speech recognition stuff that's where you need a whole ton of data and quite you know significant models that you've learned which then you know you put in input which is these bits of audio and you get output and that's you know which is these bits of text which are which are uh which map you know there's none there's obviously a confidence associated with that um that's a whole you know massive complex area so that's the stuff that really relies on on you know this kind of um significant computing power um also you know that the parsing they're going to be wanting to update at what time so that has to be a service but yeah in terms of the text-to-speech because they want to kind of update it and change it and pushing out updates the device be a problem so that might be why you you would actually be shipping audio but something we could we could find out or be told by uh by commenters i'm sure the reason for doing this rather artificial example is to say oh dear does this matter we have got a sentence that makes perfect sense to us and one picture of sean right so maybe rob miles gets put over here near me which is not so good but we'll get to that and then you're put over here like this\n"