The Latency of Conversational AI: A Deep Dive into NVIDIA's Jarvis Framework
In my own testing, latency was very, very low, which is what you'd expect with everything running locally on my machine. The framework is really designed to run at scale in the cloud, though, where a whole bunch of GPUs crunch the models and the network adds its own delay. Even there, according to NVIDIA's data from current customers, total latency is around 200 milliseconds, network time included. That is exceptionally fast for sending data over the network to a server, processing it, and sending back the response.
Now, 200 milliseconds is not instantaneous, and that might seem like a problem for applications that demand immediate responses. However, conversational AI is meant to be conversational, and for virtual assistants in particular, an immediate response would probably tip into the uncanny valley of artificial intelligence. The goal is a reply that arrives at a natural conversational pace, not one that arrives in zero time.
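If you want to check a latency figure like this against your own setup, a small timing harness is enough. The sketch below is not part of Jarvis: `measure_latency_ms` is a helper I made up for illustration, and `fake_request` is a hypothetical stand-in that just sleeps. You would swap in whatever client call you are actually making.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(call: Callable[[], object], runs: int = 20) -> Dict[str, float]:
    """Time `call` end to end `runs` times and summarize the samples in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()  # everything the caller waits for: network plus processing
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "runs": float(runs),
    }


def fake_request() -> str:
    """Hypothetical stand-in for a real request; it only sleeps for about 5 ms."""
    time.sleep(0.005)
    return "ok"


if __name__ == "__main__":
    budget_ms = 200.0  # the round-trip figure quoted above
    stats = measure_latency_ms(fake_request, runs=10)
    print(stats, "within budget:", stats["median_ms"] <= budget_ms)
```

Timing the whole call from the client side matters here, because the 200 millisecond figure includes the network and not just the model.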
To get running with the Jarvis framework, you'll need access to a relatively modern GPU. Jarvis is only configured to work with Volta, Turing, and Ampere-class GPUs, so if you're used to older cards like the K80 on Google Colab, those unfortunately won't work. You'll also need the latest drivers, version 460 as of the release of this video. Updating my drivers, and dealing with all the dependencies that entailed, was the biggest hurdle for me in getting started.
Once you've updated your drivers, you'll need an NGC account (NGC is the NVIDIA GPU Cloud), which takes only a few seconds to confirm. After that, you install a command-line interface tool to handle the downloads and provide an API key, and the supplied scripts do the heavy lifting. You will also need Docker. The Docker integration is what makes things relatively straightforward: all dependencies are bundled into Docker containers, so there is no need to pip install dozens of individual packages.
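Before running any setup scripts, it can save some head-scratching to confirm the command-line prerequisites are actually on your PATH. This is a minimal sketch of my own, not something NVIDIA ships. It assumes the executables are named `docker`, `ngc` (the NGC CLI), and `nvidia-smi` (installed with the driver); `missing_tools` is a helper name I invented for the example.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Executable names are assumptions about a typical install.
    missing = missing_tools(["docker", "ngc", "nvidia-smi"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All command-line prerequisites were found.")
```

The `which` parameter exists only so the lookup can be swapped out in a test. In normal use you would leave it at the default.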
However, there is one catch. It's not a showstopper, but loading state-of-the-art deep learning models for natural language processing requires huge amounts of VRAM. My modest 2080 Ti, with its paltry 11 gigabytes of VRAM, simply isn't enough. The automatic speech recognition functionality still runs on it, and in my testing it worked flawlessly, but you'll need at least 16 gigabytes to load the natural language processing models and enjoy the full functionality of the Jarvis framework. For NVIDIA's customers, large corporations and big research teams that have access to cloud GPUs anyway, this is not a significant issue.
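The driver and VRAM requirements above are easy to check in a script. The sketch below parses the CSV output of `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader,nounits` and flags each GPU. The thresholds are my own assumptions, not NVIDIA's: `nvidia-smi` reports memory in MiB, and cards sold as 16 GB often report somewhat less than 16384 MiB, so I use a looser cutoff of 15000. It does not check the GPU architecture.

```python
import subprocess
from typing import Dict, List

MIN_DRIVER_MAJOR = 460  # driver branch mentioned above
MIN_VRAM_MIB = 15000    # assumed cutoff for a "16 GB" card, see the note above

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text: str) -> List[Dict[str, object]]:
    """Turn nvidia-smi CSV rows into dicts with pass/fail flags per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so an unusual GPU name cannot break the parse.
        name, driver, memory_mib = [f.strip() for f in line.rsplit(",", 2)]
        gpus.append({
            "name": name,
            "driver_ok": int(driver.split(".")[0]) >= MIN_DRIVER_MAJOR,
            "vram_ok": int(memory_mib) >= MIN_VRAM_MIB,
        })
    return gpus


if __name__ == "__main__":
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available, so the GPU cannot be inspected.")
    else:
        for gpu in parse_gpu_rows(result.stdout):
            print(gpu)
```

Keeping the parsing separate from the `subprocess` call means you can test the logic on a machine without a GPU by feeding it a sample string.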
But for enthusiasts like me, getting access to that much VRAM can be a challenge, although you can always rent time on a cloud GPU provider to get a model running and then scale up as needed. I've spoken to the engineering team, and they're aware that this is a steep requirement for enthusiasts. They're actively researching ways to optimize these models to run on a smaller footprint, but this is an open beta, so those optimizations will come later.
To get around this limitation, I'm planning to test the Jarvis framework on a loaner Titan RTX that NVIDIA has provided for this purpose. Unfortunately, I'll need to return it after my tests are complete. However, I'm excited about the potential of this technology and plan to share my findings in future videos.
In conclusion, conversational AI is a rapidly evolving field, and NVIDIA's Jarvis framework is at the forefront of it. There are some hurdles to clear, especially the VRAM requirements, but the payoff can be significant for those who are willing to invest in the necessary hardware and software.
Has NVIDIA done to deep learning development what the invention of code editors in the 1990s did to software development? Maybe, so let's find out. Before we go any further, I have two things to ask of you. First, if you're into artificial intelligence, deep reinforcement learning, all things deep learning, hit that subscribe button so you get notified when I release new content. Second, you definitely want to check out the GPU Technology Conference happening this April, from April 12 through 16th. It's totally free, it's online, you don't even have to leave your seat. It features speakers from a wide variety of topics, things like artificial intelligence, deep learning, high performance computing, data science, augmented reality, virtual reality, all kinds of awesome stuff. Go ahead and check out the link in the description below to register today.
So what exactly is going on here? About two weeks ago a representative from NVIDIA reached out to me and said, "Phil, we are fans of the channel. We've been working on something really big and we would love for you to check it out and share it with your audience." And so today the embargo lifts on the open beta of Jarvis. Jarvis is kind of like a framework. It's really more of a fully integrated end-to-end pipeline for natural language processing applications. As of today it handles everything from text transcription, punctuation, and virtual assistants to question answering applications, right out of the box. Being a beta, more is planned. In particular, they are planning a whole host of vision applications as well, so it won't be just limited to natural language processing.
Now, when I say end to end, I actually mean it. It handles everything from the audio input, the audio processing, the audio to transcription, the text to speech, and the natural language processing applications in between. It does all of it. And that's why I kind of address the question of whether it can do for deep learning development what code editors
did for software development in the 1990s. Maybe you remember the days of software development before code editors, where you had to handle things like linking, compiling, looking up documentation, syntax checking, all that stuff on your own. When they invented code editors, all that got integrated into one simple-to-use platform, and that really sped up the entire process of code development by probably an order of magnitude. Jarvis is really kind of an analog to this, where they have integrated all the stuff you're going to need to deploy state-of-the-art deep learning models under one umbrella. It handles data input, feature extraction, pretty much anything you really need to get something going right out of the box, and it does it with minimal code. I've been playing around with this in my free time, and there really isn't a whole lot you have to do to get it to work.
Now, NVIDIA does have a couple of very brief demos to show you the basic functionality of it, so let's go ahead and take a look at those.
Well, if you like, I can set you up with an optometrist to get an eye exam. That would be great. Okay, no problem, we'll get you started with that. I see Dr. Lee's office is close to your home. She's really well rated in our network.
Is it snowing in Chicago right now? In Chicago it is currently not snowing. I'm actually planning to visit there. That sounds like a lot of fun. What else are you planning to do while you were there?
So in my mind the biggest game changer of this particular framework is the fact that it has pre-trained state-of-the-art models. These are trained on 7,000 hours of both publicly available and proprietary data, so you get access to highly trained, state-of-the-art natural language processing models right out of the box. There's no need for you to go out and gather data and train models from scratch. Of course, having a state-of-the-art model isn't necessarily enough. You will have to do some fine-tuning for whatever
domain-specific applications you are working on, so they've included integration with the Transfer Learning Toolkit. You can do some basic data gathering and data augmentation to take a relatively small data set, maybe tens to hundreds of hours of your own data, and feed it into the models to fine-tune them for whatever niche application you may be working on.
Speaking of niche applications, that brings us to some basic use cases. One personal example is that I used to be a process engineer at Intel, and my whole first month of that job was spent trying to get up to speed with really understanding what people were saying. I'm not exaggerating when I say it was almost like a foreign language. You get dropped in a foreign land, right into the deep end, and you have no idea what people are saying. Jargon is all over the place, as well as acronyms. There were so many acronyms for everything that you really needed a decoder ring to figure out what people were saying.
Now, like all mega corporations, Intel loves meetings. Meetings about meetings, sometimes meetingception even. So what do you do when you have meetings? You have someone taking minutes. Now imagine this: you have spent the first 30 years of your life getting an advanced degree, cultivating an elite-level intellect, and you get stuck into a meeting and told to take minutes. This is something that has to be done, obviously, but it really isn't the best use of an elite-level problem solver's, or even a recent college graduate's, intellect, right? You're there to solve problems, to help drive money to the bottom line of the company, either cutting costs, improving throughput, reducing defects, whatever. Anything except writing down what other people are saying. And so it's not hard to imagine an intrepid Intel process engineer taking the Jarvis system and then incorporating some Intel proprietary data to train the system to understand
exactly what is being said in the meeting and then transcribe that word for word. That has the additional benefit that you don't have to worry about a human being sitting there trying to type, keep up with what's going on, and think of ideas for how to contribute. Because let's face it, you can't put "transcribed meeting minutes" on your brag sheet at the end of the year to get a raise. You have to actually solve real problems. So you have to be kind of paying attention, you're probably thinking about picking up your kids in the evening or whatever else it is you have to do, and so you're quite likely to miss something that may be germane to the discussion at hand. Having an automated system totally does away with that entire issue. You don't have to worry about human attention. You don't have to worry about anything except letting the machine do the dirty work of transcribing the meeting minutes.
Now, it's also not hard to imagine bolting on your own application for, say, text summarization. You have the system transcribe all the meeting minutes, accurately reflecting the jargon and the acronyms used, and then you have another bit of your own code that you integrate with this that can perform text summarization, so you only capture the really salient points. Or maybe not, maybe you want the whole meeting minutes. It's really up to the engineer at the end of the day, but it's nice to have the flexibility that Jarvis provides to integrate your own code with the state-of-the-art models to create new applications for the NLP software.
Now, real-world applications wouldn't be possible if the latency stretched into the seconds-or-worse regime. My own playing around with it suggests that the latencies are very, very quick, which you'd probably expect running locally on my machine. But NVIDIA's own data is with their current customers, where it's running in the cloud, which is really where it's designed to be used, right? When you want to
run it at scale you've got to have a whole bunch of GPUs crunching these models, but even then, including the network latency, from what I understand the total latency is around 200 milliseconds. That is exceptionally fast for sending data over the network to a server to process it and send back the response. Now, it's not instantaneous, but when you think about it you probably wouldn't want it to be instantaneous anyway. The whole point is this is supposed to be conversational AI, right? In the case of virtual assistants particularly, you want to be conversational, and having an immediate response would probably be a little bit too much into the uncanny valley of artificial intelligence.
Now, you might think that this is a pretty big pain in the neck to get running, and you would be wrong. In fact, this is actually relatively straightforward. You only need a few things, the first of which is access to a relatively modern GPU. This is only configured to work with Volta, Turing, and Ampere-class GPUs, so if you're used to using the K80 GPUs on something like Google Colab, then unfortunately that won't work. You will need the latest drivers, 460 as of the release of this video, so make sure to update. That's something I had to do. The biggest hurdle to getting this running was updating my drivers and dealing with all the dependencies that entailed.
Other than that, you make an NGC account, which is the NVIDIA GPU Cloud. That is pretty straightforward. It just takes a few seconds to confirm your registration, and then you can actually get access to these scripts. You'll need to install a command-line interface tool to do the downloads and provide an API key, so they know you're not just some geek off the streets, and then you are off and running. The scripts do all the heavy lifting for you. Oh, you will need Docker, of course. Part of the best part is that they use Docker integration, so you don't have to worry about downloading all the dependencies, you don't have to
do pip install 50 different packages. It's all bundled in Docker, so that takes care of that for you.
Now, there is one catch, and it's not a showstopper, but it is something to be aware of. It turns out that loading state-of-the-art deep learning models for natural language processing requires huge amounts of VRAM. My peasant-tier 2080 Tis, with their paltry 11 gigabytes, simply aren't enough. It turns out you need 16 gigabytes of VRAM to load these models to enjoy the natural language processing functionality. Now, for NVIDIA's customers this isn't a huge deal, because large corporations and big research teams are going to have access to cloud GPUs anyway, so that's not really a problem. If you're an enthusiast like me, then having access to a high-end GPU may be a little bit of a difficulty, but you can always rent some time on a cloud GPU provider to get a model running there and then scale it up as you need for your end application. You can still use the automatic speech recognition functionality, which I have done. It works flawlessly, it's as good as they say it is, it's pretty cool stuff. But to get access to the full functionality of Jarvis you do need something a little bit beefier, with 16 gigabytes of VRAM or above.
Now, I've spoken to the engineering team, and they are aware that for enthusiasts this is kind of a steep requirement, so they are spending active research hours on trying to do some, shall we say, optimizations to the model to get it to run on a smaller footprint. But it's an open beta, and so that functionality will come later.
But have no fear, I'm going to do a full test on the Jarvis framework for you. NVIDIA was kind enough to provide me with a loaner Titan RTX. I do have to send this back, unfortunately, but I'm going to be using it in the coming days and weeks to put the framework through its paces and report back my findings to you. I have a few pretty cool ideas for projects in mind. Hopefully I can get those off the ground to show
you the end result.
All right, so that's all I have for today. Keep in mind, in the future I'm going to have actual evaluations of the framework with my own code, so stay tuned for that. Any questions, comments, leave them down in the description below. Make sure to subscribe if you haven't already. Leave a like, or heck, even a dislike if you don't like the not-so-subtle Titan RTX flex. I wouldn't blame you. I'll see you in the next video.