Generative AI on the Edge with Vinesh Sukumar - 623

The Evolution of AI at the Edge: A Conversation with Vinesh Sukumar

As we continue to push the boundaries of artificial intelligence, it's becoming increasingly clear that much of AI's future lies at the edge of the network. In this conversation, host Sam Charrington and Vinesh Sukumar, head of AI/ML product management at Qualcomm Technologies, explore the latest developments in AI at the edge, from how computational graphs are executed on device to the emergence of "AI 2.0".

We began by discussing the concept of a computational graph, a fundamental building block of machine learning frameworks such as TensorFlow and PyTorch. According to Vinesh, a graph can be constructed from a single model or from multiple models, which allows for greater flexibility and scalability: the model or program is effectively sharded into smaller portions that can be executed in parallel. Sam likened this to traditional distributed computing, except that instead of spreading the workload across multiple machines, the work is spread across specialized cores within the device itself.
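To make this concrete, here is a minimal sketch of what a graph built from multiple models, and then sharded for parallel execution, can look like in a framework such as PyTorch. The module names (ImageEncoder, TextEncoder, FusionHead) and the thread-pool dispatch are illustrative assumptions rather than anything Qualcomm-specific: two sub-graphs with no data dependency are submitted concurrently and their outputs fused.

```python
# Illustrative sketch only: a "graph" composed of multiple models, with the
# independent sub-graphs executed in parallel before their results are fused.
# The module names and shapes are hypothetical, chosen just for the example.
from concurrent.futures import ThreadPoolExecutor

import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
    def forward(self, x):
        return self.net(x)

class TextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.EmbeddingBag(1000, 8)
    def forward(self, tokens):
        return self.embed(tokens)

class FusionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)
    def forward(self, img_feat, txt_feat):
        return self.fc(torch.cat([img_feat, txt_feat], dim=-1))

image_branch, text_branch, head = ImageEncoder(), TextEncoder(), FusionHead()

def run_sharded(image, tokens):
    # The two branches have no data dependency, so they can be dispatched
    # concurrently; many PyTorch ops release the GIL while they compute.
    with ThreadPoolExecutor(max_workers=2) as pool:
        img_future = pool.submit(image_branch, image)
        txt_future = pool.submit(text_branch, tokens)
        return head(img_future.result(), txt_future.result())

out = run_sharded(torch.randn(1, 3, 32, 32), torch.randint(0, 1000, (1, 6)))
print(out.shape)  # torch.Size([1, 4])
```

On an edge SoC the same decision is made at a lower level by the runtime, which routes each portion of the graph to whichever core suits it rather than to Python threads.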

Parallelizing work at the edge is made possible by specialized hardware: devices that contain multiple smaller cores tuned for scalar, vector, and matrix operations, making it possible to distribute workloads according to what each one actually needs. As Sam summarized, and Vinesh confirmed, the work isn't just parallelized across a homogeneous infrastructure; it is also specialized, with each piece routed according to the type of function being invoked at a given time.
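The routing Vinesh describes later in the conversation can be pictured as a small dispatch table keyed on the kind of operation. The sketch below is a conceptual toy, the core names and classification rules are invented for illustration only, but it captures the idea of steering matrix-heavy work to one unit and element-wise vector work to another.

```python
# Conceptual toy: route each operation to a "core" based on the kind of
# function it represents. Core names and the classification are assumptions
# made for illustration, not a real device API.
import numpy as np

DISPATCH = {
    "scalar": "scalar-core",   # control flow, bookkeeping
    "vector": "vector-core",   # element-wise tensor math, activations
    "matrix": "matrix-core",   # matmuls, convolutions
}

def classify(op_name: str) -> str:
    if op_name in {"matmul", "conv2d"}:
        return "matrix"
    if op_name in {"relu", "add", "mul"}:
        return "vector"
    return "scalar"

def execute(op_name: str, *args):
    kind = classify(op_name)
    print(f"routing {op_name!r} ({kind}) -> {DISPATCH[kind]}")
    if op_name == "matmul":
        return np.matmul(*args)
    if op_name == "relu":
        return np.maximum(args[0], 0.0)
    return args[0]  # placeholder for ops not modeled here

a, b = np.random.randn(4, 8), np.random.randn(8, 2)
y = execute("relu", execute("matmul", a, b))   # matrix-core, then vector-core
```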

This approach has significant implications for the development of AI models that can be executed in real-time, without requiring extensive computational resources. By leveraging the power of specialized hardware and parallel processing, we can create more efficient and effective AI systems that can handle a wide range of use cases.

One area of particular interest is the emergence of "AI 2.0", the next generation of AI capabilities that promises to change the way we interact with machines. According to Vinesh, AI 2.0 represents a significant leap forward in understanding and context, allowing for more personalized and relevant responses to user requests.

"We're seeing some amazing developments in this space," Sam said. "Many startups are already leveraging OpenAI's API to create new and innovative applications. It's exciting to think about the potential implications of AI 2.0 on our daily lives, from more accurate language translation to more intuitive interface design."

Beyond its practical applications, AI 2.0 also raises important questions about the future of work and the role of machines in society. Machines are becoming capable of understanding context and nuance in ways that were previously thought impossible, and that shift has implications for everything from education to employment.

As we move into this new era of AI development, it's clear that the edge will play an increasingly important role in shaping the future of machine learning, with specialized hardware and parallel processing doing much of the heavy lifting.

One area of interest is Qualcomm's micro tile inferencing, in which a large computational graph is broken into smaller tiles and each tile is routed to the processing unit best suited to it, scalar, vector, or matrix. Sam likened the approach to running a tiny Kubernetes, or Kubeflow, on the chip itself. Compared with loading and running an entire graph in one pass, tiling cuts both latency and power consumption, making it practical to deploy complex AI models in real time.
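The transcript below goes into more depth, but the core idea can be shown with a toy example: split one large operation into independent tiles and let a pool of workers run them concurrently. The tile size, the use of a matmul, and the thread pool are all assumptions made for illustration; the real system schedules tiles onto dedicated scalar, vector, and matrix units rather than CPU threads.

```python
# Toy illustration of tiled execution: one large operation is split into
# independent row tiles that workers can compute concurrently. This is not
# Qualcomm's micro tile inferencing implementation, just the general idea.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile_rows: int = 128) -> np.ndarray:
    """Compute a @ b by splitting `a` into row tiles executed in parallel."""
    tiles = [a[i:i + tile_rows] for i in range(0, a.shape[0], tile_rows)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda t: t @ b, tiles))
    return np.vstack(partials)

a, b = np.random.randn(512, 256), np.random.randn(256, 64)
assert np.allclose(tiled_matmul(a, b), a @ b)
```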

The future of AI at the edge looks bright, with many exciting developments on the horizon. As Vinesh put it, "it's an exciting time, and we truly believe that consumers will see fantastic new experiences as a result."

"WEBVTTKind: captionsLanguage: enall right everyone welcome to another episode of the twiml AI podcast I'm your host Sam charington and today I'm joined by vaness sukumar vanesh is head of AIML product management at Qualcomm Technologies before we get going be sure to take a moment to hit that subscribe button wherever you're listening to Today's show vanesh Welcome to the podcast thank you Sam for providing me the opportunity to be in your talk show uh I'm super excited to jump into our conversation as we all know AI has been evolving very quickly and uh we're g to dig into how that evolution is impacting AI at the edge and the various uh use cases that are supported by Edge Technologies before we dive into that topic I'd love to have you share a little bit about your background and how you came to work in machine learning and AI yeah sure absolutely um I've been you know in the AI space for I would say give or take about 10 15 years um started my career working for jet proportional Labs those days was mostly looking at you know uh image classification object classification but under the nomenclature of uh computer vision um and those days it was all about looking at big charts and able to classify know content based of the captures with time transition more towards consumer electronic field like uh Mobile cell phones working on cameras and then you know with time transition over to PC platforms and then to to Automotive platforms and then finally at Qualcomm had the opportunity to look at AIML more horizontally across many verticals from Mobile AL the to automotive and you know the challenge with AI is what you know today is uh obsolete tomorrow so you need to be a student at heart all the time and learn new things and that keeps me quite interested while working at foro well I think that theme of change is going to be a big one uh in our conversation in our previous chat one of the things you mentioned uh that struck me was just how the evolution in uh use cases and uh deep learning architectures as just a couple of examples you know all these things impact the hardware architectures that are used to support these kinds of uh use cases and Technologies on the edge can you talk a little bit about what you're seeing um from a a use case Evolution perspective yeah absolutely a great question by the way historically when you look at AIML on the edge to a large extent was quite small in nature uh was very much evolving around um image uh and video wherein you want to look at image enhancements um image modifications background segmentation classification detection those kind of use cases but as um the technical acumen a lot of contributions from research scientists ecosystem acceptance of AI started to Mo you could see that um the evolution of a use cases started touching other modalities uh from image and video moving on to text uh moving on to Linguistics moving on to Commerce and many other segments so as such uh when you put this totality into picture historically to a large extent all the AI algorithms was most convolution heavy which was mostly reflecting on CNN's um because it was quite good in nature at that point of time but as you get in more modalities into the picture to support use cases or or create key experience indicators now the investment is moving towards I would say Transformers as an example um then moving on to generative content which has been quite a large of Interest these days recommendation engines which has been quite popular in social media and in Commerce 
platform forms the question really becomes as you know is your architecture a uh generic enough to be able to support new and upcoming architectures which as is it only meant for you know for convolutions or can it go beyond convolutions into complex architectures like large language models and recommendations engines which have been quite popular these days and or do you want to focus on architectures which can do you know very specific tasks they have been excellent in performance excellent in power efficiency see those kind of stuff so these are some of the questions and debates we ask ourselves we try to enable both and depending upon the vertical of Interest vertical elment is a specific form factor of Interest could be Mobile in nature could be Automotive in nature we try to make sure you know those Investments kind of scale accordingly yeah I'd love to to have you kind of dig into some of those tradeoffs when I think about um you know what I know about the way Qualcomm approaches enabling AI on device from you know many prior conversations one of the themes is you know really understanding the architecture and optimizing around uh that architecture which you know as you're you're suggesting here um you know maybe an architecture or maybe a a device has some special accommodation for convolutional neuron networks that allows them to operate you know really quickly and with a lot of efficiency uh and then the predominant architecture shifts uh and um you know there's something else uh that you need to to think about um you at the same time a lot of the the conversation around optimization is is uh at a a bit of a lower level than that you know optimizing matrix multiplication and um you know things like the the um the the resolution of your integers and things like that uh so maybe talk us through kind of how you've approached the the balance in the past and uh when a shift like this happens do you just optimize around the new thing or are you you know kind of creating something that works across the new and the old yeah that's a fair question this is something you know we could constantly challenged by Partners as well so historically you know the way we trying to approach these use cases or applications is try to understand what are the key performance indicators they could be applications which are anchored on latency anchored on performance anchored on quality of service I power efficiency or all the above right so as such um when you have you know a certain use case you try to get into certain system decomposition the decomposition really means is if I invoke a certain application I'm able to meet the kpis of interest at an application layer that kind of answers the question by itself is what is needed from a software perspective and what is needed from a hardware perspective um now the hardware is obviously a little bit more challenging is because we want to kind of think two or three years in advance right to re do we have the right investments in place to really push for it and software is a constantly evolving topic to say what kind of optimizations is necessary at the library level at the kernel level at the compiler level to be in a position to generate you know machine readable code that is easy to execute so let's start off with you know on the hardware perspective from Hardware perspective you know I would say some of the key most important elements say do we have the U the right uh data types in qualcomm's perspective our position and push has always been we want to 
move towards fixed Point as we believe you know it gives you the highest performance uh uh quite uh you know um leadership class power efficiency and also a very small memory footprint compared to that of a floating Point uh representation of the same model so as such you know we continue to push the boundaries of where we invest in data types historically it was Flo and then we translation to inate and we now moving towards int for representation right so that's one one perspective the second thing is um you know what would be the uh the Investments necessary from a compute standpoint from a local tightly coupled memory standpoint with the expectation that you could you know you know store the bias functions within the title couple memory rather than making round trips to damp because you want to see on power efficiency as an example right so there's always a question of you know what what kind of compute uh is actually you know the right amount of investment so here we start looking at use cases and then try to understand what is the state of art architectures really supporting those kind of use cases right and then we try to construct the investment in compute around that now last but not the least is elements around um bandwidth you know do we have competition schemes that would really push for high INF per second uh you know how is the U register spacing established what kind of sparcity or intelligence needs to be established as part of the hardware so that you don't really spend cycles of time in trying to do zero based computations so all these are feature enhancements that we really push for uh you know from a hardware perspective to really you know drive for leadership then you go towards the software level now software level is obviously complicated here um uh we can start off with quantization quantization is a metric that we heavily push for wherein we can actually provide the opportunity to our partners to transition toward it's um you know uh fix point in 8 Bits and four bits for both activation and weights so you know we have a tool called AIT and we're putting a lot of energy on AMT which is artificial intelligence model efficiency toolkit that provides the flexibility to our partners to not lose an accuracy while getting higher performance right then we move on to our AI stack our you know which we recently uh announced in a couple of months ago last year which we call as a qualcom stack and as part of it we have the runtime we have two tools um we have libraries and the lowlevel API structure that is flexible enough to execute this model or desired application on various OS could be on Windows could be on Linux could be on Android those kind of subsystems here you know uh there's a lot of emphasis on how do we optimize on performance now performance is a very broad term so we look at if I happen to have concurrency you know how can I pack a single graph that is comp you know made up of composite smaller graph into uh you know into our Hardware mechanics and then execute them as an example second thing is um if I have to you know securely store the model in in this case you know what kind of encryption schemes needs to be supported to make sure somebody doesn't hack into the model of Interest the other could be is what kind of preemption schemes I would have if you happen to have concurrency can I you know support all these graphs simultaneously without compromising and quality of service so all these are investments that we really push for working with our engineering 
staff uh you know these features get updated on a regular basis to really make sure we can you know push for much larger deployment of applications which are quite complex in nature across all bu and one good thing is uh since we have a unified qualcom AI stack that works from mobile all the way to Automotive the lessons learned from Automotive get propagated to mobile and from mobile get propagated to Auto as well that way there is a lot of handshake of good information that that way the applications can be deployed across all form factors with with these you mentioned uh mobile and auto what are the specific use cases you're seeing in each of those domains you know as the shift from convolutions to Transformers uh takes hold are there any in particular the jump out is kind of driving the trend historically when you look at mobile as an anchor point and you look at as a consumer buying pattern uh to a large extent people are buying phones because of good camera so which means most of your AI analytics is revolving around image or revolving around video right right so they are to a large extent convolution heavy now people are slowly you know trying to bring in text as an input and then fuse image with text where Transformers is becoming a quite an important element of interest and then try to see can I use the um you know the sequential data that is embedded as part of a transformer Network to understand context and make a much better prediction so we kind of you know revolving around text or an image and quality but with different classes of networks which are not necessarily Evolution heavy but more towards Transformers right so that's I would say to a large extent um an anchor point for mobile now obviously in Mobile there's a bigger challenge is you know it operates on battery so our expectation is we really give this performance at much lower power uh so that you don't drain the battery as much as possible and have a much smaller memory footprint because know you're operating within a mobile device but as you move towards Automotive is a completely different space here concurrency is a big deal you're trying to work with 50 16 different sensors could be camera sensors liar radar how can you fuse them together to actually provide you a quality of service latency is a much bigger deal uh here the physical uh input tensors or the resolution of the camera is much bigger right either you have 1 megapixel 3 megapixel 5 megapixel 8 megapixel and on the mobile side the input tensors are very small in nature they're probably 512 x 512 256 or 225 by 225 kind of scenarios but on the automotive side because you need to have a much larger field of view you want to make predictions or decisions uh much more stringently you take a much larger input tensor so the physical compute requirements is much more intensive compared to that what you would see from a mobile environment standpoint um and and not to mention is you know you're going to see also the ask for one deep uh or one one AI model that does multiple stuff it could do it could look at Drive policy as an example it could look at parking stack as an example but you have the same infrastructure to really make a lot of those predictions so um a different set of challenges from an automotive standpoint compared to that of mobile but you know uh having a unique and I would say a consistent stack both in the mobile and Automotive front has really helped us really you know accelerate I should say uh new operators new complex models so that our 
developers can you know really use this investment and then drive Innovative applications when you speak about the compute requirements on the auto side I'm wondering if you you are familiar with this I saw an article just the other day that referred to an Mit paper that uh you know they did did an analysis uh a high level analysis but the uh conclusion that they came to was that at full deployment of fully autonomous driving you know when uh the dominant mode of transportation is you know autonomous vehicles they Pro ejected that the compute requirement uh was pretty much on par with all of the data centers today uh have you seen have you seen that paper by any chance not necessarily but I think in the gist of what they're probably claiming as you're you suggesting is the compute requirements within a space you know historically when you look at Automotive there's an infotainment site which is totally in Cabin entertainment purposes and then there's an Adas site Adas obviously goes by L1 to L5 depends depending upon the levels of autonomy right and I believe they were specifically referring to getting to you know a you know full distributed Fleet of level five autonomous driving Fleet doing that uh to a large extent you know having both infotainment and L level of autonomy Services is quite complex in nature and one layer on top of it is the Fleet Management is how can you have multiple trucks as an example or multiple Robo taxis being you know efficiently managed from one Control Center and all this information handling is going to be quite complex and I wouldn't be surprised you know if you really looking at an infrastructure that powers the data centers today because I think that's where you're going to go towards as fully autonomous vehicles of Fleet Management takes a little bit more you know I would say broad acceptance by you know I would say by consumer space yeah so it sounds like uh that doesn't that level of scale required you know doesn't surprise you given you know what you know about uh Adas and and those requirements absolutely I I think you know we've been doing that to some extent uh historically as well and you look at our history we started off small with a much smaller footprint occupying in the mobile space with time for the last five to six years we kind of expanded to adjacent business units could be ar VR compute uh iot and automotive and with each adjacent BOS getting a lot more fraction in AI the requirements have actually scaled up much much more than you know that we had expected so to answer your question I've been surprised not necessarily we've been planning all all along we just have to make sure is the time to market the time to really accelerate solution deployment can really be much more faster because the rate of innovation that's happening for example in automotive space is enormous right I see a lot of L2 L3 kind of vehicles coming out this year and next year as people start transitioning to level four and level which is full autonomy at that point of time do we have the necessary infrastructure to take you know actions based on the computer Investments and software Investments you're going for it and that's what we've been trying to push for to make it happen yeah so we talked about uh mobile we've talked about Automotive uh uh a third area that historically has been uh interesting for Edge use cases is Enterprise and kind of distributed you know things like distributed cameras um is that still an interesting use case or growing use case yeah so it 
so when you look at form factors as such um Automotive is one extreme Enterprise could be seen as um something like a PC computer devices as an example could also be seen as uh personalized devices like vable hables as that market takes over and um PC as a fo factor is gaining a lot of popularity within the Enterprise segment for good or bad reason postco only because work from home has become quite important um and uh the need for AI in PC is just soed post the pandemic situation uh few examples in this case would be you know uh how can I improve video conferencing as an example how can I improve streaming as an example within the Enterprise space as well when streaming uh historically when you look at it there's been Enterprise streaming game streaming Enterprise streaming always used to be on a reference platform but post pandemic now people want to be doing it from their own homes the question has really become is are we in a position to support an infrastructure that supports Enterprise streaming and what kind of firm Factor will push for it and you know and PC has been a fantastic firm factor to really drive it and most of our engagements moving forward you're going to see a lot of big announcements come from that front and how we really positioning um I would say PC as a fo factor to drive Enterprise scale experiences that's interesting I hadn't thought about the you know the PC um experience in that question I was mostly thinking of these use cases like I guess what you might call more embedded use cases like distributed cameras and distributed sensors and and that kind of thing do you see a lot of growth in in those use cases absolutely um you know I would probably position it this way when you look at AI inferencing to a large extent is very generic in nature uh you know what applies for me applies to you in any given context there is no personalization element to it so the question really comes is can I collect more data about the user to drive personalization and this could be coming from sensors registered to the user could be cameras registered to the user as such if I get more pieces of information it could be visual in nature could be textual in nature could be audio specific information can I make that inference on the edge for the user more optimal right and this is something that's ging a lot more popularity but the biggest challenge has been as part of this us use case enablement is uh how much data is good data what data is good data uh how do I enable uh some level of uh supervision to make sure the data that's been collected is also labeled properly because if we don't have labeled information I can't engage in retraining if I have to do retraining on the edge uh what kind of constraints I would have on the hardware software side to be in a position to encourage uh learning or retraining so all these big questions usually come up and uh the ecosystem in general is trying to understand is uh how can they scale moving forward and this is you know one such fantastic example as uh collecting the right amount of data and then driving towards personalized inference moving forward interesting that touches on themes of data Centric AI which uh is a big conversation in the industry kind of shifting from you know masses of lower quality data to small amounts of higher quality data and uh putting a lot of care in place around how that data is curated uh so as to make more efficient use of it in training and tuning uh which seems right in line with the kind of thing you'd want 
to worry about if you're dealing with uh with Edge type devices I would say absolutely you know industry as such is kind of morphing away from a model Centric View to a data Centric view but if you transition to a data Centric view the question really becomes as um can I have Investments across the entire picture of mlops from data annotation all the way to data monitoring and especially on the edge if you're going towards more user based personalization it's extremely critical to have elements of of uh you know um collecting the right data labeling it filtering it and training on the edge uh and then once you enable inference you want to also monitor it you know are you seeing some kind of drift what kind of drift are you suggesting um and can I make any compensation based on the drift so yes absolutely becoming more critical how we go make it happen on the edge is always a big question mark because uh the physical um structure of edge devices is very different not all devices can can support it and the question really becomes is can I run partially on the cloud can I run partially on the edge but these are Big questions that the industry is facing today and I'm pretty sure in the next six to nine months we'll have opportunities out there where people are going to say guys AI is actually becoming more meaningful me because now I'm engaging in more an extensive mlops investment rather than a very toned down generic with no common sense AI inference that exists today when you think about what that mlops footprint looks like like uh kind of this Edge aware mlops footprint what's different about it from the way folks are are doing mlops kind of in in non-edge environments so when you look at the mlops as it exists today historically you know um well I would say two or three scenarios one you have a pre-trained model uh the pre-trained model either has been available in open source Community or somebody's already developed it you take the pre-train model you quantize it you fit it to the envelope of your Edge device and you invoke it and you get it running now uh is a drift you see drift over time nobody monitors it not nobody cares about it and it's already completed the second category I would say is wherein you know um it you don't have a pre-trained model you want to start from the beginning uh and this is a classic example like an infotainment use case within an automotive segment an automotive segment you want to look at let's say a driver monitoring system you want to understand the emotional state of a driver to take certain actions uh and maybe take control over him because if is not in the right state of mind you know you don't want him to encourage driving but understanding the emotional state of a driver means you need to understand or make accurate predictions uh but for that you need access to data today you don't really have too many data points so here there's a lot more emphasis on creating synthetic data to kind of really create a scenario that would show the emotional state of a driver synthetic data was not popular I would say three or four years ago but now it's becoming extremely popular because you don't have enough data to create the right uh model so once you happen to have that again then you get um a pre-trained model you emphasize you get it running and you're fully functional right so the first two is to a large extent mostly popular in the industry today but the third portion is again uh is how do I constantly learn how can I make sure that the model I've de Ed 
from a synthetic data source or a pre-trained data source can evolve with time as I understand the user much more diligently moving forward that doesn't happen so the question really becomes is how can I make it happen I think um over a couple of years ago um you know Tesla kind of introduced this fact wherein um you know they used to take elements um about the road and you know the scenarios that they're trying to evolve and work on and then they used to send that to the cloud optimize a model and then invoke it back to the device a similar concept Google talked about it what we call as Federated learning a couple of years ago similar Concepts what you're trying to see is you know if that information which is very prevalent to the user you don't want that information to leave the device so what this infrastructure that existed before in the cloud can I bring it to the edge and constantly make the model learn so that it commits less and less mistakes moving forward and on top of it uh can be uh can accommodate user specific information so that's the big difference moving forward from from an industry standpoint you you spoke a bit about these metrics that you are tracking and caring about when you're looking at edge cases and trying to map them onto Hardware architectures uh things like latency and bandwidth and all these things to what degree has you do the each of these architectures kind of Drive their own distinct usage patterns like uh you know we used to think about the kind of the cloud as you know you want you want to kind of provision uh in a hybrid type environment at least like provision for your sustained usage locally and then be able to burst to the cloud for Peak usage is there that kind of scenario uh or thinking uh in place when you're thinking about Edge I would say to a large extent um when you look at Edge by definition has a very wide um footprint the footprint could be in terms of compute in terms of bandwidth in terms of memory in terms of power which really would put a constraint of what kind of inference is possible on the edge right that's one thing the second thing is cloud to a large extent provides you a unified experience meaning that no matter uh where you are as long as you have good connectivity to the cloud you're able to get a certain inference profile but the challenge is uh you need to have that strong connectivity to the cloud now the third scenario which you mentioned some kind of combinations of hybrid wherein can I run portions on the edge portions on the cloud absolutely possible it to some extent it already exists today uh you know a classic example would be uh if you're trying to do uh video conferencing as an example uh blue jeans teams all these folks when they enable language translation or language description they really invoke uh automatic speech recognition neural machine translation and text to speech these are heavy compute workloads and most of them are resident on the cloud the reason they're trying to be resident on the cloud is you want to be supporting multiple languages to be able to translate to English so by definition it's a very large Library file which does not you know really fit an edge envelope so they're going to be these situations wherein you have to partition a workload between Edge and the cloud now the question for us uh is you know people are working on the edge is how can you make Edge more competitive so that you can enhance user experience and I'm pretty sure um as generative AI use cases chat GPD like use cases 
become more prominent the question only becomes is you know should I be running these models on the cloud going to run portions of the edge and portions on the cloud but these are all you know good discussion topics that is happening in the industry as we speak today to really drive that partition and make Edge more relevant compared to that other Cloud yeah literally as we're recording this one of the things that uh is capturing a lot of excitement in the uh in the community is folks uh are downloading this new llm the The Meta Lambda model and running it on their their you know local machines their Mac uh uh MacBook devices and you know running to seven billion parameter llm and getting results in decent time and I saw a post just earlier today about someone who uh ran this on a Raspberry Pi um and I think this is an indication an early indication maybe that um people are going to want to see uh these types of models uh running on device and uh certainly we're going to hope that the performance is better than the 10 seconds per token than the person saw on the Raspberry Pi absolutely absolutely I mean this is going to be inevitable right people are going to understand the fact is image to image translation or textto text translation especially for complex quaries if it takes like 15 seconds to respond you're not going to get the best user experience right so the question really becomes is can I bring it somewhat manageable and can we get it done so do we know what that that Transformer specific architecture looks like and and how it differs from the convolution tuned architecture yeah so when you look at large language models by definition is an evolution of the Transformer itself uh to a large extent most of these um models are kind of basically divided into three portions the end code portion the decode portion and the unit portion but in the expectation is you you can understand the context understand the relationship between the sequential data that exist as part of the model itself and make an accurate prediction right and as it knows more about the user it makes a recommendation that's very fine-tuned to you and that's the reason Transformers has been quite successful in nature and you can see lot of evolution of many such architectures with which are like um you know multi-layered attention kind of uh uh platform because it actually is able to uh provide an answer which is very relatable to the question you ask rather than a very generic you know statement this is much more different than from a convolution heavy architecture because convolution is more about I would say classifying detecting you know kind of stuff it didn't really have any context but it kind of exactly does manual stuff that you're asking it to do now that being said researchers are already in the industry working on making convolutions or I should say CNN type architectures much more intelligent you know because it kind of solved a unique set of problems but they're trying to understand can see in s get into context I'm pretty sure with time you're going to see there but as of today uh you're seeing you know once Transformers was introduced uh way long ago with um I think from Google if I remember correct and it has taken its own life for the last 5 to 10 years and here we are with llms that are kind of banged on Transformers and so do do we know what that translates into from a a Target Hardware uh architecture perspective meaning the you know as you as you noted the convolutional networks you know the focus 
is like speeding up Matrix multiplications right you know the faster we get those that's kind of all we need um but now that attention is all we need uh you know Are there specific structures that you see evolving in Hardware to um to accommodate attention you know are there you know registers that you might buildt into Hardware that are attention specific or or kind of designed to accommodate unique uh properties of the Transformer so the first thing usually want to get into this picture is when you look at these Prometheus sty models which are coming up from openi team they're like what 175 billion parameter kind of issues and GPT 4 is releasing this week or next week I'm pretty sure it's north of 200 billion parameters so the question really becomes is as you trying to transition or bring elements of those models or even meta's uh llama model I think that you referring to earlier is around 7 to 10 billion parameter range so the question first really becomes is do you have the necessary uh memory to really store these parameters number one second thing is as you start enabling on the edge does your Hardware support fixed Point like inate or in4 uh so that you can uh not lose on performance have increased performance which a much lower memory footprint that would be a second point the third thing would be is do we have the necessary software to be in a position to release support quantization from a floating Point version to a fix Point version right that would be I would say a critical as you start looking at these large models last from a hardware perspective is do we happen to have a certain amount of acceleration and programmability to support the end code portions especially the initial layers of the stack so that your inference time is much more uh elaborate so I think you know these are some of the fundamentals I would say to really get it done apart from hey do we have the speeds and feeds in term in terms of a number of gos that's available to support it the amount of bandwidth that can you know uh Drive these transactions between memory and compute units those kind of stuff are pretty normal but our expectation is you really need to have at least three or four key elements which I just talked about on the software front and the hardware front really get these llms up and running on the U on the edge and I wouldn't be surprised in the next couple of years you would see major silicon vendors uh start you know enabling uh these large language model in some sh shape or form on their silicon another thing that came up in our conversation that I found was pretty interesting is some of the work that you're doing around microti inferencing can you talk a little bit about where that uh comes up and what's new there yeah absolutely uh the microti incing is something that we announced last year in our annual Snap Dragon tech tech Summit and one of the points or elements of that focus of attention was um these days we have an opportunity to work with quite large graphs and historically you know when you want to load these graphs you you you load this entire graph into our processing units and then you try to run operations one layer at a time so by definition it consumes a lot of power and it takes a lot of time from an inferencing standpoint so the question really becomes is how can we intelligently accelerate performance both from a latency standpoint also from a power consumption standpoint so what we did is we actually broke these crafts into smaller units called tiles and we also have 
a lot of control processing units that can handle scalar vector and Matrix based operations all we had to do was to really make sure that we have the right uh control mechanics to be able to distribute these small tiles of graphs to accelerate performance so this is what we call as micro tile inferencing is be having the ability to divide the graph into smaller units and then you know activating the several control processing units that we have as part of a system to make sure we can give out uh uh inferences at a much faster Cadence in time compared to the old style of you know reading one entire graph in its uniformity and just to be clear on the nomenclature we talking about um the a computational graph like a tensor flow or pytorch program or a uh a graph of relationships like you might process with graph neural networks oh yeah good point so this is spe this is not the graph neutal netbooks uh like you would see in medicines you know to understand DNA relationships kind of stuff but this is like at a tensor flow a model which I refer to as a graph the model could be a single model could be multiple models but we construct the graph out of it and then we execute the graph as a function of a TENS of flow a p torch model as you stated got it so in a sense you're kind of sharding your your model or your program and operating on uh shards of the the the model or the tensor program that is correct in a in a much more parallel fashion rather than a Serial fashion because you know we' been able to divide them accordingly into smaller portions of or should smaller portions of models or Ops and get them executed Pally at the same time it almost uh calls to mind kind of your traditional uh the kind of thing that you might do in Super Computing or kind of traditional distributed compute is you've got this program you want to uh be able to uh to parallelize it you kind of break it into smaller chunks and you distribute those to to different machines in that case in this case it sounds like we're talking about well I should ask are we talking about devices or are we talking about um cores for example on a single uh device or single processor we have uh I would say multiple smaller cores within the device itself and depending upon the function that is necessary if it's a vector based function then a certain Co gets activated if it's a scalar based function and a certain code gets activated and something similar with the Matrix based operations so that way we kind of distribute it intelligently depending upon what kind of function is being invoked at a certain time period and then execute those functions par got it so it's not just kind of parallelizing across uh a homogeneous uh infrastructure it's also specializing uh par is um parallelizing or Distributing it according to specialization and What that particular tile requires absolutely because uh many of the use cases as we start looking today is just not one model you could have five six 10 15 different models all needing to be executed at the same time so it's quite important to understand what the mechanics of those models are and then try to invoke it uh you know without compromising on latency yeah almost sounds like you're running a micro kubernetes on your on your chip and doing uh I forgot the name of the feature where you where you can kind of uh direct uh specific workloads with specific requirements to a particular pod yeah something like a cube flow pretty much on the edge exactly you know maybe you know just you know briefly with uh 
maybe just briefly returning to this theme of change and and evolution any uh quick thoughts on where you see this all going over the next I'll let you pick the time frame I would say you know next six to n months is going to be exciting uh you know a lot of craziness around generative AI around uh use cases that can be coming into that picture so I wouldn't be surprised wherein you know I would say AI 2.0 is kind of coming into picture wherein people actually are seeing um context into quiries they're able to understand for any given uh request that you might have youd actually getting a response which is more tailored to you so exciting times moving forward text to text image to image text to 2D text to 3D all kind of scenarios are going to Mo uh the other day I was kind of reading an article is when open came up with their models more than 150 startups have actually you know pruned up in the last uh couple of months using their API so I think exciting times um and uh we truly believe uh you know uh you know end of the day consumers or users are going to see fantastic new experiences which is just not there before yeah yeah yeah super interesting that you say AI 2.0 I've been saying AI 3.0 but it's clear that uh there is a uh n plus one happening somewhere independent of where you started yeah my count starts from 0.0 so maybe that's why I'm off by one awesome well vanesh thanks so much for uh taking the time to catch us up on uh AI at the edge and and uh everything happening there absolutely thanks for the opportunity Sam and I hope to catch up with you uh in a different podcast very soon thank youall right everyone welcome to another episode of the twiml AI podcast I'm your host Sam charington and today I'm joined by vaness sukumar vanesh is head of AIML product management at Qualcomm Technologies before we get going be sure to take a moment to hit that subscribe button wherever you're listening to Today's show vanesh Welcome to the podcast thank you Sam for providing me the opportunity to be in your talk show uh I'm super excited to jump into our conversation as we all know AI has been evolving very quickly and uh we're g to dig into how that evolution is impacting AI at the edge and the various uh use cases that are supported by Edge Technologies before we dive into that topic I'd love to have you share a little bit about your background and how you came to work in machine learning and AI yeah sure absolutely um I've been you know in the AI space for I would say give or take about 10 15 years um started my career working for jet proportional Labs those days was mostly looking at you know uh image classification object classification but under the nomenclature of uh computer vision um and those days it was all about looking at big charts and able to classify know content based of the captures with time transition more towards consumer electronic field like uh Mobile cell phones working on cameras and then you know with time transition over to PC platforms and then to to Automotive platforms and then finally at Qualcomm had the opportunity to look at AIML more horizontally across many verticals from Mobile AL the to automotive and you know the challenge with AI is what you know today is uh obsolete tomorrow so you need to be a student at heart all the time and learn new things and that keeps me quite interested while working at foro well I think that theme of change is going to be a big one uh in our conversation in our previous chat one of the things you mentioned uh that struck me 
was just how the evolution in uh use cases and uh deep learning architectures as just a couple of examples you know all these things impact the hardware architectures that are used to support these kinds of uh use cases and Technologies on the edge can you talk a little bit about what you're seeing um from a a use case Evolution perspective yeah absolutely a great question by the way historically when you look at AIML on the edge to a large extent was quite small in nature uh was very much evolving around um image uh and video wherein you want to look at image enhancements um image modifications background segmentation classification detection those kind of use cases but as um the technical acumen a lot of contributions from research scientists ecosystem acceptance of AI started to Mo you could see that um the evolution of a use cases started touching other modalities uh from image and video moving on to text uh moving on to Linguistics moving on to Commerce and many other segments so as such uh when you put this totality into picture historically to a large extent all the AI algorithms was most convolution heavy which was mostly reflecting on CNN's um because it was quite good in nature at that point of time but as you get in more modalities into the picture to support use cases or or create key experience indicators now the investment is moving towards I would say Transformers as an example um then moving on to generative content which has been quite a large of Interest these days recommendation engines which has been quite popular in social media and in Commerce platform forms the question really becomes as you know is your architecture a uh generic enough to be able to support new and upcoming architectures which as is it only meant for you know for convolutions or can it go beyond convolutions into complex architectures like large language models and recommendations engines which have been quite popular these days and or do you want to focus on architectures which can do you know very specific tasks they have been excellent in performance excellent in power efficiency see those kind of stuff so these are some of the questions and debates we ask ourselves we try to enable both and depending upon the vertical of Interest vertical elment is a specific form factor of Interest could be Mobile in nature could be Automotive in nature we try to make sure you know those Investments kind of scale accordingly yeah I'd love to to have you kind of dig into some of those tradeoffs when I think about um you know what I know about the way Qualcomm approaches enabling AI on device from you know many prior conversations one of the themes is you know really understanding the architecture and optimizing around uh that architecture which you know as you're you're suggesting here um you know maybe an architecture or maybe a a device has some special accommodation for convolutional neuron networks that allows them to operate you know really quickly and with a lot of efficiency uh and then the predominant architecture shifts uh and um you know there's something else uh that you need to to think about um you at the same time a lot of the the conversation around optimization is is uh at a a bit of a lower level than that you know optimizing matrix multiplication and um you know things like the the um the the resolution of your integers and things like that uh so maybe talk us through kind of how you've approached the the balance in the past and uh when a shift like this happens do you just optimize around the 
new thing or are you you know kind of creating something that works across the new and the old yeah that's a fair question this is something you know we could constantly challenged by Partners as well so historically you know the way we trying to approach these use cases or applications is try to understand what are the key performance indicators they could be applications which are anchored on latency anchored on performance anchored on quality of service I power efficiency or all the above right so as such um when you have you know a certain use case you try to get into certain system decomposition the decomposition really means is if I invoke a certain application I'm able to meet the kpis of interest at an application layer that kind of answers the question by itself is what is needed from a software perspective and what is needed from a hardware perspective um now the hardware is obviously a little bit more challenging is because we want to kind of think two or three years in advance right to re do we have the right investments in place to really push for it and software is a constantly evolving topic to say what kind of optimizations is necessary at the library level at the kernel level at the compiler level to be in a position to generate you know machine readable code that is easy to execute so let's start off with you know on the hardware perspective from Hardware perspective you know I would say some of the key most important elements say do we have the U the right uh data types in qualcomm's perspective our position and push has always been we want to move towards fixed Point as we believe you know it gives you the highest performance uh uh quite uh you know um leadership class power efficiency and also a very small memory footprint compared to that of a floating Point uh representation of the same model so as such you know we continue to push the boundaries of where we invest in data types historically it was Flo and then we translation to inate and we now moving towards int for representation right so that's one one perspective the second thing is um you know what would be the uh the Investments necessary from a compute standpoint from a local tightly coupled memory standpoint with the expectation that you could you know you know store the bias functions within the title couple memory rather than making round trips to damp because you want to see on power efficiency as an example right so there's always a question of you know what what kind of compute uh is actually you know the right amount of investment so here we start looking at use cases and then try to understand what is the state of art architectures really supporting those kind of use cases right and then we try to construct the investment in compute around that now last but not the least is elements around um bandwidth you know do we have competition schemes that would really push for high INF per second uh you know how is the U register spacing established what kind of sparcity or intelligence needs to be established as part of the hardware so that you don't really spend cycles of time in trying to do zero based computations so all these are feature enhancements that we really push for uh you know from a hardware perspective to really you know drive for leadership then you go towards the software level now software level is obviously complicated here um uh we can start off with quantization quantization is a metric that we heavily push for wherein we can actually provide the opportunity to our partners to transition 
toward it's um you know uh fix point in 8 Bits and four bits for both activation and weights so you know we have a tool called AIT and we're putting a lot of energy on AMT which is artificial intelligence model efficiency toolkit that provides the flexibility to our partners to not lose an accuracy while getting higher performance right then we move on to our AI stack our you know which we recently uh announced in a couple of months ago last year which we call as a qualcom stack and as part of it we have the runtime we have two tools um we have libraries and the lowlevel API structure that is flexible enough to execute this model or desired application on various OS could be on Windows could be on Linux could be on Android those kind of subsystems here you know uh there's a lot of emphasis on how do we optimize on performance now performance is a very broad term so we look at if I happen to have concurrency you know how can I pack a single graph that is comp you know made up of composite smaller graph into uh you know into our Hardware mechanics and then execute them as an example second thing is um if I have to you know securely store the model in in this case you know what kind of encryption schemes needs to be supported to make sure somebody doesn't hack into the model of Interest the other could be is what kind of preemption schemes I would have if you happen to have concurrency can I you know support all these graphs simultaneously without compromising and quality of service so all these are investments that we really push for working with our engineering staff uh you know these features get updated on a regular basis to really make sure we can you know push for much larger deployment of applications which are quite complex in nature across all bu and one good thing is uh since we have a unified qualcom AI stack that works from mobile all the way to Automotive the lessons learned from Automotive get propagated to mobile and from mobile get propagated to Auto as well that way there is a lot of handshake of good information that that way the applications can be deployed across all form factors with with these you mentioned uh mobile and auto what are the specific use cases you're seeing in each of those domains you know as the shift from convolutions to Transformers uh takes hold are there any in particular the jump out is kind of driving the trend historically when you look at mobile as an anchor point and you look at as a consumer buying pattern uh to a large extent people are buying phones because of good camera so which means most of your AI analytics is revolving around image or revolving around video right right so they are to a large extent convolution heavy now people are slowly you know trying to bring in text as an input and then fuse image with text where Transformers is becoming a quite an important element of interest and then try to see can I use the um you know the sequential data that is embedded as part of a transformer Network to understand context and make a much better prediction so we kind of you know revolving around text or an image and quality but with different classes of networks which are not necessarily Evolution heavy but more towards Transformers right so that's I would say to a large extent um an anchor point for mobile now obviously in Mobile there's a bigger challenge is you know it operates on battery so our expectation is we really give this performance at much lower power uh so that you don't drain the battery as much as possible and have a much 
But as you move toward automotive, it's a completely different space. Concurrency is a big deal: you're trying to work with fifteen or sixteen different sensors, whether camera, lidar, or radar, and fuse them together to deliver a given quality of service. Latency is a much bigger deal. The physical input tensors, the resolution of the camera, are much bigger: 1, 3, 5, or 8 megapixels, whereas on the mobile side the input tensors are very small, probably 512 x 512, 256 x 256, or 224 x 224 kinds of scenarios. On the automotive side, because you need a much larger field of view and you want to make predictions or decisions much more stringently, you take a much larger input tensor, so the physical compute requirements are far more intensive than what you'd see in a mobile environment. Not to mention, you're also going to see the ask for one AI model that does multiple things: it could handle the drive policy, it could handle the parking stack, all with the same infrastructure making those predictions. So it's a different set of challenges from an automotive standpoint compared to mobile, but having a unique and consistent stack on both the mobile and automotive fronts has really helped us accelerate new operators and new complex models, so our developers can use this investment and drive innovative applications.

When you speak about the compute requirements on the auto side, I'm wondering if you're familiar with this: I saw an article just the other day that referred to an MIT paper. They did a high-level analysis, and the conclusion they came to was that at full deployment of fully autonomous driving, when the dominant mode of transportation is autonomous vehicles, the projected compute requirement was pretty much on par with all of the data centers today. Have you seen that paper by any chance?

Not necessarily, but I think the gist of what they're claiming, as you suggest, is the compute requirement within that space. Historically, when you look at automotive, there is an infotainment side, which is for in-cabin entertainment purposes, and then there is an ADAS side; ADAS goes from L1 to L5 depending on the level of autonomy.

And I believe they were specifically referring to getting to a fully distributed fleet of Level 5 autonomous vehicles.

Doing that, having both infotainment and that level of autonomy services, is quite complex in nature, and one layer on top of it is fleet management: how can you have multiple trucks, or multiple robotaxis, efficiently managed from one control center? All of that information handling is going to be quite complex, and I wouldn't be surprised if you were really looking at an infrastructure on the order of what powers the data centers today, because I think that's where you're headed as fully autonomous vehicles and fleet management gain broader acceptance among consumers.
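As a rough back-of-envelope (not from the conversation itself): for a fully convolutional backbone, compute scales roughly with pixel count when the architecture is held fixed, which is one way to see why per-vehicle perception budgets balloon relative to mobile. The camera count and frame rate below are illustrative assumptions.

```python
# Back-of-envelope: why automotive perception is so much heavier than mobile.
# For a fully convolutional backbone, compute scales roughly with pixel count.
mobile_res = 224 * 224            # typical mobile classification input
automotive_res = 3840 * 2160      # roughly an 8 MP camera frame

scale = automotive_res / mobile_res
print(f"~{scale:.0f}x more compute per frame")   # ~165x

# Multiply by sensor count and frame rate for a fused, multi-camera stack.
num_cameras, fps = 8, 30
print(f"~{scale * num_cameras * fps:.0f}x a single mobile inference, per second")
```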
Yeah, so it sounds like that level of scale doesn't surprise you, given what you know about ADAS and those requirements.

Absolutely. I think we've been doing that to some extent historically as well. If you look at our history, we started small, with a much smaller footprint in the mobile space. Over the last five to six years we expanded to adjacent business units, whether AR/VR, compute, IoT, or automotive, and with each adjacent business getting a lot more traction in AI, the requirements have scaled up much more than we had expected. So to answer your question: have I been surprised? Not necessarily; we've been planning all along. We just have to make sure the time to market, the time to really accelerate solution deployment, can be much faster, because the rate of innovation happening in the automotive space, for example, is enormous. I see a lot of L2 and L3 vehicles coming out this year and next year, and as people start transitioning to Level 4 and Level 5, which is full autonomy, the question is whether we have the necessary infrastructure to act on the compute investments and software investments you're going for. That's what we've been pushing to make happen.

Yeah. So we've talked about mobile and we've talked about automotive. A third area that has historically been interesting for edge use cases is the enterprise, and kind of distributed things like distributed cameras. Is that still an interesting or growing use case?

So when you look at form factors, automotive is one extreme. Enterprise could be seen as something like PC and compute devices, and it could also be seen as personalized devices like wearables and hearables as that market takes off. The PC as a form factor is gaining a lot of popularity within the enterprise segment, for good or bad reasons post-COVID, simply because work from home has become quite important, and the need for AI in the PC has just soared since the pandemic. A few examples would be: how can I improve video conferencing, and how can I improve streaming within the enterprise space? With streaming, historically there has been enterprise streaming and game streaming; enterprise streaming always used to run on a reference platform, but post-pandemic people want to do it from their own homes. The question has become: are we in a position to support an infrastructure for enterprise streaming, and what kind of form factor will push for it? The PC has been a fantastic form factor to drive it, and in most of our engagements moving forward you're going to see a lot of big announcements on that front and on how we're positioning the PC as a form factor to drive enterprise-scale experiences.

That's interesting; I hadn't thought about the PC experience in that question. I was mostly thinking of what you might call more embedded use cases, like distributed cameras and distributed sensors and that kind of thing. Do you see a lot of growth in those use cases?

Absolutely. I would position it this way: when you look at AI inferencing today, to a large extent it is very generic in nature. What applies to me applies to you in any given context; there is no personalization element to it.
So the question becomes: can I collect more data about the user to drive personalization? That could come from sensors registered to the user or cameras registered to the user. If I get more pieces of information, whether visual, textual, or audio-specific, can I make the inference on the edge more optimal for that user? This is gaining a lot more popularity, but the biggest challenge in enabling this class of use case has been: how much data is good data, and what data is good data? How do I enable some level of supervision to make sure the data that has been collected is also labeled properly, because without labeled information I can't engage in retraining? And if I have to do retraining on the edge, what constraints do I face on the hardware and software side to be in a position to support learning or retraining? All of these big questions come up, and the ecosystem in general is trying to understand how it can scale moving forward. This is one fantastic example: collecting the right amount of data and then driving toward personalized inference.

Interesting; that touches on themes of data-centric AI, which is a big conversation in the industry: shifting from masses of lower-quality data to small amounts of higher-quality data, and putting a lot of care into how that data is curated so as to make more efficient use of it in training and tuning. That seems right in line with the kind of thing you'd want to worry about if you're dealing with edge-type devices.

I would say absolutely. The industry is morphing away from a model-centric view toward a data-centric view. But if you transition to a data-centric view, the question becomes: can I invest across the entire picture of MLOps, from data annotation all the way to data monitoring? Especially on the edge, if you're going toward more user-based personalization, it's extremely critical to have the elements of collecting the right data, labeling it, filtering it, and training on the edge. Then, once you enable inference, you want to monitor it: are you seeing some kind of drift, what kind of drift is it, and can I make any compensation based on that drift? So yes, it's absolutely becoming more critical. How we make it happen on the edge is always a big question mark, because the physical structure of edge devices is very different and not all devices can support it, so the question becomes: can I run partially on the cloud and partially on the edge? These are big questions the industry is facing today, and I'm pretty sure in the next six to nine months we'll see opportunities where people say AI is actually becoming more meaningful to me, because now I'm engaging in a more extensive MLOps investment rather than the very toned-down, generic, no-common-sense AI inference that exists today.
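As an aside on the drift-monitoring point: here is a minimal sketch of flagging input drift on device with a two-sample Kolmogorov-Smirnov test from SciPy. The feature, threshold, and distributions are toy assumptions; production monitoring would track many features and model outputs over time.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_score(reference: np.ndarray, recent: np.ndarray, alpha: float = 0.01) -> dict:
    """Flag drift when recent on-device data no longer matches the
    reference distribution the model was trained or calibrated on."""
    stat, p_value = ks_2samp(reference, recent)
    return {"ks_stat": float(stat), "p_value": float(p_value), "drift": p_value < alpha}

# Toy example: a feature whose distribution shifts after deployment.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # calibration-time data
recent = rng.normal(loc=0.6, scale=1.2, size=1_000)      # what the device sees now

print(drift_score(reference, recent))
# e.g. {'ks_stat': ~0.26, 'p_value': ~1e-50, 'drift': True}
```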
When you think about what that MLOps footprint looks like, this edge-aware MLOps footprint, what's different about it from the way folks are doing MLOps in non-edge environments?

When you look at MLOps as it exists today, there are, I would say, two or three scenarios. In the first, you have a pre-trained model, either available in the open-source community or already developed by somebody. You take the pre-trained model, quantize it, fit it to the envelope of your edge device, invoke it, and get it running. Is there drift? Do you see drift over time? Nobody monitors it, nobody cares about it; it's considered complete. The second category is where you don't have a pre-trained model and you want to start from the beginning. A classic example is an in-cabin use case within the automotive segment: a driver monitoring system, where you want to understand the emotional state of the driver in order to take certain actions, maybe even take control from them, because if the driver isn't in the right state of mind you don't want to encourage them to keep driving. But understanding the driver's emotional state means you need to make accurate predictions, and for that you need access to data, and today you don't have many data points. So there is a lot more emphasis on creating synthetic data to construct scenarios that capture the emotional state of a driver. Synthetic data was not popular three or four years ago, but now it's becoming extremely popular because you don't have enough real data to build the right model. Once you have that, you again get a pre-trained model, you quantize it, you get it running, and you're fully functional. Those first two scenarios are, to a large extent, what's common in the industry today. The third portion is: how do I constantly learn? How can I make sure the model I've derived from a synthetic data source or a pre-trained data source can evolve over time as I understand the user more diligently? That doesn't happen today, so the question becomes how to make it happen. A couple of years ago Tesla introduced an approach where they would take elements about the road and the scenarios they were trying to handle, send them to the cloud, optimize a model, and push it back to the device. Google talked about a similar concept a couple of years ago, what we call federated learning. The idea is similar: information that is very personal to the user shouldn't leave the device, so can I take the infrastructure that used to live in the cloud, bring it to the edge, and constantly make the model learn so it makes fewer and fewer mistakes going forward, and on top of that accommodate user-specific information? That's the big difference moving forward from an industry standpoint.
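To make the federated-learning reference concrete, here is a minimal FedAvg sketch in PyTorch: each simulated device trains locally and only its weights are averaged centrally, so raw user data never leaves the device. This is a generic textbook sketch, not Google's or Qualcomm's implementation; the model, data, and hyperparameters are toy assumptions.

```python
import copy
import torch
import torch.nn as nn

def local_update(global_model: nn.Module, data, target, lr=0.01, epochs=1) -> dict:
    """One client's on-device training pass; only weights leave the device,
    never the raw user data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(data), target).backward()
        opt.step()
    return model.state_dict()

def federated_average(client_states: list) -> dict:
    """FedAvg: element-wise mean of the clients' weights."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key] for s in client_states]).mean(dim=0)
    return avg

# One toy round with three simulated devices.
global_model = nn.Linear(4, 1)
clients = [(torch.randn(32, 4), torch.randn(32, 1)) for _ in range(3)]
states = [local_update(global_model, x, y) for x, y in clients]
global_model.load_state_dict(federated_average(states))
```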
You spoke a bit about the metrics you track and care about when you're looking at edge use cases and trying to map them onto hardware architectures, things like latency and bandwidth. To what degree does each of these architectures drive its own distinct usage patterns? We used to think about the cloud, at least in a hybrid environment, as provisioning locally for your sustained usage and then bursting to the cloud for peak usage. Is there that kind of scenario, or that kind of thinking, when you're thinking about the edge?

I would say, to a large extent, that the edge by definition has a very wide footprint: in compute, in bandwidth, in memory, and in power, which really constrains what kind of inference is possible on the edge. That's one thing. The second thing is that the cloud, to a large extent, gives you a unified experience: no matter where you are, as long as you have good connectivity to the cloud, you get a certain inference profile. The challenge is that you need that strong connectivity. The third scenario, which you mentioned, is some kind of hybrid: can I run portions on the edge and portions on the cloud? Absolutely possible, and to some extent it already exists today. A classic example is video conferencing: BlueJeans, Teams, all these folks, when they enable language translation or language description, invoke automatic speech recognition, neural machine translation, and text-to-speech. These are heavy compute workloads, and most of them are resident in the cloud. The reason they stay in the cloud is that you want to support translating multiple languages into English, so by definition it's a very large library that doesn't fit an edge envelope. There are going to be these situations where you have to partition a workload between the edge and the cloud. The question for those of us working on the edge is how to make the edge more competitive so you can enhance the user experience. And I'm pretty sure that as generative AI use cases, ChatGPT-like use cases, become more prominent, the question will become: should I run these models in the cloud, or run portions on the edge and portions in the cloud? These are good discussions happening in the industry as we speak, to really drive that partition and make the edge more relevant compared to the cloud.

Yeah, literally as we're recording this, one of the things capturing a lot of excitement in the community is folks downloading this new LLM, Meta's LLaMA model, and running it on their local machines, their MacBook devices: running the seven-billion-parameter LLM and getting results in decent time. I saw a post just earlier today from someone who ran it on a Raspberry Pi. I think this is an early indication that people are going to want to see these types of models running on device, and certainly we're going to hope the performance is better than the 10 seconds per token that person saw on the Raspberry Pi.

Absolutely, absolutely. This is going to be inevitable. People are going to realize that with image-to-image translation or text-to-text translation, especially for complex queries, if it takes 15 seconds to respond, you're not going to get the best user experience. So the question becomes: can I bring that down to something manageable, and can we get it done?
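A quick bit of arithmetic on why decode speed matters for the user experience: the Raspberry Pi rate comes from the anecdote above; the response length and the other rates are illustrative assumptions.

```python
# How long does a 100-token answer take at different decode speeds?
response_tokens = 100

for label, tokens_per_second in [
    ("Raspberry Pi anecdote (~0.1 tok/s)", 0.1),
    ("Barely usable (~2 tok/s)", 2),
    ("Comfortable reading pace (~10 tok/s)", 10),
]:
    seconds = response_tokens / tokens_per_second
    print(f"{label:38s} -> {seconds:7.1f} s per response")
# 0.1 tok/s is roughly 17 minutes; 10 tok/s is about 10 s, which starts to feel interactive.
```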
So do we know what that Transformer-specific architecture looks like, and how it differs from a convolution-tuned architecture?

When you look at large language models, they are by definition an evolution of the Transformer itself. To a large extent these models are divided into portions, the encoder portion, the decoder portion, and the UNet portion, and the expectation is that you can understand the context, understand the relationships in the sequential data that exists within the model, and make an accurate prediction. As it learns more about the user, it makes recommendations that are finely tuned to you, and that's the reason Transformers have been so successful. You can see a lot of evolution of such architectures, these multi-layered attention style platforms, because they're able to provide an answer that is very relatable to the question you asked rather than a very generic statement. That is quite different from a convolution-heavy architecture: convolution is more about classifying and detecting; it doesn't really have context, it just does the specific task you ask it to do. That said, researchers in the industry are already working on making convolutions, CNN-type architectures, more intelligent, because they solved a unique set of problems, and they're trying to understand whether CNNs can take in context. I'm pretty sure with time you'll see that, but as of today, once Transformers were introduced, quite a while ago now, from Google if I remember correctly, they have taken on a life of their own over the last five to ten years, and here we are with LLMs that are built on Transformers.

And do we know what that translates into from a target hardware architecture perspective? As you noted, with convolutional networks the focus was speeding up matrix multiplications: the faster we get those, that's kind of all we need. But now that attention is all we need, are there specific structures you see evolving in hardware to accommodate attention? Are there registers you might build into hardware that are attention-specific, or designed to accommodate the unique properties of the Transformer?

The first thing that enters the picture is that these GPT-style models coming out of the OpenAI team are on the order of 175 billion parameters, and GPT-4 is releasing this week or next week; I'm pretty sure it's north of 200 billion parameters. Meta's LLaMA model that you referred to earlier is in the 7-to-10-billion-parameter range. So as you try to bring elements of those models to the edge, the first question is: do you have the necessary memory to store those parameters? The second is: as you start enabling them on the edge, does your hardware support fixed point, INT8 or INT4, so that you don't lose performance and in fact gain performance with a much lower memory footprint? The third is: do we have the necessary software to support quantization from a floating-point version to a fixed-point version? That is critical as you start looking at these large models. And last, from a hardware perspective: do we have a certain amount of acceleration and programmability to support the encoder portions, especially the initial layers of the stack, so that your inference time is manageable? I think these are some of the fundamentals needed to really get it done.
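To put the memory question in numbers, here is a rough weight-only footprint calculation, parameters times bytes per parameter; it ignores activations, the KV cache, and runtime overhead, and the parameter counts are the round figures mentioned above.

```python
# Rough weight-only memory footprint: parameters x bytes per parameter.
# (Ignores activations, KV cache, and runtime overhead.)
BYTES = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for name, params in [("7B model", 7e9), ("175B model", 175e9)]:
    sizes = ", ".join(f"{fmt}: {params * b / 1e9:.1f} GB" for fmt, b in BYTES.items())
    print(f"{name}: {sizes}")
# 7B   -> FP32: 28 GB,  FP16: 14 GB,  INT8: 7 GB,   INT4: 3.5 GB
# 175B -> FP32: 700 GB, FP16: 350 GB, INT8: 175 GB, INT4: 87.5 GB
```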
Beyond that, there are the usual speeds and feeds, the number of GOPS available to support it, the amount of bandwidth that can drive transactions between memory and the compute units; those are pretty standard. But our expectation is that you really need at least the three or four key elements I just talked about, on the software front and the hardware front, to get these LLMs up and running on the edge. I wouldn't be surprised if, in the next couple of years, you see major silicon vendors start enabling these large language models in some shape or form on their silicon.

Another thing that came up in our conversation that I found pretty interesting is some of the work you're doing around micro-tile inferencing. Can you talk a little bit about where that comes up and what's new there?

Yeah, absolutely. Micro-tile inferencing is something we announced last year at our annual Snapdragon Tech Summit. One of the focal points was that these days we have the opportunity to work with quite large graphs, and historically, when you want to load these graphs, you load the entire graph into our processing units and run operations one layer at a time. By definition that consumes a lot of power and takes a lot of time from an inferencing standpoint. So the question became: how can we intelligently accelerate performance, both from a latency standpoint and from a power-consumption standpoint? What we did was break these graphs into smaller units called tiles. We also have a number of control processing units that can handle scalar, vector, and matrix-based operations, so all we had to do was make sure we have the right control mechanics to distribute these small tiles of the graph and accelerate performance. That's what we call micro-tile inferencing: having the ability to divide the graph into smaller units and then activate the several control processing units we have in the system, so we can deliver inferences at a much faster cadence compared to the old style of reading one entire graph in its entirety.

And just to be clear on the nomenclature, are we talking about a computational graph, like a TensorFlow or PyTorch program, or a graph of relationships like you might process with graph neural networks?

Good point. This is not graph neural networks, like you would see in medicine for understanding DNA relationships. This is a TensorFlow or PyTorch model, which I refer to as a graph. The model could be a single model or multiple models, but we construct the graph out of it and then execute the graph as a function of that TensorFlow or PyTorch model, as you stated.

Got it. So in a sense you're sharding your model or your program, and operating on shards of the model or the tensor program.

That is correct, and in a much more parallel fashion rather than a serial fashion, because we've been able to divide them into smaller portions of models or ops and get them executed in parallel at the same time.
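Micro-tile inferencing is a hardware and runtime capability, so no public code maps to it directly. Purely as a toy illustration of the idea, partitioning a graph into tiles tagged by the kind of unit suited to them and dispatching them concurrently instead of streaming one monolithic graph, here is a hypothetical Python sketch; every name in it is made up, and the thread pool only simulates what dedicated scalar, vector, and matrix cores would do.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Toy illustration only: a "tile" is a small slice of the computational graph,
# tagged with the kind of unit best suited to execute it.
@dataclass
class Tile:
    name: str
    kind: str  # "scalar", "vector", or "matrix"

def run_on_unit(tile: Tile) -> str:
    # A real runtime would hand the tile to a dedicated scalar/vector/matrix core;
    # here we just pretend to execute it and report where it ran.
    return f"{tile.name} executed on {tile.kind} unit"

# A graph split into micro-tiles instead of being loaded as one monolith.
tiles = [
    Tile("embedding_lookup", "scalar"),
    Tile("layernorm", "vector"),
    Tile("attention_matmul", "matrix"),
    Tile("ffn_matmul", "matrix"),
    Tile("softmax", "vector"),
]

# Dispatch tiles concurrently rather than serially streaming one big graph.
with ThreadPoolExecutor(max_workers=3) as pool:
    for result in pool.map(run_on_unit, tiles):
        print(result)
```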
It almost calls to mind the kind of thing you might do in supercomputing or traditional distributed compute: you've got this program you want to parallelize, so you break it into smaller chunks and distribute them to different machines. In that case it's machines; in this case, I should ask, are we talking about devices, or cores, for example, on a single device or a single processor?

We have multiple smaller cores within the device itself, and depending on the function that's needed, if it's a vector-based function a certain core gets activated, if it's a scalar-based function a certain core gets activated, and something similar happens with matrix-based operations. That way we distribute the work intelligently depending on what kind of function is being invoked at a given time, and then execute those functions in parallel.

Got it. So it's not just parallelizing across a homogeneous infrastructure, it's also specializing: parallelizing or distributing according to specialization and what that particular tile requires.

Absolutely, because many of the use cases we're looking at today are not just one model. You could have five, six, ten, fifteen different models all needing to execute at the same time, so it's quite important to understand the mechanics of those models and then invoke them without compromising on latency.

Yeah, it almost sounds like you're running a micro-Kubernetes on your chip. I forget the name of the feature where you can direct specific workloads with specific requirements to a particular pod. Something like Kubeflow, pretty much on the edge.

Exactly.

Maybe just briefly returning to this theme of change and evolution: any quick thoughts on where you see this all going over the next, I'll let you pick the time frame?

I would say the next six to nine months are going to be exciting: a lot of craziness around generative AI and the use cases coming into the picture. I wouldn't be surprised if what I'd call AI 2.0 comes into the picture, where people actually see context in their queries: for any given request you might have, you get a response that is more tailored to you. So, exciting times moving forward: text-to-text, image-to-image, text-to-2D, text-to-3D, all kinds of scenarios are going to move ahead. The other day I was reading an article saying that since OpenAI came out with their models, more than 150 startups have sprung up in the last couple of months using their API. So I think exciting times, and we truly believe that, at the end of the day, consumers and users are going to see fantastic new experiences that just weren't there before.

Yeah, it's super interesting that you say AI 2.0. I've been saying AI 3.0, but it's clear that there is an n-plus-one happening somewhere, independent of where you started.

My count starts from 0.0, so maybe that's why I'm off by one.

Awesome. Well, Vinesh, thanks so much for taking the time to catch us up on AI at the edge and everything happening there.

Absolutely. Thanks for the opportunity, Sam, and I hope to catch up with you on a different podcast very soon. Thank you.