The Art and Science of Leveraging Large Language Models
In today's rapidly evolving natural language processing (NLP) landscape, large language models have become an essential tool for applications such as text analysis, sentiment analysis, and content generation. These models can accomplish impressive feats from a prompt alone, and in some cases it makes complete sense to use the larger, newer model, but not in every scenario. That is where pre-processing comes in: with basic heuristics you can determine which model to leverage and when. Ultimately, that is a conversation each organization needs to have with its internal team or its partners, whether that is Microsoft or another company.
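To make that pre-processing idea concrete, here is a minimal sketch of a heuristic router. The deployment names, thresholds, and keywords are assumptions for illustration only; in practice they would come from your own evaluation of cost and quality.

    def pick_model(prompt: str) -> str:
        """Route a request to the cheapest model that can plausibly handle it."""
        # Hypothetical heuristics: long prompts or reasoning-heavy requests go
        # to the larger model, everything else to the smaller, cheaper one.
        word_count = len(prompt.split())
        needs_reasoning = any(
            kw in prompt.lower()
            for kw in ("explain why", "step by step", "analyze", "compare")
        )
        if word_count < 200 and not needs_reasoning:
            return "gpt-35-turbo"  # smallest, cheapest, fastest (placeholder name)
        return "gpt-4"             # reserve the larger model for harder requests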
It takes good architecture to manage those costs: leverage the dashboards you get from your provider, use the least expensive, smallest model possible, and pre-process where you can so that only the requests that truly need the larger models reach them. Another pattern we have seen is taking a document or some other corpus of data, summarizing it with GPT-3.5, and then feeding that summary into a richer, deeper model like GPT-4. The answer that comes back is less expensive and uses fewer tokens than if you had fed everything into GPT-4 directly.
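As a rough sketch of that summarize-then-enrich pattern, the snippet below uses the openai Python client against Azure OpenAI. The endpoint, key, and deployment names are placeholders, and the exact client configuration will depend on your environment.

    from openai import AzureOpenAI  # assumes the v1 openai package

    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
        api_key="YOUR-KEY",                                       # placeholder
        api_version="2023-12-01-preview",
    )

    def cheap_summary_then_rich_answer(document: str, question: str) -> str:
        # Step 1: condense the long document with the smaller, cheaper model.
        summary = client.chat.completions.create(
            model="gpt-35-turbo",  # placeholder deployment name
            messages=[{"role": "user",
                       "content": f"Summarize the key points:\n\n{document}"}],
        ).choices[0].message.content

        # Step 2: send only the short summary to the larger model, which
        # typically consumes far fewer tokens than sending the raw document.
        return client.chat.completions.create(
            model="gpt-4",  # placeholder deployment name
            messages=[{"role": "user",
                       "content": f"Using this summary:\n{summary}\n\n"
                                  f"Answer the question: {question}"}],
        ).choices[0].message.content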
Whether in the context of RAG or plain prompting, it is a little bit of an art rather than a science, and that is where third-party or first-party tools with prompt variants can be worth their weight in gold, depending on what you come up with. "Art versus science" conveys that there is a lot of trial and error: you have the ideas above, of course, but a big part of it is simply starting with what is easy, perhaps a big model, and whittling it down where you can to get costs under control.
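The prompt-variant idea mentioned above can also be approximated by hand: ask a model to rephrase a prompt several ways and compare the outputs. The helper below is a hypothetical sketch (it reuses the client from the previous example); tools such as Azure's prompt flow expose this as a built-in prompt variants feature.

    def generate_prompt_variants(base_prompt: str, n: int = 5) -> list[str]:
        """Ask a model for n alternative phrasings of base_prompt (illustrative only)."""
        response = client.chat.completions.create(
            model="gpt-35-turbo",  # placeholder deployment name
            messages=[{
                "role": "user",
                "content": (f"Rewrite the following prompt {n} different ways, "
                            f"one per line, preserving its intent:\n{base_prompt}"),
            }],
        )
        lines = response.choices[0].message.content.splitlines()
        return [line.strip() for line in lines if line.strip()][:n]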
My advice in general is to leverage the tried-and-true models where you can, and if there is a reason you cannot, that is where you want to start looking at prompting, RAG, or even fine-tuning. The models available today are very mature and very powerful; I would always try to get the most out of them before undertaking building your own or moving into the fine-tuning space. It is amazing what some accurate prompting can do when you thought a result was only achievable with fine-tuning.
Looking Ahead: The Future of Large Language Models
In our original recording we discussed predictions about where the field was headed, and within two weeks half of them had come to pass. For those who were not part of that recording, the expectation was not necessarily a quantum step in knowledge, but rather improvements in performance and ways to be more energy efficient. We are already starting to see that: GPT-4 Turbo was announced, which is much more performant, and prices actually went down, which is fantastic.
GPT-4 Turbo also adds more multimodality, so you can now feed in what is essentially a picture book and use it as input. Eventually we will probably get to the point where video, and perhaps 3D models or objects, could be inputs as well. There were also some really interesting products and platforms announced, and surely more will be announced at Ignite. The GPT Store was a really cool concept coming out of OpenAI's Dev Day; it remains to be determined what impact it will have, or what part Microsoft will play in it.
Very cool things are coming, and you can really tell that the OpenAI team is passionate about connecting with the developer community and making others successful. Jay, thanks so much for joining me once again to share a bit about what you have been seeing working with startups and digital natives on LLM enablement. Sam, thank you so much; it has been an absolute pleasure, and I look forward to talking with you soon.
"WEBVTTKind: captionsLanguage: enall right everyone welcome to another episode of the twiml AI podcast I am your host Sam charington today I'm joined by J Emory director of technical sales and architecture at Microsoft Azure before we get into today's conversation be sure to head over to Apple podcast or your listening platform of choice and if you enjoy the show please leave us a rating and review Jay welcome to the podcast thank you Sam thank you for having me uh I'm excited to talk to you you spend a lot of your time working with uh organizations startups and and fast growing organizations in particular uh that are building around large language models and I'm excited to kind of pick your brain around what are some of the things that you're learning that you're seeing them struggle with that you're helping them with um should be a really interesting conversation especially uh given how quickly this field is moving and uh some of the cool things that people are doing absolutely it buckle up it is it's constantly changing uh week by week and sometimes hour by hour it's fantastic uh it is uh and I I guess I'll mention I don't necessarily uh often kind of talk about the things happening behind the the screen or through the the behind the veil or whatever but we there's the second time we're having this conversation we had a technical glitch doesn't happen very often but it it happened last time and we the audio quality wasn't where we like it to be for the the conversations and so we're having this one again a week and a half later maybe and you could almost say the world changed like two days ago three days ago uh when open AI uh had their Dev day uh a lot of the things that we were talking about as maybe possible Futures uh like we actually heard about at uh at at Dev day and we may touch on that uh but before we jump into the conversation uh I'd love to have you uh kind of introduce yourself to our audience share a bit about your background and uh what you do at Microsoft and I think we have St Louis in common you spent a bunch of time there we do we do I uh spent uh almost two decades uh working for anheiser Bush uh in various engineering and it leadership roles and and and funny to note I actually spent like five or six years uh abroad in Europe um doing various projects and and and strategic initiatives so um finally brought me back here to California where I joined Microsoft and have been with Microsoft for almost four years now um like you mentioned earlier uh really partnering with uh digital Natives and startups uh the team really focuses on what we would consider strategic Partnerships with within the startup ecosystem so we partner uh a lot with venture Capital firms and some of their portfolio companies and finding ways that the Microsoft relationship can ultimately grow their total addressable Market or make them more successful by getting them access to some of our our Enterprise customers and are you is uh generative Ai and the the Azure open AI products is that uh you know what takes up all of your time is is that just one of the tools in a a broader tool chest it is it's it's interesting Sam so now it is uh I think that um you know a year ago two years ago a big portion of what was attractive to startups was uh our partnership our ability to kind of reach into the Enterprise customer the the large Fortune 500 Fortune 1000 lately it's been everything AI uh the Advent of the large language model and Azure openai and open AI has really just busted open the doors uh for 
opportunities for us to find ways to partner together and many times some of the some of the startups that weren't answering our calls are actually calling us directly and so it's been a very refreshing and exciting time to to be be at Microsoft that uh to call the the open AI partnership strategic seems like an understatement often that's like a fluffy thing but it is it is I I will say and and I mean this with With all sincerity I think the the cultural change that Saia brought to Microsoft has just been fantastic and and it's and it's evident even in these partnership with open AI where we were very likely developing our own large language model but he was willing and able to lean in notice when somebody is doing a better job or or or is more maybe ready to go to market much sooner and to to Ink that strategic partnership has just you know oneone wonders I'm sure for open AI but has just really revolutionized Microsoft and ready to take us into the next century and in fact Microsoft historically has been very well positioned from a a research perspective um you know for a while many of the best conversations I've had around NLP were Folks at Microsoft that all had this like common root in having worked on Bing yeah um just a lot of expertise there um not to mention you know some of the achievements coming out of the Microsoft research um yeah and I think one of the thing obviously Azure openi gets or open ey gets kind of the Lion Share of the press and attention but there there's even a bigger story around how it ties into maybe our our um cognitive Services portfolio and general so it's truly a One-Stop shop Azure open AI is obviously the kind of the Crown Jewel in it at this point point but we have some really capable uh products and services that can help enhance or or Aid in developing your own large language model or figuring out a ways to to kind of integrate it in with your your existing ID um so try to characterize for me the the kind of maturity level I guess of the companies that you're working with are they all coming to you um you know knowing all about llms and uh you know having built products and trying to you know get some help uh you know optimizing kind of putting the cherry on top or are they uh earlier stage and um uh you know needing help kind of getting started like yeah and I imagine that you know they come in different flavors but like how do you think about uh you know the the folks that you work with and kind of where they are from a maturity and and knowledge perspective around llms excellent question we benefit from getting to work with uh startups and or or what we would consider digital native organizations I think that they have taken a more aggressive and maybe thought forward approach to large language models I think if you look at a traditional Enterprise customers with rare exception most don't have a large language model expert or maybe even an AI expert in staff uh and rely pretty heavily on consulting firms and uh external uh uh service providers to to bring that to Bear uh most if not all of these digital natives have someone on staff that's familiar with it and I think it's just the Clairvoyance of knowing that's where the industry is going and so what we tend to find with startups is they are ready to go feet first in and a lot of times we're super super early adopters we of course have seen some shifts and Trends with uh with large language models probably the past year year and a half uh but but by and large the customers that we interact with 
are are very thought forward um they're either figuring out how to integrate generative AI into their existing product or platform or maybe they're actually developing their own and need to leverage some of the the GPU capacity that uh We've fortunately been able to to build out and and share to our customers uh so that having been said what do they need you for like what are the challenges that they're running into that uh cause your your proverbial phone to phone line the ring or your teams to start blinging it exactly exactly you know so I we kind of tend to see um you know a lot of times there's this I would say misconception around security or data privacy I think that with any new technology uh the cesos door is you know Bell is ringing off the hook and there's a lot of concern um especially in the startup space IP is their lifeline and there's a lot of concern about IP leaking or or um being leveraged uh by a competitor uh so one of the very first things we do is try to dissmell and dispel any myths around their data or your data being used to train new models and it's simply not the case um I think it was stories like the Samsung story uh continue to live on in people's minds they do it is and so you know if you look back at at at 3.5 turbo uh that model was trained up until I want to say it's like September 2021 data and it was only leveraging openly available thus the open an open AI things like Wikipedia or you know publicly available novels or articles or books um and in essence the way that the models work is there's data put into a prompt and then the model does its stuff and and brings back a response and it's always done in memory and after either your session terminates or you end your session um that memory is re-released to the system and and is ready to bring on the next prompt and response and so there's no point in time where that information that you're putting in is is stored to dis and um you know use to to retrain the one exception uh would be that I would call out is maybe our content moderation I think one of the one of the benefits of of kind of having the power of Microsoft behind uh the the various models is around our content moderation we take things very seriously so we we purposely screen uh both prompt and response uh for things like violence or hate speech or sexual content when it's detected it it's actually stripped from the response so you don't get that back but then also it's stored where one of our uh human um individuals can come and review it because it might be there might be a a reason for the language right it might be I I I love to use the example of uh you could go in and ask for the uh the lyrics to A 9-inch nail song and it will it will it will not come back with all the language and the songs but um but it but there might be a reason why you need that language and so so we also put in different places you could bring your own um content moderation um you know obviously don't need to to use ours and we have options to kind of opt out of it you have to go through some some uh uh a process to do that but but things like Health Data pii um you know it's really not acceptable to record anywhere and so we can you can opt out of of those things uh through one of our internal processes one of the things it sounds like early on folks will engage you with is just really understanding uh how privacy security those types of things work and it sounds like that's like the you know seeso seet folks trying to understand if they can even do this like if 
it is uh within bounds given whatever you know PRI privacy framework that they have in place uh beyond that um I'm imagining you know Engineers are running into kind of technical challenges focused on their various use cases when you think about folks trying to implement a particular use case um and trying to get the llm to to do what they want basically um you know is that something that you get a lot of calls about it is I you know I'll paint with a broad broad brushstroke here but um you know we we tend to see uh people tend to are getting better I would say at thinking maybe a little bit further out um you know I think the initial gut response is oh I can create a great chatbot um that that leverages na natural human language uh but it's the companies that are really kind of thinking beyond that and and how are they're going to integrate that or tailor it to their specific business or business processes we see three three main ways that um uh our customers are able to leverage large language models to really drive that business impact uh the first and and it's actually probably the most common is with prompt engineering or prompt tuning prompt chaining um and it's really about um putting in the right level of detail information into the prompt getting a response back uh the the customer getting more and more uh uh in innovative would say on on how they can actually take the outputs of a prompt and actually feed that back into the prompt uh of the next one and that chaining or that tuning ultimately delivers a much more rich and robust answer and we've got some first party tools uh and I'm sure there's you know open source tools as well that are available but things like promp flow which it does a little bit with um kind of like a a graphical um uh flow or workflow tool that can actually you can actually chain together different models so you can pull up hugging face and use the output of a hugging face model as an input to uh 3.5 turbo or you can use llama 2 uh uh and feed it into tp4 so there's there's a there's a whole host that uh of tools and capabilities um that that are available to to the customers when they kind of buy into the to the ecosystem and the other thing that is really interesting in is they have this concept called um prompt variance variance and you can actually use AI to tweak your prompting questions in just tiny different ways that will actually believe it or not produce you know in some cases dramatically different answers and so you can actually go in and fine-tune how the questions are being asked you don't have to think of it the AI can rephrase it for you and so what we're finding is a lot of startups Leverage that tool to make sure that they're getting the most out of their their prompts and I love interesting yeah I love and I love to use the example we we work um you know believe it or not with a lot of edtech companies and um I think of any industry I think if you watched NBC Nightly News or or you know name your News Network you know people were prognosticating the death of college and kids education and now everybody's G to be cheating and and I think what we found is the companies that kind of Lean Into The Challenge and and try to think about how do they embrace the technology versus fight it are really coming out with some some really cool uh examples in use cases and so we we spoke with one edtech and the concept behind it was imagine you are a uh High School history teacher and you know you've got your normal uh you know lesson plan and and you 
know State mandated things that you have to teach your class but imagine if you want to kind of piss it a little bit and maybe look at uh I don't know Abraham Lincoln's early life you can actually leverage these large language models to create an outline for you using promp chainy can feed it back in so you can have it create an outline feed that outline back into the model to generate actual content or course material and then also leverage that output to generate quizzes and answers and ultimately you can feed essays and responses back into the tool and it can grade it based on like a matrix of scoring that you can also feed the model so it's it's just a Really Brave New World and uh I have no doubt even places like education are gonna are going to look drastically different than what we've seen in the past so that was prompt engineering yeah I I forget the name of the company but that sounds a lot like one of the examples that was shown at the open AI Dev day of in the keynote of a a company that was building I forget if it was an assistant or a GPT um I think a GPT uh but the example that they showed was uh teacher going in asking for like a custom lesson plan or something yeah yeah it was it's it's amazing it's amazing what what we're going to be able to do and and ultimately the impact on on everybody from us down to our our kids uh it's it's a really really fun time and this prompt variance is that a a tool that is available it is so we have a product uh part of our cognitive Services it's called it's called Azure prompt flow and and in essence it's a graphical you can chain together different modules you can take the output of one open source model and feed it as an input to a next uh and part of that is there this it's called variant I think it's called prompt variants uh and it will actually allow you to come up with you know maybe if you if you have a prompt you know tell me about Abraham Lincoln's early life you know you can put in that into the prompt variant and then it will generate up to 10 or more you you set the value different ways of asking that question so Abraham Lincoln was important in his early live tell me about it you know so it gives you a couple different or several different ways to ask the same question and then Believe It or Not There are differences in the outputs because it's so non-deterministic uh some of the ways that you can ask those questions and then one of the challenges that I hear about all the time is hey we try all these prompts you it's kind of like Thor spaghetti at a wall the output's different uh when we were doing old school machine learning we built out these experiment management tools so we can kind of track our results and compare you know Au for different hyperparameters etc etc like we kind of just figured that out and now we're doing all this text stuff and evaluation is really hard like are there tools that you uh are aware of or does does I'm imagining that comes up in the the folks that you're with the folks that you're talking with as well it is we we see there there's like a you know there's no lack of of Open Source tools that are popping up everywhere I think Lang Chang is is a popular one that we see a lot of maybe early adopters using um you know we have our own kind of first-party tools and solutions so so Azure ml AI studio is kind of our foray into that space um it tries to take a more graphical workflow um project type of of spin on it um feedback uh from those customers that have adopted have been have been very 
positive uh but I think the when you look at the what I would call like the picks and shovels of AI I think we're really in that early stage um and it's actually where we're probably seeing most of the startups come up uh is creating tools to make it easier the challenge I think Sam to be honest is you want to make sure that your startup isn't one feet away from being irrelevant um and so we are actually starting to see where you know some of the larger linguage models like Azure open or open AI or maybe llama 2 uh or um you know hugging face you know they could actually add a feature such as PDF input and that is actually negating what some of the startups are doing or in innovating on so you just have to be a little bit careful is your tool or product a feature or is it is it actually a product that that has longm longterm legs yeah uh so that's prompting it's prompting the other one is fine-tuning and uh so fine-tuning this is always uh one of the other kind of big misconceptions that we get is you know you have to find tune and I think where that came from is really from early adopters in in this large language model space I think that the prompt sizes were small enough that you couldn't get enough into that prompt response to be as effective as you wanted to be but by and large that's kind of left the that horse has left the barn um if you look at 3.5 turbo and four it's what I going say 16,000 and 32,000 tokens respectively and then coming out of the uh the open AI Dev days this week it was like 128 2 tokens which is which is essentially I want to say it's like a 300 page book and so you know the the ability to to just ingest large amounts of data just using prompting is is GNA likely going to continue to grow the other thing that we talk a little bit about with fine-tuning or or those folks that are interested in it there's there's absolutely a place for it um you know I don't want to I don't want to just dis lead the the audience but it's expensive it's time consuming and it's resource intensive and so you know you really want to work with your solution architect or your your si or your hyperscaler to really understand you know when is the right time to use it and not um we've got one startup that's that's a little bit top of mine because we just got done doing a a um a briefing with and some work with some of my solution Architects but it's a it's a it's a software company that does actually essentially workflow uh and so it creates you can create flowcharts or business charts or business processes and essentially what they they've done is they've taken GPT and they have taught it a their proprietary language so their domain specific language so okay what'll happen is customers will feed in maybe complex descriptions of projects or or processes they'll actually feed it into 3.5 or four to get kind of that rich summary back and then they'll feed it into a fine-tuned model that can then translate that into um their their own domain specific language so you know in the grand schemey thing they basically taught GPT a language that it didn't know existed in the past right and the language that the domain specific language doesn't change a lot so they're able to train it once and not have to you know continually train and update based on on knowledge that that that becomes available it's a pretty static um environment sounds like that's an example of both the chaining that you were talking about earlier and the fine tuning so they use off-the-shelf 35 turbo uh faster cheaper to do some 
manipulation to get the input text into a better format for the downstream task and then they have this Downstream task that they F tuned uh that is uh taking that intermediate description and turning it into a workflow language essentially you're 100% right it's it's it's it's never or very frequently is it just one way of doing things you know I think what we're seeing is this hybrid approach of leveraging the you know the the the power of of GPT uh and and that it comes out of the box but then also bringing in you know maybe um intellectual property or data in a in an external Corpus to to really take it to that next level and I think that's where like that the the third one on the on the way we're seeing things it's called retrieval augmented generation or rag um and you know ultimately and I'm sure I'm sure you've covered it with your with your audience in the past but it you know the prompt goes looks at that external Corpus of information um retrieves it really augments The Prompt and then the the the the generative response comes back and it's it's much more Rich it's much more specific um and we're actually seeing a lot of of companies um leverage rag much more than than the fine tuning um you know so I I'll throw an example out we've got um uh a startup and again we we actually had a a really interesting open hack uh with them just recently but they're a sales enablement platform um and they use rag um they actually have made available all their proprietary marketing and best practices and sales data to ultimately generate um customer sales collateral and they leverage retrieval augmented Generations too that so they can continually add to that Corpus of information and then not need to obviously retrain retrain uh if they were gonna if they had to happen to do fine to we um and so it's able to pull that relevant content and then and then feed it back um so really good example of of meaning in their case their this is an internal use case as opposed to a product they've got their sales and marketing team that has created their own uh collab Al and they want a salesperson to say well I want uh something that talks about this product for healthcare vertical clients and it'll go and pull that information and kind of create something totally and they even they even and so as they were developing that use case they also came and and found another one where as they hire new sales Engineers or or or sales staff you know how do you effectively as their manag or coach them on what a good pitch is versus not and so what they're actually also doing is leveraging like teams meetings or or or getting the transcripts from conference calls and then feeding that transcript in and then kind of grading it or scoring it based on a matrix of what you know how many times they talk about business challenges how many times did they let the customer finish before interrupting and they're actually able to coach and guide the sales team to be much more effective earlier on in their in their career so it's it's it's really cool use cases that we're seeing kind of come out yeah one of my favorite things about talking about llm use cases is for each one you talk about like you come up with five or 10 more that it's czy uh are like just beyond that that would be really cool yeah so we so it's funny because I I mean I could go on all day but so what one of the ones so we we were able to um we had like a private preview of uh our Microsoft 365 co-pilot uh and and so so we had several large Enterprise customers 
take part in that and we you know really took their feedback to heart but one of the interesting one was there was a senior leader at this isv or or independent software vendor that would use the transcript to see was her team being inclusive in their meetings and and and it was a really focused on okay did Bob talk over everybody did Bob talk 80% of the time how did Billy did he speak up you know and so she she was able to leverage it to really provide feedback and coaching to her team in the space of of diversity and inclusion and so it was a really different use case but you can a after you saw it come out and I think every organization as they gain access to these tools are going to find some really interesting ways to to make use of them that maybe you wouldn't initially have thought of right right um so kind of talking about rag or returning to rag uh you know how do you see how do you see folks evolving uh their their rag deployments like or is it a you know I guess my experience with it is that conceptually it seems easy but it it can be harder to get to the desired end product um and not to overload the use of the word tuning it takes a lot of tuning or fiddling with to get it right do you do you see that uh as well or honestly I I haven't it's not not to say that it that it's not you know prevalent in the space um part of the challenge with rag is you know really vectorizing that external Corpus of data and I know that there's some some uh you know third-party Solutions we we we see uh and use pine pine cone quite a bit um Microsoft has you know its version of of U Vector databases and so I I think that what we're probably going to see is people become more fluent and Adept at kind of vectorizing their data that would ultimately kind of speed up or helped improve the performance uh and response and and um Effectiveness I think of that that retrieval of that that you know external Corpus of information but I I haven't seen it be a mass issue at least across these startups and digital Naes and and so that's kind of the things that folks are running into just trying to tailor llm responses to their particular use case or business um uh another area that uh from our last conversation I I remember we talked about was the the whole idea of incorporating llms into like like Core Business Systems and and core business workflows and some challenges that you have run into there uh talk a little bit about that yeah I think um when I see some of the digital natives you know try to weave it into their their core product or service the performance the inference performance can can cause some challenges um I I think that one of the misconceptions that we run into is oh it's a performance issue we just throw more hard workare at it let's let's let's let's get the next big version and it's not always the case I think I think inference is is inefficient and low latency is you know difficult to to obtain I I I love the the there's a there's a statistic that I heard that um you know one single gp4 prompt response if you were just to use a a single Intel processor it could take 39 hours to Crunch through and so it really tells you just the massive amount of compute and and complexity that's happening kind of behind the curtain and so making sure that when you're developing your app or designing how it's going to integrate into your product or Services you really need to make sure you realize that latency is never going to not be there or at least will be there for for the immediate future um there's 
no magic bullet but in but we do find ways that the customers can kind of lean in and and and work through it ear around it uh H happy to talk through those as well if that if that's of Interest yeah you said one thing that you said uh was interesting you said that throwing more Hardware at it wasn't always effective is it ever effective like is that a thing that a customer can do for a Azure open AI GPT model is like is there a knob to scale the the inference Hardware yeah so um let me let me kind of walk through for the audience the three ways that we that that we could Speed Up Performance and I think the last one is absolutely going to hit on what what you talked about there I think I think the first one and and kind of the easiest one in all honesty is is picking the right model um you know jury's out I haven't seen kind of the metrics yet on what got announced at uh at open open AI Dev day earlier this week but prior to that 3.5 turbo was the fastest you were going to get um yeah you know I think that you know when looking at 3.5 your prompt response can be anywhere from three seconds to 15 gp4 it's like 15 to maybe upwards of a minute um but you get much more Rich content back um so figuring out um how to or which model to use I think a lot of the kind of the leading startups are actually using like a pre-processing engine to determine where to funnel uh their workload so so avoid thinking everything's got to go gp4 because it's the biggest and baddest and and actually leveraging different models and even different posted models right you know take a stab at putting it in llama 2 or or hugging face you know each model performs a little bit differently and so really ultimately it's about the customer success what model behaves best for what what that specific need is the other one is around parallelization and so because we have like group put management or or there's only so many tokens per minute that can be pushed through the system and it's a lot I mean it's anywhere from you know 200 to 300,000 you know splitting those workloads up across regions and then doing a bit of like post processing at the end of the process to kind of combine the results back um we're finding a lot of startups leverage that and it's really helped them at least maximize performance from a throughput perspective and then the last one again I think this is kind before we get to the last one is that something that the customer needs to do or like is it two to three 200,000 tokens per user or per account or something like that and they need to do that on their end or is it Microsoft is doing this in the backend kind of thing yeah Microsoft's doing it in the backend and then a lot of times what happens is we will you know essentially sitting down with your solution architect and make sure that you're right side ing your uh capacity so if if you don't need 300,000 tokens per minute we don't necessarily use it um so part of the the part of the secret to success is really sitting down with your Microsoft or your system integrator or your hyperscaler to determine what is that right bandwidth or throughput to get and then actually tailoring it um but it will it will go up to around 300,000 tokens per minute which is a pretty pretty large um throughput um and we we have yet I mean there there are several examples of when companies are pushing that envelope but by and large people have been able to kind of play Within that standard range um the the way that we can actually you know better guarantee consistency and 
performance is through that third one which we call provision throughput units or or ptus okay and so it's it's essentially so when you go out to your chat GPT terminal and type in and and and promp the system something you're out and you're going out and hitting some public endpoint someplace and you're a little bit at the mercy of this Noisy Neighbor so you might be yeah happen to be on the same node as somebody I don't know curing cancer or in a you know some kind of genomic exercise and so one way that you can ensure that um you don't run into that Noisy Neighbor scenario through these PT and essentially I like to tell the customer it's a little a little bit like a reserved instance where you are kind of reserving the capacity on this node just for yourself and then you could buy certain sizes um and you know at this point the costs are more than what a what a single query would be but if you were to actually need to guarantee that performance or know that you're going to need to use a high throughput it's really the best way to go um one of the things that got announced at um at at Deb days was this concept of um you know partial ptus uh and I think you're probably going to see some things announced at ignite as well around maybe you don't need to have that throughput but you definitely need to guarantee the uh the latency and the performance there's a concept of partial pts that uh I would say in the next couple weeks but we're going to learn a lot more about so it's going to really kind of open it up for the masses um in the near future okay and implement a wise is this is it um you know is it just kind of tweaking rate limits on the the Ingress side or are you at the point where you know some backend team is standing up you know new instances of models on you know customer specific Hardware you know or you know in a region based on customer requests or contracts so I let me me make sure I understood your question so so the question is how is the backend infrastructure you know guaranteed to be up and goinging or or meeting specific um needs is that is that the question Sam I mean I guess a little bit more specific to an individual customer you've got these Concepts like ptus and uh you talked about the 200k 300K which sounds like a committed volume or something something like that and I'm envisioning that simply there's a a couple of simple things that I can Envision are happening one is there's just a single you know pool of potential throughput and you know there's you know uh kind of a rate limiter that you're kind of doing out to you know Parts pieces of to each customer and you know but it all goes into the same infrastructure uh which I guess is kind of like a tendency like that's like a single tenant you know instance of the the API or I guess that would be like a multi-tenant instance of the API and then I'm thinking of uh some other model where you know when you have these you know it's more like um you know tenant per customer or you know something like that or multi uh these things are for whatever reason are uh I think you can use the terms both ways but yeah in this case I'm I'm thinking of a multi tenant kind of architecture where you know maybe a customer you know says oh I'm going to do you know 300,000 tokens you know they get their own little Enclave of open AI Azure uh stuff Azure open AI stuff and no one else you know it's theirs like committed capacity in some way and I'm just wondering I neither of those extremes could be actually how it works but I'm just 
kind of wondering how it works on the back end to guarantee performance or to to to maintain performance from One customer to the next yeah and I think too I think a lot of it set with from the from within Microsoft engineering I think that they have observed that you know with 3.5 turbo or four you know that that rate limit of you know 200 to 300K is kind of that sweet spot I I think you're seeing that stance change as time goes on I mean just look at you know the the max tokens that you could input right it went from 16 to 32 and now you you know to 128 so I think the the thresholds are are are Ever Changing based on what we're seeing in the reality at the back end uh I think that what we'll find is as is the infrastructure continues to mature I think you know there's a big chunk of the the back-end systems that are run by maybe a100 gpus from nid Nvidia those are being moved into h100s which you know there's a you know a large multiplier effect on on a performance perspective with the h100s so I think you're going to continue to see that um you're also going to continue to see you know open Ai and and meta and you know you name it anthropic figure out ways to fine-tune and improve just the performance of their models um eventually I think you're gonna you're get to a point where it's an almost an instantaneous response I think we're a little further away from that but um that's probably the ultimate goal is to make it um you know minimize the wait time for you to get uh response back from the from the model yeah yeah uh so what I'm hearing is that or what I'm taking from what I'm hearing is that uh at this point in time you're not seeing you know custom infrastructure build outs for committed performance levels or slas it's just kind of one big pool and Microsoft Engineers are managing it as as best as they can as quickly as it's growing and uh the key mechanism for doing that is kind of API rate limits and committed uh tokens and stuff like that that's allowing them to do performance planning and management and absolutely well stuff like that on the back end okay cool uh you know speaking of all uh of all that um and also tying into uh you know the previous thing with chaining we were talking about and you just mentioned folks were using you know one call to a language model to determine uh which other you know language model uh the which language model to put to send the requests to uh presumably for cost management and perform Performance Management as well um but you know as folks are trying to pull together these applications that are now chaining together multiple indications of llms and using llms to determine you know which route to go uh you know this becomes it can become expensive um exactly do you do you get involved in helping folks think through cost management as well cost management probably is only second to the security and data privacy concerns um you know it's it can be very expensive and if you're not prepared or or go into it with the m right mindset um you know I think that the benefit of of of of leveraging a kind of off-the-shelf model or maybe what we would consider a first-party model such as Azure open AI is you get all the the bells and whistles of that cloud provider's cost management and monitoring services so you can set up things like budgeting or alerting um you can integrate maybe you are a an organization that you know uses data dog or Prometheus you can actually export and and continue to leverage those third party tools um but it's going to it it it 
creates um and I think you're going to see a burgeoning field in in kind of mlops uh that that it staffs or or organizations are really going to have to kind of invest in to make sure that these costs don't don't grow grow grow fast uh the it all comes down to and and I always like try to reframe it with the customer it's not so much cost management it's token management because tokens are kind of the currency that uh that the large language models uses so how can you use as few tokens as you can to get the right response but then also pick the right model where the token price is the lowest and I think what you'll probably see going forward is as new models come out you'll be able to track the the per token cost and then and then be able to kind of weigh the pros and cons of going with one model versus the next one of the big pitfalls uh that we do see some companies jump into is they want the biggest and bass right they they want the you know they want everything to go to gp4 or maybe gp4 turbo uh and when in actuality they could actually get by with one of the earlier models like 35 turbo or even three the use case or the example I like to tell them is that if you were you had a a specific request and that request could be serviced equally between 35 and four if you chose four it would cost almost 10x and it would be 5x less performant had you used 3.5 and so we have to sit down and really talk through what are you trying to accomplish with the prompt and in some cases 100% it makes sense to to use the the larger newer model but maybe not in every scenario um that's where that that concept of pre-processing we're seeing being used so you know with basic heris stics you're able to determine which model to to leverage and and win um but ultimately that's the conversation that that you know as an organization you're having with you know either your internal team or your your partners whether that be Microsoft or or another another company um it takes good architecture um to really manage those costs so leveraging the dashboards that you get from the provider yep uh using the least expensive smallest model possible um pre-processing if possible to determine that only the requests that need the larger models use them yeah we ALS we also saw we also saw Sam actually leveraging um you know taking a a document or some sort of a a corpus of data summarizing it in in maybe 3.5 and then feeding that summary into maybe something that's a little bit more rich and and and has a lot more depth like four and that answer coming back is actually less expensive or uses less tokens than had you fed it all into to to four immediately so yeah get it's a little bit the context of in the in the context of rag or uh just impr prompting impr promp oh just impr prompt impr promps yeah yeah so okay so it's it's a little bit of an art and not necessarily the science um and that's where sometimes these these kind of either the third party or first party tools with prompt variants can can kind of be worth their weight and gold depending on um what you come up with yeah so that just ART versus science kind of conveys that it's a lot of trial and error you know of course you've got the the the um these ideas that we just talked about but uh a big part of it is just uh you know starting with what's easy a big model maybe and just trying to whittle it down where you can um to try to get the cost under control y my my advice in general is to leverage the kind of tried and trude models where you can and and if there's a 
reason why you can't then that's where you really want to kind of start to look at prompting or rag or or or even fine tuning but these models that are out today are very mature very powerful I would always try to explore the most with them before you undertake you know creating your own or or or or getting into the fine-tuning space um you know it's amazing what some some accurate prompting can do when you thought maybe it was only achievable with with fine tun got it got it awesome uh well I I alluded to this earlier uh we spoke a a little bit about kind of Futures and how you see this all evolving and I I've got to say you pretty pretty accurate based on the fact that within two weeks half of it had uh come to pass what what are your predictions now I don't know I think I need to go out and buy some lotto tickets is what I what I think we need to do oh my gosh no you know for for those that weren't part of the original recording we we talked a little bit about um you know not necessarily Quantum step in uh you know knowledge but you know it's going to be more around performance and and finding ways to be more energy efficient um and and we're starting to see that I think you know they they announced uh GPT for Turbo you know again much more performant actually prices went down which is fantastic um they included some more multimodality so now you can can funnel in you know basically a picture book and allow that as input I think you know again this is you know again maybe this is where that lotto ticket comes into play here but I think that eventually we'll probably get to the point where video and like 3D models or objects uh could potentially be input um and then uh yeah I I think there was some really interesting products and platforms announced um and I'm sure there's going to be more announced at ignite the you know the GPT store was a really cool concept coming out of open AI um and you know to be to be determined you know what impact or or where Microsoft's going to have a part in that but um very very cool things coming you can really tell that they uh the open AI team is super passionate about connecting with the developer community and and making others successful awesome awesome well Jay thanks so much for joining me once again uh for uh the conversation and to share a bit about what you have been seeing working with uh startups digital natives uh on uh LM enablement so Sam thank you so much it's been just an absolute pleasure and uh look forward to talk to you soon all righty thanks Jay all right thanks byeall right everyone welcome to another episode of the twiml AI podcast I am your host Sam charington today I'm joined by J Emory director of technical sales and architecture at Microsoft Azure before we get into today's conversation be sure to head over to Apple podcast or your listening platform of choice and if you enjoy the show please leave us a rating and review Jay welcome to the podcast thank you Sam thank you for having me uh I'm excited to talk to you you spend a lot of your time working with uh organizations startups and and fast growing organizations in particular uh that are building around large language models and I'm excited to kind of pick your brain around what are some of the things that you're learning that you're seeing them struggle with that you're helping them with um should be a really interesting conversation especially uh given how quickly this field is moving and uh some of the cool things that people are doing absolutely it buckle up it is it's 
constantly changing uh week by week and sometimes hour by hour it's fantastic uh it is uh and I I guess I'll mention I don't necessarily uh often kind of talk about the things happening behind the the screen or through the the behind the veil or whatever but we there's the second time we're having this conversation we had a technical glitch doesn't happen very often but it it happened last time and we the audio quality wasn't where we like it to be for the the conversations and so we're having this one again a week and a half later maybe and you could almost say the world changed like two days ago three days ago uh when open AI uh had their Dev day uh a lot of the things that we were talking about as maybe possible Futures uh like we actually heard about at uh at at Dev day and we may touch on that uh but before we jump into the conversation uh I'd love to have you uh kind of introduce yourself to our audience share a bit about your background and uh what you do at Microsoft and I think we have St Louis in common you spent a bunch of time there we do we do I uh spent uh almost two decades uh working for anheiser Bush uh in various engineering and it leadership roles and and and funny to note I actually spent like five or six years uh abroad in Europe um doing various projects and and and strategic initiatives so um finally brought me back here to California where I joined Microsoft and have been with Microsoft for almost four years now um like you mentioned earlier uh really partnering with uh digital Natives and startups uh the team really focuses on what we would consider strategic Partnerships with within the startup ecosystem so we partner uh a lot with venture Capital firms and some of their portfolio companies and finding ways that the Microsoft relationship can ultimately grow their total addressable Market or make them more successful by getting them access to some of our our Enterprise customers and are you is uh generative Ai and the the Azure open AI products is that uh you know what takes up all of your time is is that just one of the tools in a a broader tool chest it is it's it's interesting Sam so now it is uh I think that um you know a year ago two years ago a big portion of what was attractive to startups was uh our partnership our ability to kind of reach into the Enterprise customer the the large Fortune 500 Fortune 1000 lately it's been everything AI uh the Advent of the large language model and Azure openai and open AI has really just busted open the doors uh for opportunities for us to find ways to partner together and many times some of the some of the startups that weren't answering our calls are actually calling us directly and so it's been a very refreshing and exciting time to to be be at Microsoft that uh to call the the open AI partnership strategic seems like an understatement often that's like a fluffy thing but it is it is I I will say and and I mean this with With all sincerity I think the the cultural change that Saia brought to Microsoft has just been fantastic and and it's and it's evident even in these partnership with open AI where we were very likely developing our own large language model but he was willing and able to lean in notice when somebody is doing a better job or or or is more maybe ready to go to market much sooner and to to Ink that strategic partnership has just you know oneone wonders I'm sure for open AI but has just really revolutionized Microsoft and ready to take us into the next century and in fact Microsoft historically has been 
very well positioned from a a research perspective um you know for a while many of the best conversations I've had around NLP were Folks at Microsoft that all had this like common root in having worked on Bing yeah um just a lot of expertise there um not to mention you know some of the achievements coming out of the Microsoft research um yeah and I think one of the thing obviously Azure openi gets or open ey gets kind of the Lion Share of the press and attention but there there's even a bigger story around how it ties into maybe our our um cognitive Services portfolio and general so it's truly a One-Stop shop Azure open AI is obviously the kind of the Crown Jewel in it at this point point but we have some really capable uh products and services that can help enhance or or Aid in developing your own large language model or figuring out a ways to to kind of integrate it in with your your existing ID um so try to characterize for me the the kind of maturity level I guess of the companies that you're working with are they all coming to you um you know knowing all about llms and uh you know having built products and trying to you know get some help uh you know optimizing kind of putting the cherry on top or are they uh earlier stage and um uh you know needing help kind of getting started like yeah and I imagine that you know they come in different flavors but like how do you think about uh you know the the folks that you work with and kind of where they are from a maturity and and knowledge perspective around llms excellent question we benefit from getting to work with uh startups and or or what we would consider digital native organizations I think that they have taken a more aggressive and maybe thought forward approach to large language models I think if you look at a traditional Enterprise customers with rare exception most don't have a large language model expert or maybe even an AI expert in staff uh and rely pretty heavily on consulting firms and uh external uh uh service providers to to bring that to Bear uh most if not all of these digital natives have someone on staff that's familiar with it and I think it's just the Clairvoyance of knowing that's where the industry is going and so what we tend to find with startups is they are ready to go feet first in and a lot of times we're super super early adopters we of course have seen some shifts and Trends with uh with large language models probably the past year year and a half uh but but by and large the customers that we interact with are are very thought forward um they're either figuring out how to integrate generative AI into their existing product or platform or maybe they're actually developing their own and need to leverage some of the the GPU capacity that uh We've fortunately been able to to build out and and share to our customers uh so that having been said what do they need you for like what are the challenges that they're running into that uh cause your your proverbial phone to phone line the ring or your teams to start blinging it exactly exactly you know so I we kind of tend to see um you know a lot of times there's this I would say misconception around security or data privacy I think that with any new technology uh the cesos door is you know Bell is ringing off the hook and there's a lot of concern um especially in the startup space IP is their lifeline and there's a lot of concern about IP leaking or or um being leveraged uh by a competitor uh so one of the very first things we do is try to dissmell and dispel any myths 
around their data being used to train new models, because it's simply not the case. I think stories like the Samsung story continue to live on in people's minds.

They do.

They do, and so if you look back at 3.5 Turbo, that model was trained on data up until, I want to say, September 2021, and it was only leveraging openly available data, thus the "open" in OpenAI: things like Wikipedia, or publicly available novels, articles, or books. In essence, the way the models work is that data is put into a prompt, the model does its stuff and brings back a response, and it's always done in memory. After your session terminates, or you end your session, that memory is released to the system and is ready to take on the next prompt and response. So there's no point in time where the information you're putting in is stored to disk or used to retrain.

The one exception I would call out is our content moderation. One of the benefits of having the power of Microsoft behind the various models is content moderation; we take it very seriously, so we purposely screen both prompt and response for things like violence, hate speech, or sexual content. When it's detected, it's stripped from the response, so you don't get that back, but it's also stored where one of our human reviewers can come and review it, because there might be a reason for the language. I love to use the example that you could go in and ask for the lyrics to a Nine Inch Nails song, and it will not come back with all the language in those songs, but there might be a reason why you need that language. You can also bring your own content moderation; you don't need to use ours, and we have options to opt out of it, though you have to go through a process to do that. And things like health data or PII really aren't acceptable to record anywhere, so you can opt out of those things through one of our internal processes.
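To make the bring-your-own moderation option above a little more concrete, here is a minimal sketch that screens a prompt in code before it ever reaches a chat model. It uses OpenAI's public moderation endpoint as a stand-in; the Azure-side content filtering described above is configured at the service level rather than called like this, and the helper name and routing decision here are illustrative assumptions.

```python
# Minimal sketch of a "bring your own" moderation pass. The built-in Azure
# content filtering discussed above is a service-level setting; this stand-in
# calls OpenAI's public moderation endpoint to screen text in application code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text."""
    result = client.moderations.create(input=text)
    return result.results[0].flagged


prompt = "lyrics to a Nine Inch Nails song"
if is_flagged(prompt):
    print("Prompt flagged; route to human review instead of the chat model.")
else:
    print("Prompt passed moderation; safe to send to the chat model.")
```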
So one of the things it sounds like folks engage you with early on is really understanding how privacy, security, and those types of things work, and it sounds like that's the CISO-type folks trying to understand whether they can even do this, whether it's within bounds given whatever privacy framework they have in place. Beyond that, I'm imagining engineers are running into technical challenges focused on their various use cases. When you think about folks trying to implement a particular use case and get the LLM to do what they want, is that something you get a lot of calls about?

It is. I'll paint with a broad brushstroke here, but we tend to see that people are getting better, I would say, at thinking a little further out. The initial gut response is, "Oh, I can create a great chatbot that leverages natural human language," but it's the companies that are really thinking beyond that, about how they're going to integrate or tailor it to their specific business or business processes. We see three main ways that our customers are able to leverage large language models to really drive that business impact.

The first, and probably the most common, is prompt engineering, or prompt tuning and prompt chaining. It's really about putting the right level of detail and information into the prompt and getting a response back, and customers are getting more and more innovative about how they can take the outputs of one prompt and feed them back into the next one. That chaining, or that tuning, ultimately delivers a much richer and more robust answer. We've got some first-party tools, and I'm sure there are open source tools as well, but things like Prompt Flow, which works a little like a graphical flow or workflow tool, can actually chain together different models. So you can pull up Hugging Face and use the output of a Hugging Face model as an input to 3.5 Turbo, or you can use Llama 2 and feed it into GPT-4. There's a whole host of tools and capabilities available to customers when they buy into the ecosystem.

The other thing that is really interesting is a concept called prompt variants. You can actually use AI to tweak your prompting questions in tiny different ways that will, believe it or not, produce in some cases dramatically different answers. So you can go in and fine-tune how the questions are being asked; you don't have to think of it all yourself, the AI can rephrase it for you. What we're finding is that a lot of startups leverage that tool to make sure they're getting the most out of their prompts.

Interesting.

And I love to use this example: we work, believe it or not, with a lot of edtech companies, and of any industry, if you watched NBC Nightly News, or name your news network, people were prognosticating the death of college and of kids' education, and that now everybody's going to be cheating. What we found is that the companies that lean into the challenge, and try to think about how to embrace the technology rather than fight it, are really coming out with some really cool examples and use cases. We spoke with one edtech, and the concept behind it was: imagine you are a high school history teacher, and you've got your normal lesson plan and the state-mandated things you have to teach your class, but imagine you want to pivot a little bit and maybe look at, I don't know, Abraham Lincoln's early life. You can leverage these large language models to create an outline for you; using prompt chaining, you can feed that outline back into the model to generate actual content or course material, then leverage that output to generate quizzes and answers, and ultimately you can feed essays and responses back into the tool and it can grade them based on a scoring matrix that you also feed the model. It's just a brave new world, and I have no doubt that even places like education are going to look drastically different than what we've seen in the past.

So that was prompt engineering. I forget the name of the company, but that sounds a lot like one of the examples shown in the keynote at the OpenAI Dev Day, of a company that was building, I forget whether it was an assistant or a GPT, I think a GPT, where the example they showed was a teacher going in and asking for a custom lesson plan or something.
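As a rough illustration of the prompt-chaining pattern in the lesson-plan example above, here is a minimal sketch in which one call drafts an outline and a second call consumes that outline to produce course material. The model names and prompts are assumptions for illustration; Prompt Flow would express the same chain as a graphical workflow rather than code.

```python
# Minimal sketch of prompt chaining: generate an outline, then feed that
# outline back in as the input to a second call that produces course material.
from openai import OpenAI

client = OpenAI()


def chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


# Step 1: a cheaper model drafts the outline.
outline = chat(
    "gpt-3.5-turbo",
    "Create a five-point lesson outline on Abraham Lincoln's early life.",
)

# Step 2: the outline becomes the input to the next prompt in the chain.
material = chat(
    "gpt-4",
    f"Write short course material and three quiz questions for this outline:\n{outline}",
)

print(material)
```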
Yeah, it's amazing what we're going to be able to do, and ultimately the impact on everybody from us down to our kids. It's a really, really fun time.

And this prompt variants, is that a tool that is available?

It is. We have a product, part of our Cognitive Services, called Azure Prompt Flow, and in essence it's graphical: you can chain together different modules, and you can take the output of one open source model and feed it as an input to the next. Part of that is a feature called prompt variants. If you have a prompt, say "tell me about Abraham Lincoln's early life," you can put it into the prompt variants tool and it will generate up to ten or more (you set the value) different ways of asking that question: "Abraham Lincoln was important in his early life, tell me about it," and so on. It gives you several different ways to ask the same question, and, believe it or not, there are differences in the outputs, because the models are so non-deterministic in how they respond to the ways you ask those questions.

One of the challenges I hear about all the time is: we try all these prompts, and it's kind of like throwing spaghetti at a wall; the output's different every time. When we were doing old-school machine learning, we built out these experiment management tools so we could track our results and compare, say, AUC across different hyperparameters. We had kind of figured that out, and now we're doing all this text work and evaluation is really hard. Are there tools that you're aware of? I imagine that comes up with the folks you're talking with as well.

It does. There's no lack of open source tools popping up everywhere; I think LangChain is a popular one that we see a lot of early adopters using. We have our own first-party tools and solutions; Azure ML AI Studio is kind of our foray into that space, and it takes a more graphical, workflow, project type of spin on it. Feedback from the customers that have adopted it has been very positive. But when you look at what I would call the picks and shovels of AI, I think we're really in that early stage, and it's actually where we're probably seeing most of the startups come up: creating tools to make this easier. The challenge, Sam, to be honest, is that you want to make sure your startup isn't one feature away from being irrelevant. We're actually starting to see cases where some of the larger language model providers, like Azure OpenAI or OpenAI, or maybe Llama 2 or Hugging Face, could add a feature such as PDF input, and that negates what some of the startups are doing or innovating on. So you just have to be a little careful: is your tool or product a feature, or is it actually a product that has long-term legs?

Yeah. So that's prompting.
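Here is a rough, hand-rolled sketch of the prompt-variants idea just described, generating a few rephrasings of a base prompt and running each one so the outputs can be compared side by side. It is a stand-in for what the Prompt Flow variants feature does in the product, not the product itself, and the prompts and model name are assumptions.

```python
# Rough sketch of prompt variants: ask a model to rephrase a base prompt a few
# different ways, run each variant, and collect the outputs for comparison.
from openai import OpenAI

client = OpenAI()


def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


base = "Tell me about Abraham Lincoln's early life."
rephrasings = chat(
    f"Rewrite the following question in 3 different ways, one per line:\n{base}"
).splitlines()

# Run the base prompt plus each variant and keep the answers for review.
results = {p: chat(p) for p in [base, *rephrasings] if p.strip()}
for prompt, answer in results.items():
    print(prompt, "->", answer[:80], "...")
```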
That's prompting. The other one is fine-tuning, and fine-tuning is always one of the other big misconceptions we get: that you have to fine-tune. I think where that came from is really the early adopters in this large language model space; the prompt sizes were small enough that you couldn't get enough into that prompt and response to be as effective as you wanted to be. But by and large, that horse has left the barn. If you look at 3.5 Turbo and 4, it's, what, 16,000 and 32,000 tokens respectively, and coming out of the OpenAI Dev Day this week it was 128,000 tokens, which is essentially, I want to say, about a 300-page book. So the ability to ingest large amounts of data just using prompting is likely going to continue to grow.

The other thing we talk about with fine-tuning, for those folks that are interested in it: there's absolutely a place for it, and I don't want to mislead the audience, but it's expensive, it's time consuming, and it's resource intensive. So you really want to work with your solution architect, or your SI, or your hyperscaler to understand when the right time to use it is and when it's not. We've got one startup that's a little bit top of mind because we just finished a briefing and some work with some of my solution architects. It's a software company that essentially does workflow, so you can create flowcharts or business processes, and what they've done is they've taken GPT and taught it their proprietary language, their domain-specific language. Customers will feed in complex descriptions of projects or processes, they'll feed those into 3.5 or 4 to get that rich summary back, and then they'll feed the summary into a fine-tuned model that can translate it into their own domain-specific language. So in the grand scheme of things, they basically taught GPT a language it didn't know existed. And that domain-specific language doesn't change a lot, so they're able to train it once and not have to continually retrain and update as new knowledge becomes available; it's a pretty static environment.

It sounds like that's an example of both the chaining you were talking about earlier and the fine-tuning: they use off-the-shelf 3.5 Turbo, faster and cheaper, to do some manipulation to get the input text into a better format for the downstream task, and then they have this downstream task, which they fine-tuned, that takes that intermediate description and turns it into a workflow language, essentially.

You're 100% right. It's never, or only very rarely, just one way of doing things. What we're seeing is this hybrid approach of leveraging the power of GPT as it comes out of the box, but then also bringing in intellectual property, or data in an external corpus, to really take it to that next level.
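A minimal sketch of that two-stage hybrid pipeline follows: an off-the-shelf model condenses a free-form process description, and a fine-tuned model then translates the summary into the proprietary workflow language. The fine-tuned model ID, prompts, and example input are placeholders; the real ID would come from your own fine-tuning job.

```python
# Minimal sketch of the hybrid pipeline above: summarize with an off-the-shelf
# model, then hand the summary to a fine-tuned model that emits a workflow DSL.
# The fine-tuned model ID below is a placeholder for one returned by a real
# fine-tuning job.
from openai import OpenAI

client = OpenAI()
FINE_TUNED_DSL_MODEL = "ft:gpt-3.5-turbo:your-org:workflow-dsl:placeholder"


def summarize(description: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Summarize this business process in concise steps:\n{description}",
        }],
    )
    return resp.choices[0].message.content


def to_dsl(summary: str) -> str:
    resp = client.chat.completions.create(
        model=FINE_TUNED_DSL_MODEL,
        messages=[{
            "role": "user",
            "content": f"Translate these steps into the workflow DSL:\n{summary}",
        }],
    )
    return resp.choices[0].message.content


description = "When an order arrives, check inventory; if in stock, ship it, otherwise reorder."
print(to_dsl(summarize(description)))
```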
And I think that's where the third one comes in, in terms of the ways we're seeing these models used: retrieval augmented generation, or RAG. Ultimately, and I'm sure you've covered it with your audience in the past, the prompt goes and looks at that external corpus of information, retrieves it, and augments the prompt, and then the generative response that comes back is much more rich, much more specific. We're actually seeing a lot of companies leverage RAG much more than fine-tuning. I'll throw an example out: we've got a startup, and we actually had a really interesting open hack with them just recently, that is a sales enablement platform, and they use RAG. They've made available all of their proprietary marketing, best practices, and sales data to ultimately generate customer sales collateral, and they leverage retrieval augmented generation so they can continually add to that corpus of information and not need to retrain, as they would have had to if they had gone the fine-tuning route. It's able to pull that relevant content and feed it back. A really good example.

Meaning, in their case, this is an internal use case as opposed to a product: their sales and marketing team has created their own collateral, and they want a salesperson to be able to say, "I want something that talks about this product for healthcare vertical clients," and it'll go and pull that information and create something.

Totally. And as they were developing that use case, they also found another one: as they hire new sales engineers or sales staff, how do you, as their manager, effectively coach them on what a good pitch is versus not? So what they're also doing is leveraging Teams meetings, or getting the transcripts from conference calls, feeding a transcript in, and then grading or scoring it against a matrix: how many times did they talk about business challenges, how many times did they let the customer finish before interrupting? They're able to coach and guide the sales team to be much more effective earlier in their careers. So there are really cool use cases coming out.

Yeah, one of my favorite things about talking about LLM use cases is that for each one you talk about, you come up with five or ten more, just beyond it, that would be really cool.

It's funny, because I could go on all day. One of the ones: we had a private preview of our Microsoft 365 Copilot, and we had several large enterprise customers take part in that, and we really took their feedback to heart. One of the interesting ones was a senior leader at an ISV, an independent software vendor, who would use the transcript to see whether her team was being inclusive in their meetings. It was really focused on: did Bob talk over everybody, did Bob talk 80% of the time, did Billy speak up? She was able to leverage it to provide feedback and coaching to her team in the space of diversity and inclusion. It was a really different use case, but once you see one come out, I think every organization, as they gain access to these tools, is going to find some really interesting ways to make use of them that maybe they wouldn't initially have thought of.

Right, right.
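For readers who want the retrieve-augment-generate loop described above in code form, here is a compact sketch: embed a tiny corpus, retrieve the closest document for a question, and stuff it into the prompt. In practice the vectors would live in a vector store such as Pinecone or Azure's vector search rather than an in-memory list, and the corpus, model names, and question are illustrative.

```python
# Compact sketch of retrieval augmented generation: embed a tiny in-memory
# corpus, retrieve the closest document for the user's question, and augment
# the prompt with it before generating.
import numpy as np
from openai import OpenAI

client = OpenAI()

corpus = [
    "Product X data sheet for healthcare customers: HIPAA-ready deployment options.",
    "Sales best practice: lead with the customer's business challenge, not features.",
]


def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)


doc_vectors = [embed(doc) for doc in corpus]


def answer(question: str) -> str:
    q = embed(question)
    scores = [
        float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        for d in doc_vectors
    ]
    context = corpus[int(np.argmax(scores))]  # retrieve the best match
    prompt = f"Using this context:\n{context}\n\nAnswer: {question}"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


print(answer("Draft a one-paragraph pitch of Product X for a healthcare client."))
```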
So, returning to RAG: how do you see folks evolving their RAG deployments? I guess my experience with it is that conceptually it seems easy, but it can be harder to get to the desired end product, and, not to overload the use of the word tuning, it takes a lot of tuning, or fiddling, to get it right. Do you see that as well?

Honestly, I haven't, which is not to say it's not prevalent in the space. Part of the challenge with RAG is really vectorizing that external corpus of data, and I know there are third-party solutions; we see and use Pinecone quite a bit, and Microsoft has its own version of vector databases. I think what we're probably going to see is people becoming more fluent and adept at vectorizing their data, and that would ultimately speed up or help improve the performance, the responsiveness, and the effectiveness of the retrieval of that external corpus of information. But I haven't seen it be a massive issue, at least across these startups and digital natives.

So those are the kinds of things folks are running into just trying to tailor LLM responses to their particular use case or business. Another area I remember we talked about in our last conversation was the whole idea of incorporating LLMs into core business systems and core business workflows, and some challenges you've run into there. Talk a little bit about that.

Yeah. When I see some of the digital natives try to weave it into their core product or service, the performance, the inference performance, can cause some challenges. One of the misconceptions we run into is, "Oh, it's a performance issue, we'll just throw more hardware at it, let's get the next big version," and that's not always the case. Inference is inefficient, and low latency is difficult to obtain. There's a statistic I heard that a single GPT-4 prompt and response, if you were to use just a single Intel processor, could take 39 hours to crunch through. That really tells you the massive amount of compute and complexity happening behind the curtain. So when you're developing your app, or designing how it's going to integrate into your product or services, you really need to realize that latency is never going to not be there, or at least it will be there for the immediate future. There's no magic bullet, but we do find ways that customers can lean in and work through it or around it. I'm happy to talk through those as well, if that's of interest.

Yeah. One thing you said was interesting: you said that throwing more hardware at it wasn't always effective. Is it ever effective? Is that something a customer can do for an Azure OpenAI GPT model? Is there a knob to scale the inference hardware?

So let me walk through, for the audience, the three ways we can speed up performance, and I think the last one is absolutely going to hit on what you talked about there. The first one, and kind of the easiest one in all honesty, is picking the right model. The jury's still out, and I haven't seen the metrics yet on what got announced at OpenAI Dev Day earlier this week, but prior to that, 3.5 Turbo was the fastest you were going to get.
When looking at 3.5, your prompt response can be anywhere from three seconds to fifteen; with GPT-4 it's more like fifteen seconds to maybe upwards of a minute, but you get much richer content back. So it's about figuring out which model to use, and I think a lot of the leading startups are actually using a pre-processing engine to determine where to funnel their workload. Avoid thinking everything has to go to GPT-4 because it's the biggest and baddest; actually leverage different models, and even different hosted models. Take a stab at putting it into Llama 2 or a Hugging Face model; each model performs a little differently, so ultimately it's about customer success and which model behaves best for that specific need.

The second one is around parallelization. Because we have throughput management, there are only so many tokens per minute that can be pushed through the system, and it's a lot, anywhere from 200,000 to 300,000. Splitting those workloads up across regions, and then doing a bit of post-processing at the end to combine the results back together, is something we're finding a lot of startups leverage, and it's really helped them maximize performance from a throughput perspective. And then the last one...

Before we get to the last one: is that something the customer needs to do? Is it 200,000 to 300,000 tokens per user, or per account, or something like that, and they need to handle that on their end, or is Microsoft doing this on the backend?

Microsoft's doing it on the backend, and a lot of times what happens is that you're essentially sitting down with your solution architect and making sure you're right-sizing your capacity, so if you don't need 300,000 tokens per minute, we don't necessarily use it. Part of the secret to success is really sitting down with Microsoft, or your system integrator, or your hyperscaler to determine what the right bandwidth or throughput is, and then actually tailoring it. But it will go up to around 300,000 tokens per minute, which is a pretty large throughput. There are several examples of companies pushing that envelope, but by and large people have been able to play within that standard range.

The way we can actually better guarantee consistency and performance is through that third one, which we call provisioned throughput units, or PTUs. Essentially, when you go out to your ChatGPT terminal, type something in, and prompt the system, you're going out and hitting some public endpoint someplace, and you're a little bit at the mercy of the noisy neighbor: you might happen to be on the same node as somebody, I don't know, curing cancer or running some kind of genomics exercise. One way to ensure you don't run into that noisy neighbor scenario is through these PTUs. I like to tell customers it's a little bit like a reserved instance, where you're reserving the capacity on that node just for yourself, and you can buy certain sizes. At this point the costs are more than what a single query would be, but if you actually need to guarantee that performance, or you know you're going to need high throughput, it's really the best way to go.
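Circling back to the second lever above, here is a rough sketch of splitting a batch of work across two Azure OpenAI regional deployments and combining the partial results afterward. The endpoint URLs, deployment name, API version, environment variable names, and the even round-robin split are all assumptions for illustration, not a prescription of how any particular customer does it.

```python
# Rough sketch of spreading a batch of requests across two regional Azure
# OpenAI deployments and merging the results. Endpoint URLs, deployment name,
# API version, and the round-robin split below are illustrative assumptions.
import os
from concurrent.futures import ThreadPoolExecutor
from openai import AzureOpenAI

regions = [
    AzureOpenAI(azure_endpoint="https://eastus-example.openai.azure.com",
                api_key=os.environ["EASTUS_KEY"], api_version="2023-07-01-preview"),
    AzureOpenAI(azure_endpoint="https://westeurope-example.openai.azure.com",
                api_key=os.environ["WESTEU_KEY"], api_version="2023-07-01-preview"),
]


def summarize(client: AzureOpenAI, chunk: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-35-turbo",  # Azure deployment name, assumed here
        messages=[{"role": "user", "content": f"Summarize:\n{chunk}"}],
    )
    return resp.choices[0].message.content


chunks = ["part one of a long document ...", "part two of a long document ..."]

# Round-robin the chunks across regions, then combine the partial results.
with ThreadPoolExecutor() as pool:
    assigned = [regions[i % len(regions)] for i in range(len(chunks))]
    partials = list(pool.map(summarize, assigned, chunks))

print("\n".join(partials))
```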
One of the things that got announced at Dev Day was this concept of partial PTUs, and I think you're probably going to see some things announced at Ignite as well around the idea that maybe you don't need that full throughput, but you definitely need to guarantee the latency and the performance. There's a concept of partial PTUs that, I would say over the next couple of weeks, we're going to learn a lot more about, and it's really going to open this up for the masses in the near future.

Okay. And implementation-wise, is it just a matter of tweaking rate limits on the ingress side, or are you at the point where some backend team is standing up new instances of models on customer-specific hardware, or in a particular region, based on customer requests or contracts?

Let me make sure I understood your question. The question is how the backend infrastructure is guaranteed to be up and running, or to be meeting specific needs. Is that the question, Sam?

I guess a little more specific to an individual customer. You've got these concepts like PTUs, and you talked about the 200K to 300K, which sounds like a committed volume or something like that. I can envision a couple of simple ways this could work. One is that there's just a single pool of potential throughput and a rate limiter doling out pieces of it to each customer, but it all goes into the same infrastructure, which I guess would be a multi-tenant instance of the API. The other is more like a tenant per customer; I think you can use the terms both ways, but in this case I'm thinking of an architecture where a customer says, "Okay, I'm going to do 300,000 tokens," and they get their own little enclave of Azure OpenAI infrastructure that no one else touches: it's theirs, committed capacity in some way. Neither of those extremes may be exactly how it works, but I'm wondering how it works on the backend to guarantee, or maintain, performance from one customer to the next.

Yeah, and I think a lot of it was set from within Microsoft engineering. I think they have observed that, with 3.5 Turbo or 4, that rate limit of 200 to 300K is kind of the sweet spot, and I think you're seeing that stance change as time goes on. Just look at the max tokens you can input: it went from 16 to 32 and now to 128K. So the thresholds are ever-changing based on what we're seeing in reality on the backend. What we'll find is that the infrastructure continues to mature; a big chunk of the backend systems run on, say, A100 GPUs from Nvidia, and those are being moved to H100s, which have a large multiplier effect from a performance perspective.
So I think you're going to continue to see that, and you're also going to continue to see OpenAI, and Meta, and, you name it, Anthropic figure out ways to fine-tune and improve the performance of their models. Eventually I think you'll get to a point where the response is almost instantaneous. I think we're a little further away from that, but that's probably the ultimate goal: to minimize the wait time for you to get a response back from the model.

Yeah. So what I'm taking from what I'm hearing is that, at this point in time, you're not seeing custom infrastructure build-outs for committed performance levels or SLAs; it's kind of one big pool, Microsoft engineers are managing it as best they can, as quickly as it's growing, and the key mechanism for doing that is API rate limits and committed tokens and things like that, which lets them do performance planning and management on the backend.

Absolutely.

Okay, cool. Speaking of all of that, and also tying back into the chaining we were talking about: you just mentioned folks using one call to a language model to determine which language model to send a request to, presumably for cost management and performance management as well. As folks pull together these applications that chain together multiple invocations of LLMs, and use LLMs to decide which route to go, this can become expensive. Do you get involved in helping folks think through cost management as well?

Cost management is probably second only to the security and data privacy concerns. It can be very expensive if you're not prepared or don't go into it with the right mindset. The benefit of leveraging an off-the-shelf model, or what we would consider a first-party model such as Azure OpenAI, is that you get all the bells and whistles of that cloud provider's cost management and monitoring services, so you can set up things like budgeting and alerting. Maybe you're an organization that uses Datadog or Prometheus; you can export and continue to leverage those third-party tools. But I think you're going to see a burgeoning field in MLOps that IT staffs and organizations are really going to have to invest in to make sure these costs don't grow too fast. It all comes down to, and I always try to reframe it with the customer: it's not so much cost management as token management, because tokens are the currency that the large language models use. How can you use as few tokens as possible to get the right response, and also pick the right model where the token price is the lowest? I think what you'll see going forward is that as new models come out, you'll be able to track the per-token cost and weigh the pros and cons of going with one model versus the next. One of the big pitfalls we do see some companies jump into is that they want the biggest and baddest: they want everything to go to GPT-4, or maybe GPT-4 Turbo, when in actuality they could get by with one of the earlier models like 3.5 Turbo or even 3.
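To make the token-management framing and the pre-processing idea a bit more tangible, here is a minimal sketch that counts tokens with tiktoken, applies a crude heuristic to decide whether a request really needs the larger model, and estimates the prompt cost. The routing rule and the per-1K-token prices are placeholders for illustration, not published pricing or anyone's production logic.

```python
# Minimal sketch of token management plus a pre-processing heuristic: count
# tokens, route simple requests to the cheaper model, and estimate cost.
# The per-1K-token prices below are placeholders, not published pricing.
import tiktoken

PRICE_PER_1K = {"gpt-3.5-turbo": 0.002, "gpt-4": 0.06}  # placeholder rates


def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))


def route(prompt: str) -> str:
    """Crude heuristic: long or multi-step prompts go to the bigger model."""
    needs_depth = count_tokens(prompt) > 1500 or "step by step" in prompt.lower()
    return "gpt-4" if needs_depth else "gpt-3.5-turbo"


prompt = "Summarize this meeting transcript in three bullet points: ..."
model = route(prompt)
estimated_cost = count_tokens(prompt, model) / 1000 * PRICE_PER_1K[model]
print(f"Routing to {model}, estimated prompt cost ${estimated_cost:.4f}")
```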
The example I like to give them is that if you had a specific request, and that request could be serviced equally well by 3.5 and 4, choosing 4 would cost almost 10x and be 5x less performant than had you used 3.5. So we have to sit down and really talk through what you're trying to accomplish with the prompt. In some cases it 100% makes sense to use the larger, newer model, but maybe not in every scenario, and that's where the concept of pre-processing we're seeing being used comes in: with basic heuristics, you're able to determine which model to leverage and when. Ultimately, that's the conversation that you, as an organization, are having with either your internal team or your partners, whether that be Microsoft or another company. It takes good architecture to really manage those costs: leveraging the dashboards you get from the provider, using the least expensive, smallest model possible, and pre-processing where possible so that only the requests that need the larger models use them. We also saw, Sam, people taking a document, or some corpus of data, summarizing it in maybe 3.5, and then feeding that summary into something richer with a lot more depth, like 4, and the answer coming back is actually less expensive, or uses fewer tokens, than had you fed it all into 4 immediately.

Got it. Is that in the context of RAG, or just in prompting?

Oh, just in prompting. So it's a little bit of an art and not necessarily a science, and that's where sometimes these third-party or first-party tools with prompt variants can be worth their weight in gold, depending on what you come up with.

Yeah. So that art-versus-science framing conveys that it's a lot of trial and error. Of course you've got the ideas we just talked about, but a big part of it is just starting with what's easy, a big model maybe, and trying to whittle it down where you can to get the cost under control.

Yep. My advice in general is to leverage the tried-and-true models where you can, and if there's a reason why you can't, that's where you really want to start to look at prompting, or RAG, or even fine-tuning. But the models that are out today are very mature and very powerful, and I would always try to explore the most with them before you undertake creating your own or getting into the fine-tuning space. It's amazing what some accurate prompting can do when you thought maybe it was only achievable with fine-tuning.

Got it, got it. Awesome. Well, I alluded to this earlier: we spoke a little bit about futures and how you see this all evolving, and I've got to say you were pretty accurate, given that within two weeks half of it had come to pass. What are your predictions now?

I don't know; I think I need to go out and buy some lottery tickets is what we need to do.

Oh my gosh.

No, you know, for those that weren't part of the original recording, we talked a little bit about how it's not necessarily going to be a quantum
step in knowledge, but more around performance and finding ways to be more energy efficient, and we're starting to see that. They announced GPT-4 Turbo, which is again much more performant, and prices actually went down, which is fantastic. They included more multimodality, so now you can funnel in basically a picture book and allow that as input. And, again, maybe this is where that lottery ticket comes into play, but I think we'll eventually get to the point where video, and things like 3D models or objects, could potentially be input. There were also some really interesting products and platforms announced, and I'm sure there will be more announced at Ignite. The GPT Store was a really cool concept coming out of OpenAI, and it's to be determined what impact it will have, or where Microsoft is going to have a part in that. But very, very cool things are coming, and you can really tell that the OpenAI team is super passionate about connecting with the developer community and making others successful.

Awesome, awesome. Well, Jay, thanks so much for joining me once again for the conversation, and for sharing a bit about what you've been seeing working with startups and digital natives on LLM enablement.

Sam, thank you so much. It's been an absolute pleasure, and I look forward to talking with you soon.

All righty. Thanks, Jay.

All right, thanks. Bye.