The New Stack and Ops for AI

**Creating a Custom Fine-Tuned Version of 3.5 Turbo**

While fine-tuning GPT-3.5 Turbo like this is great, there's often a huge activation energy associated with doing it: generating a curated dataset for your use case can be quite expensive. Like I mentioned, you might need hundreds, thousands, or sometimes even tens of thousands of examples for your specific narrow domain. Oftentimes, you'll have to create these manually yourself or hire contractors to do it.

However, one really cool method that we've seen a lot of customers adopt is using GPT-4 to create the training dataset for fine-tuning 3.5 Turbo. This starts to look very similar to what Shyamal just mentioned around evals: GPT-4 is at an intelligence level where you can give it a bunch of prompts, and the outputs it produces can serve directly as your training set, with no manual human intervention. What you're effectively doing is distilling the outputs from GPT-4 and feeding them into 3.5 Turbo so it can learn.
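
To make this distillation workflow concrete, here's a minimal sketch using the openai Python SDK (v1.x). The system prompt, example prompts, and file names are hypothetical placeholders, and in practice you'd spot-check a sample of GPT-4's outputs and tune the prompt list before submitting the fine-tuning job.

```python
# Minimal sketch: distilling GPT-4 outputs into a fine-tuning dataset for GPT-3.5 Turbo.
# Assumes the openai Python SDK (v1.x); the prompts and system message are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are a concise customer-support assistant for Acme Inc."  # hypothetical
prompts = [
    "How do I delete my account?",
    "Can I export my billing history?",
    # ... hundreds or thousands of domain-specific prompts
]

# 1. Use GPT-4 to generate the target completions (no human labeling needed).
with open("distilled_training_set.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
        )
        answer = response.choices[0].message.content
        # 2. Write each example in the chat fine-tuning JSONL format.
        f.write(json.dumps({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]
        }) + "\n")

# 3. Upload the dataset and start a GPT-3.5 Turbo fine-tuning job.
training_file = client.files.create(
    file=open("distilled_training_set.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-3.5-turbo"
)
print("Fine-tuning job started:", job.id)
```

Once the job succeeds, the resulting fine-tuned model ID can be dropped into your existing chat completion calls in place of GPT-4 for this narrow use case.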

Oftentimes, what this does is help the fine-tuned version of 3.5 Turbo be almost as good as GPT-4 within your specific narrow domain. If you do take the effort to do all of this, the dividends you get down the line are quite significant, not only from a latency perspective, because GPT-3.5 Turbo is obviously a lot faster, but also from a cost perspective. To illustrate this a little more concretely, if you look at the table, even after today's GPT-4 price drops, a fine-tuned version of 3.5 Turbo is still 70% to 80% cheaper than GPT-4. While it's not as cheap as vanilla 3.5 Turbo, switching from GPT-4 to fine-tuned 3.5 Turbo will still save you a lot of cost.

**Recap: Building with Large Language Models**

We talked about a framework that can help you navigate the unique considerations and challenges that come with scaling applications built with our models from prototype to production. We discussed how to build a useful, delightful, and human-centric user experience by controlling for uncertainty and adding guardrails. Then we talked about how to deliver that experience consistently by grounding the model and using some of the model-level features.

Then we talked about delivering that experience consistently, without regressions, by implementing evaluations. Finally, we talked about the considerations that come with scale: managing latency and cost. As we've seen, building with our models increases the surface area of what's possible, but it also increases the footprint of challenges. All of the strategies we talked about, including the orchestration part of the stack, have been converging into a new discipline called LLM Ops, or Large Language Model Operations.

**LLM Ops: The Practice, Tooling, and Infrastructure for Operational Management**

Just as DevOps emerged in the early 2000s to streamline the software development process, LLM Ops has recently emerged in response to the unique challenges posed by building applications with LLMs, and it has become a core component of many enterprise architectures and stacks. You can think of LLM Ops as the practice, tooling, and infrastructure required for the operational management of LLMs end-to-end.

It's a vast and evolving field, and we're still scratching the surface. While we won't go into details, here's a preview of what this could look like. LLM Ops capabilities help address challenges like monitoring, optimizing performance, maintaining security and compliance, managing your data and embeddings, increasing development velocity, and accelerating reliable testing and evaluation at scale.

Here, observability and tracing become especially important: they help you identify and debug failures in your prompt chains and assistants, handle issues in production faster, and make collaboration between different teams easier. Gateways, for example, simplify integrations and help with centralized management of security, API keys, and so on. LLM Ops enables scaling to thousands of applications and millions of users, and with the right foundations, organizations can really accelerate their adoption.
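
To illustrate the kind of observability hook an LLM Ops layer provides, here's a minimal sketch of a tracing wrapper around chat completion calls, assuming the openai Python SDK. The wrapper name, trace fields, and local log file are hypothetical; a real setup would route these records through a gateway or into an observability backend.

```python
# Illustrative sketch only: a thin tracing wrapper around chat completion calls.
# The function, trace fields, and log destination are hypothetical, not a vendor API.
import json
import time
import uuid
from openai import OpenAI

client = OpenAI()

def traced_chat_completion(messages, model="gpt-3.5-turbo", **kwargs):
    """Call the chat completions API and record a trace of the request and response."""
    trace_id = str(uuid.uuid4())
    start = time.time()
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    record = {
        "trace_id": trace_id,
        "model": model,
        "latency_ms": round((time.time() - start) * 1000),
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "system_fingerprint": response.system_fingerprint,
        "messages": messages,
        "output": response.choices[0].message.content,
    }
    # In production this would go to your observability backend, not a local file.
    with open("llm_traces.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return response

reply = traced_chat_completion([{"role": "user", "content": "When was GPT-4 released?"}])
print(reply.choices[0].message.content)
```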

Rather than building one-off tools, the focus should be on developing long-term platforms and expertise. Just like this young explorer standing at the threshold, we have a wide field of opportunity in front of us to build the infrastructure and primitives that stretch beyond the framework we talked about today.

We're really excited to help you build the next-generation assistants and ecosystem for generations to come. There's so much to build and discover, and we can only do it together. Thank you.

"WEBVTTKind: captionsLanguage: en-Hi, everyone.Welcome to the new Stackand Ops for AI,going from prototypeto production.My name is Sherwin,and I lead the Engineering teamfor the OpenAI Developer Platform,the team that builds and maintainsthe APIs that over 2 million developers,including hopefully many of you,have used to build products on top of our models.-I'm Shyamal,I'm part of the Applied team where I've workedwith hundreds of startupsand enterprisesto help them build great productsand experiences on our platform.-Today, we're really excited to talkto you all about the processof taking your applicationsand bringing themfrom the prototype stage into production.First, I wanted to put thingsinto perspective for a little bit.While it might seemlike it's been a very long timesince ChatGPThas entered our livesand transformed the world,it actually hasn't even been a full calendar yearsince it was launched.ChatGPT was actually launchedin late November 2022,and it hasn't even beena full 12 months yet.Similarly, GPT-4was only launched in March 2023,and it hasn't even been eight monthssince people have experiencedour flagship modeland tried to use it into their products.In this time, GPT has gonefrom being a toy for usto play around with and share on social mediainto a tool for usto use in our day-to-day livesand our workplacesinto now a capability that enterprises,startups, and developerseverywhere are tryingto bake into their own products.Oftentimes, the first stepis to build a prototype.As many of you probably know,it's quite simple and easyto set up a really cool prototypeusing one of our models.It's really cool to come up with a demoand show it to all of our friends.However, oftentimes there's a really big gapin going from there into production,and oftentimes it's hardto get things into production.A large part of this is dueto the nondeterministic nature of these models.Scaling non-deterministic appsfrom prototypeinto production can oftentimesfeel quite difficultwithout a guiding framework.Oftentimes,you might feel something like thiswhere you have a lot of toolsout there for you to use.The field is moving very quickly.There's a lot of different possibilities,but you don't really knowwhere to go and what to start with.For this talk, we wantedto give you all a framework to useto help guide you moving your appfrom prototype into production.This framework we wanted to provide to youis in the form of a stack diagramthat is influencedby a lot of the challengesthat our customers have brought to usin scaling their apps.We'll be talkingabout how to build a delightfuluser experience on top of these LLMs.We'll be talking about handlingmodel inconsistencyvia grounding the modelwith Knowledge Store and Tools.We'll be talking about howto iterate on your applicationsin confidence using Evaluations.Finally, we'll be talking abouthow to manage scalefor your applications and thinking about costand latency using orchestration.For each one of these,we'll be talking abouta couple of strategiesthat hopefully you all can bring backand use in your own different products.Oftentimes first,we just have a simple prototype.At this point,there isn't a whole stack like what I just showed.There's usually justa very simple setup herewhere you have your applicationand it's talking directly with their API.While this works great initially,very quickly you'll realize that it's not enough.Shyamal: Let's talk aboutthe first layer of this framework.Technology is as usefulas the user experience surrounding it.While the goalis to 
build a trustworthy, defensive,and delightful user experience,AI-assisted copilotsand assistants present a different setof human-computer interactionand UX challenges.The unique considerationsof scaling applications builtwith our modelsmakes it even more importantto drive betterand safe outcomes for users.We're going to talkabout two strategies hereto navigate someof the challenges that comewith building apps on top of our models,which are inherently probabilistic in nature.Controlling for uncertaintyand building guardrailsfor steerability and safety.Controllingfor uncertainty refers to proactivelyoptimizing the user experienceby managing how the model interactsand responds to the users.Until now, a lot of productshave been deterministicwhere interactions can happenin repeatable and precise ways.This has been challenging with the shifttowards building language user interfaces.It has become importantto design for human centricityby having the AI enhanceand augment human capabilitiesrather than replacing human judgment.When designing ChatGPT,for example, we baked in a few UX elementsto help guide the users and controlfor this inherent uncertaintythat comes with building appspowered by models.The first one,depending on the use case, the first strategy hereis to keep human in the loop and understandthat the first artifact createdwith generative AImight not bethe final artifact that the user wants.Giving the usersan opportunity to iterateand improve the qualityover time is importantfor navigating uncertaintyand building a robust UX.The feedback controls,on the other hand,also provide affordances for fixing mistakesand are useful signalsto build a solid data flywheel.Another important aspectof building transparent UXis to communicate the system's capabilitiesand limitations to the users.The user can understandwhat the AI can or cannot do.You can take this further by explainingto the user how the AIcan make mistakes.In ChatGPT's case,this takes the form of an AI notice at the bottom.This sets the right expectationswith the user.Finally, a well-designed user interfacecan guide user interactionwith AI to get the most helpfuland safer responsesand the best out of the interaction.This can take the formof suggestive prompts in ChatGPT,which not only help onboardthe users to this experience,but also provide the user an opportunityto ask better questions,suggest alternative waysof solving a problem, inspire,and probe deeper.All three of these strategiesreally put the usersin the center and at the controlof the experience by designing a UXthat brings the best out of workingwith AI productsand creating a collaborativeand human-centric experience.To establish a foundationof trust and for youto build more confidence in deployingyour GPT-powered applications,it's not only importantto build a human-centric UXbut also to build guardrailsfor both steerability and safety.You can think of guardrailsas essentially constraintsor preventative controls that sit betweenthe user experience and the model.They aim to prevent harmful and unwantedcontent getting to your applications,to your users, and also adding steerabilityto the models in production.Some of the best interaction paradigmsthat we've seen developersbuild have built safetyand security at the core of the experience.Some of our best models are the onesthat are most aligned with human values.We believe someof the most useful and capable UXbrings the best out of safetyand steerability for better, safer outcomes.To demonstrate an example of this,let's start with 
a simple prompt in DALL·E.Very timely for Christmas,to create an abstract oil paintingof a Christmas tree.DALL·E uses the model to enhancethe prompt by adding more detailsand specificity around the hues,the shape of the tree,the colors and brush strokes,and so on.Now, I'm not an artist,so I wouldn't have donea better job at this,but in this case,I'm using DALL·E as a partnerto bring my ideas to imagination.Now, you might be wondering,how is this a safety guardrail?Well, the same prompt enrichmentused to create better artifactsalso functions as a safety guardrail.If the model in this case detectsa problematic promptthat violates the privacyor rights of individuals,it'll suggest a different promptrather than refusing it outright.In this case, instead of generatingan image of a real person,it captures the essence and then createsan image of a fictional person.We shared one exampleof a guardrail that can helpwith both steerability and safety,but guardrails can take many other forms.Some examples of thisare compliance guardrails,security guardrails,and guardrails to ensurethat the model outputsare syntactically and semantically correct.Guardrails become essentially importantwhen you're building interfacesfor highly regulated industrieswhere there's low tolerancefor errors and hallucinationand where you have to prioritize securityand compliance.We built a great user experiencewith both steerability and safety,but our journey doesn't end there.-At this point, you've builta delightful user experiencefor all of your users that can manage around someof the uncertainty of these models.While this works really great as a prototype,when the types of queriesthat you'll be getting from your usersare pretty constrained,as you scale this into production,you'll very quickly start runninginto consistency issuesbecause as you scale out your application,the types of queries and inputsthat you'll get will start varying quite a lot.With this, we want to talkabout model consistency,which introduces the second partof our stack involving grounding the modelwith the knowledge store and tools.Two strategies that we've seenin our customers adopt pretty well hereto manage around the inherent inconsistencyof these models include one,constraining the model behaviorat the model level itself,and then two, grounding the modelwith some real-world knowledge using somethinglike a knowledge store or your own tools.The first one of these is constrainingthe model behavior itself.This is an issuebecause oftentimes it's difficultto manage aroundthe inherent probabilistic nature of LLMsand especiallyas a customer of our APIwhere you don't have reallylow-level access to the model,it's really difficult to managearound some of this inconsistency.Today, we actually introducedtwo new model-level featuresthat help you constrain model behavior,and wanted to talk to you about this today.The first one of these is JSON modewhich if toggled on will constrainthe output of the modelto be within the JSON grammar.The second one is reproducible outputsusing a new parameter named Cthat we're introducing into chat completions.The first one of these, JSON mode,has been a really commonly askedfeature from a lot of people.It allows you to force the modelto output within the JSON grammar.Often times this is really important to developersbecause you're taking the outputfrom an LLMand feeding it intoa downstream software system.A lot of times, in order to do that you'll needa common data formatand JSON is one of the most popular of these.While this is 
great, one big downfallof inconsistency hereis when the model outputs invalid JSONit will actually break your systemand throw an exceptionwhich is not a great experiencefor your customers.JSON mode that we introduce todayshould significantlyreduce the likelihood of this.The way it works is somethinglike this where in chat completions,we've added a new argument knownas JSON Schema.If you pass and type object into that parameterand you pass it into our API,the output that you'll be gettingfrom our systemor from the API will be constrainedto within the JSON grammar.The content field therewill be constrained to the JSON grammar.While this doesn't remove 100%of all JSON errors in our evalsthat we've seen internally,it does significantly reduce the error ratefor JSON being outputby this model.The second thing is getting significantlymore reproducible outputsvia a C parameter in chat completions.A lot of our modelsare non-deterministicbut if you look under the hood,they're actually three main contributorsto a lot of the inconsistent behaviorhappening behind the scenes.One of theseis how the model samples its tokens based offof the probability that it gets.That's controlled by the temperatureand the top P parametersthat we already have.The second one is the C parameterwhich is the random numberthat the model usesto start its calculations and base it off of.The third one is this thingcalled system fingerprintwhich describes the stateof our engines that are runningin the backend and the codethat we have deployed on those.As those change, there will be some inherentnon-determinants when that happens.As of today, we only give people accessto temperature and top P.Starting today,we'll actually be giving developers accessto the C parameter as in inputand giving developers visibilityinto system fingerprint in the responsesof the chat completions model.In practice, it looks something like thiswhere in chat completions therewill now be a seed parameterthat you can pass in which is an integer.If you're passing a seedlike one, two, three, four, five,and you're controlling the temperature setting itto something like zero,your output will be significantlymore consistent over time.If you send this particularrequest over to us five times,the output that you will be gettingunder choiceswill be significantly more consistent.Additionally, we're giving you accessto the system fingerprint parameterwhich on every responsefrom the modelwill tell you a fingerprint aboutour engine system under the hood.If you're getting the exactsame system fingerprint backfrom earlier responses,and you passed in the same seedand temperature zero you're almost certainlygoing to get the same response.Cool, so those are model-level behaviorsthat you can actually very quickly pick upand just try with even today.A more involved techniqueis called grounding the modelwhich helps reduce the inconsistencyof the model behaviorby giving it additional factsto base its answer off of.The root of thisis that when it's on its own a modelcan often hallucinate informationas you all are aware of.A lot of this is due to the factthat we're forcing the modelto speakand if it doesn't really know anythingit will have to tryand say something and a lot of the timesit will make something up.The idea behind thisis to ground the modeland give it a bunch of factsso that it doesn't have nothing to go off of.Concretely, what we'd be doing hereis in the input context explicitly givingthe model some grounded factsto reduce the likelihoodof hallucinations from the 
model.This is actually quite a broad sentiment.The way this might lookin a system diagram is like thiswhere a query will come infrom your user,hits our servers and insteadof first passing it over to our API,we're first going to do a round tripto some type of grounded fact source.Let's say we pass the query in there.Then in our grounded fact source,it will ideally return some typeof grounded fact for usand then we will then takethe grounded factand the query itself and pass it over to our API.Then ideally the API takes that informationand synthesizes some type of responseusing the grounded fact here.To make this a little bit more concrete,one way that this might beimplemented is using RAGor vector databases which is a very commonand popular technique today.In this example, let's say I'm buildinga customer server spotand a user asks, how do I delete my account?This might be specific to my own applicationor my own productso the API by itself won't really know this.Let's say, I have a retrieval servicelike a vector databasethat I've used to index a bunchof my internal documentsand a bunch of my FAQs about supportand it knows about how to delete documents.What I would do here first is doa query to the retrieval servicewith how do I delete my account.Let's say it finds a relevant snippetfor me here that says,in the account deletion FAQ,you go to settings,you scroll down and click here, whatever.We would then pass that alongwith the original query to our APIand then the API would use that factto ground some response back to the user.In this case, it would say,to delete your account,go to settings, scroll down, click here.This is one implementation,but actually, this can be quite broadand with OpenAI function calling in the API,you can actually use your own servicesand we've seen this usedto great effect by our customers.In this case, instead of havinga vector database,we might use our own APIor own microservice here.In this case, let's say a customer is askingfor what the current mortgage ratesare which of course,even our LMS don't know immediatelybecause this changes all the time.Let's say we have a microservicethat's doing some daily sync jobthat's downloadingand keeping trackof the current mortgage rates.In this case,we would use function calling.We would tell our modelthat has it access to this function knownas get_mortgage_rates(),which is within our microservice.We'd first send a request over to the APIand it would express its intent to callthis get_mortgage_rates() function.We would then fulfill that intent by callingour API with get_mortgage_rates().Let's say it returns somethinglike 8% mortgage ratesfor a 30-year fixed mortgageand then the rest looks very similarwhere you're passing that into the APIwith the original queryand the model is then respondingwith a ground response,saying something like, not great.Current 30-year fixed rates are actually at 8% already.At a very broad level, you're usingthis grounded fact source in a generic way,to help ground the modeland help reduce model inconsistency.I just wrote two different examples of this,but the grounded fact sourcecan also be other thingslike a search index even elastics earchor some type of more general search index.It can be something like a database.It could even be somethinglike browsing the internetor trying some smart mechanismthat grab additional facts.The main idea is to give somethingfor the model to work.One thing I wanted to call outis that the OpenAI Assistants APIthat we just announced today,actually offersan 
out-of-the-box retrieval setup for youto use and build on top ofwith retrieval built right inin a first-class experience.I'd recommend checking it out.Shyamal: So far we talkedabout building a transparenthuman-centric user experience.Then we talked abouthow do you consistently deliverthat user experiencethrough some of the model-level featureswe released today and thenby grounding the model.Now we're goingto talk about how do we deliverthat experience consistentlywithout regressions.This is where evaluatingthe performance of the modelbecomes really important.We're going to talk abouttwo strategies herethat will help evaluate performancefor applications built with our models.The first one is to create evaluation suitesfor your specific use cases.Working with many orgs,we hear time and time againthat evaluating the modeland the performanceand testing progressions is hard,often slowing down development velocity.Part of that problem is for developersto not thinkabout a systematic processfor evaluating the performanceof these modelsand also doing evaluations too late.Evaluations are reallythe key to success here.In measuring the performanceof the modelson real product scenariosis really essentialto prevent regressionsand for you to build confidenceas you deploy these models at scale.You can think of evalsas essentially unit testsfor the large language models.People often think of promptingas a philosophy,but it is more of a science.When you pair itwith evaluations,you can treat it likea software product or delivery.Evals can really transform ambiguous dialoguesinto quantifiable experiments.They also make model governance,model upgrades,much easier setting expectationsaround what's good or bad.Capabilities, evaluations,and performance really go hand-in-handand they should be the place whereyou begin your AI engineering journey.In order to build evals,let's say we start simpleand have human annotatorsevaluate the outputsof an application as you're testing.A typical approach in this caseis where you have an applicationwith different sets of promptsor retrieval approachesand so on and you'd want to startby building a golden test data setof evals by lookingat these responsesand then manually grading them.As you annotate this over time,you end up with a test suitethat you can then run in onlineor offline fashion or partof your CICD pipelines.Due to the natureof large language models,they can make mistakes,so do humans.Depending on your use case,you might want to considerbuilding evals to test for thingslike bad output formattingor hallucinations,agents going off the rails,bad tone, and so on.Let's talk about how to build an Eval.Earlier this year,we open-sourced the evals framework,which has been an inspirationfor many developers.This library contains a registryof really challenging evalsfor different specificuse cases and verticals,and a lot of templates,which can come in handyand can be a solid starting pointfor a lot of youto understand the kind of evaluationsand tests you should be buildingfor your specific use cases.After you've built an eval suite,a good practiceand hygiene here is to logand track your eval runs.In this case, for example,we have five different eval runs,each scored againstour golden test dataset,along with the annotation feedbackand audit of changes.The audit of changes could include thingslike changes to your prompt,to your retrieval strategy,few short examples,or even upgradeto model snapshots.You don't need complicated tooling to startwith tracking something like this.A lot 
of our customers startwith just a spreadsheet,but the point is each runshould be storedat a very granular levelso you can track it accordingly.Although human feedback and user evalsare the highest signal in quality,it's often expensiveor not always practical,for example, when you cannot usereal customer data for evals.This is where automated evalscan help developers monitor progressand test for regressions quickly.Let's talk about model-graded evalsor essentially using AI to grade AI.GPT-4 can be a strong evaluator.In fact, in a lot of natural languagegeneration tasks,we've seen GPT-4 evaluations to be wellcorrelated with human judgmentwith some additional prompting methods.The benefit of model-graded evals hereis that by reducing human involvementin parts of the evaluation processthat can be handled by language models,humans can be more focused on addressingsome of the complex edge casesthat are needed for refiningthe evaluation methods.Let's look at an exampleof what this could look like in practice.In this case, we have an input queryand two pairs of completions.One that is the ground truth and onethat is sampled from the model.The evaluation here is a very simpleprompt that asks GPT-4to compare the factual contentof the submitted answer with the expert answer.This is passed to GPT-4to grade, and in this case,GPT-4's observation is there's a disparitybetween the submitted answerand the expert answer.We can take this further by improvingour evaluation promptwith some additional prompt engineering techniqueslike chain of thought and so on.In the previous example,the eval was pretty binary.Either the answer matchedthe ground truth or it did not.In a lot of cases,you'd want to think about eval metrics,which are closely correlatedwith what your users would expector the outcomes that you're trying to derive.For example, going back to Sherwin's exampleof a customer service assistant,we'd want to eval for custom metricslike the relevancy of the response,the credibility of the response, and so on,and have the model essentially scoreagainst those different metricsor the criteria that we decide.Here's an example of what that criteriaor scorecard would look like.Here we have provided GPT-4 essentiallythis criteria for relevance, credibility,and correctness, and then use GPT-4to score the candidate outputs.A good tip here is a show rather than tell,which basically including examplesof what a score of oneor a FI could look like,would really help in this evaluation processso that model would really appreciatethe spread of the criteria.In this case, GPT-4 has effectively learnedan internal model of language quality,which helps it to differentiatebetween relevant text and low-quality text.Harnessing this internal scoringmechanism allows usto do auto valuationof new candidate outputs.When GPT-4 is expensiveor slow for evals,even after today's price drops,you can fine-tune a 3.5 turbo model,which essentially distills GPT-4's outputsto become really goodat evaluating your use cases.In practice, what this meansis you can use GPT-4to curate high-quality datafor evaluations,then fine-tune a 3.5 judge modelthat gets really good at evaluating those outputs,and then use that fine-tuned model to valuatethe performance of your application.This also helps reduce someof the biases that comewith just using GPT-4 for evaluations.The key here is to adoptevaluation-driven development.Good evaluations are the oneswhich are well correlatedto the outcomes that you're tryingto derive or the user metricsthat you care 
about.They have really high end-to-endcoverage in the case of RAGand they're scalable to compute.This is where automated evaluations really help.-At this point,you've built a delightful user experience,you're able to deliver itconsistently to your usersand you're also able to iterate on the productin confidence using evaluations.If you do all this right,oftentimes you'll find yourselveswith a product that's blowing upand really, really popular.If the last year has shown us anything,it's that the consumer appetiteand even the internal employee appetitefor AI is quite insatiable.Oftentimes, you'll now start thinkingabout how to manage scale.Oftentimes, managing scalemeans managing around latencyand managing around cost.With this, we introducethe final part of our stack,known as orchestration,where you can manage around scaleby adding a couple of additional mechanismsand forks into your application.Two strategies that we've seenin managing costsand latency involveusing semantic cachingto reduce the numberof round trips that you're takingto our API as well as routingto the cheaper models.The first one of these is knownas semantic caching.What semantic caching looks likein practice from a systems perspective,it you're going to be addinga new layer in your logicto sit between us and your application.In this case, if a query comes inasking when was GPT-4 released,you would first go to your semantic cacheand do a lookup thereand see if you have anything in your cache.In this case, we don't and then you wouldjust pass this request over to our API.Then the API would respondto something like March 14th, 2023,and then you'd save thiswithin your semantic cache,which might be a vector databaseor some other type of store.The main point here is you're savingthe March 14th, 2023 responseand keying it with that queryof when was GPT-4 releasedand then you pass this backover to your users.This is fine, but let's say,a month or a week from now,another request comes in wherea user asks GPT-4 release date?Now, this isn't the exact same querythat you had before,but it is very semantically similarand can be answeredby the exact same response.In this case, you would doa semantic lookup in your cache,realize that you have this alreadyand you'd just return back to the userwith March 14th, 2023.With this setup,you've actually saved latencybecause you're no longer doinga round trip to our APIand you've saved costsbecause you're no longer hittingand paying for additional tokens.While this works great,oftentimes, it might be a little bitdifficult to manageand there are often even more capable waysof managing cost and latency.This is where we start thinkingabout routing to cheaper modelsand where orchestration really comes into play.When I talk about routingto cheaper models,oftentimes the first thing to think aboutis to go from GPT-4 into 3.5 Turbo,which sounds greatbecause GPT-3.5 Turbo is so cheap, so fast,however, it's obviously not nearlyas smart as GPT-4.If you were to just dragand drop 3.5 Turbo into your application,you'll very quickly realizethat you're not deliveringas great of a customer experience.However, the GPT-3.5 Turbo Finetuning APIthat we released only two months agohas already become a huge hitwith our customers and it's beena really great way for customersto reduce costs by fine-tuning a customversion of GPT-3.5 Turbofor their own particular use case,and get all the benefitsof the lower latency and the lower cost.There's obviouslya full talk about fine-tuning earlier,but just in a nutshell, the main 
idea hereis to take your own curated dataset.This might be something like hundredsor even thousands of examples at times,describing the model on how to actin your particular use case.You'd pass in that curated datasetinto our fine-tuning API maybe tweaka parameter or two hereand then the main output here is a customfine-tuned version of 3.5 Turbo specificto you and your organizationbased off of your dataset.While this is great, oftentimes actually,there's a huge activation energyassociated with doing thisand it's because it can be quite expensiveto generate this curated data set.Like I mentioned,you might need hundreds, thousands,sometimes even tens of thousandsof examples for your use case,and oftentimes you'll be manuallycreating these yourselfor hiring some contractorsto do this manually as well.However, one really cool methodthat we've seen a lot of customers adoptis you can actually use GPT-4to create the training datasetto fine-tune 3.5 Turbo.It's starting to look very similarto what Shyamal just mentionedaround evals as well,but GPT-4 is at an intelligence levelwhere you can actuallyjust give it a bunch of prompts,it'll output a bunchof outputs for you here,and that outputcan just be your training set.You don't need any humanmanual intervention here.What you're effectively doing hereis you're distilling the outputsfrom GPT-4and feeding that into 3.5 Turboso it can learn.Oftentimes, what this doesis that in your specific narrow domain,it helps this fine-tuned versionof 3.5 Turbo be almost as good as GPT-4.If you do take the effort in doing all of this,the dividends that you get downthe line are actually quite significant,not only from a latency perspective,because GPT-3.5 Turbo is obviously a lot faster,but also from a cost perspective.Just to illustrate this a little bit more concretely,if you look at the table,even after today's GPT-4 price drops,a fine-tuned version of 3.5 Turbois still 70% to 80% cheaper.While it's not as cheapas the vanilla 3.5 Turbo,you can see it's still quite a bit off from GPT-4,and if you switch overto fine-tuned 3.5 Turbo,you'll be saving on a lot of cost.-All right, so we talked about a frameworkthat can help you navigatethe unique considerationsand challenges that comewith scaling applications built with our models,going from prototype to production.Let's recap.We talked about howto build a useful, delightful,and human-centric user experienceby controlling for uncertaintyand adding guardrails.Then we talkedabout how do we deliverthat experience consistentlythrough grounding the modeland through some of the model-level features.Then we talked about consistentlydelivering that experiencewithout regressionsby implementing evaluations.Then finally, we talked about considerationsthat come with scale,which is managing latency and costs.As we've seen,building with our models increasessurface area for what's possible,but it has also increasedthe footprint of challenges.All of these strategies we talked about,including the orchestration partof the stack,have been converging intothis new discipline called LLM Opsor Large Language Model Operations.Just as DevOps emergedin the early 2000sto streamline the software development process,LLM Ops has recently emergedin response to the unique challengesthat are posed by building applications with LLMsand they've become a core componentof many enterprise architecture and stacks.You can think of LLM Opsas basically the practice, tooling,and infrastructure that is requiredfor the operational managementof LLMs end-to-end.It's 
a vast and evolving field,and we're still scratching the surface.While we won't go into details,here's a preview of what this could look like.LLM Ops capabilities help addresschallenges like monitoring,optimizing performance,helping with security compliance,managing your data and embeddings,increasing development velocity,and really accelerating the processof reliable testingand evaluation at scale.Here, observability and tracingbecome especially importantto identify and debug failureswith your prompt chains and assistantsand handle issues in production faster,making just collaborationbetween different teams easier.Gateways, for example,are important to simplify integrations,can help with centralized managementof security, API keys, and so on.LLM Ops really enable scalingto thousands of applicationsand millions of users,and with the right foundations here,organizations can reallyaccelerate their adoption.Rather than one-off tools,the focus should be really developingthese long-term platformsand expertise.Just like this young explorerstanding at the threshold,we have a set of wide fieldof opportunities in front of usto build the infrastructureand primitivesthat stretch beyondthe framework we talked about today.We're really excited to help youbuild the next-generation assistantsand ecosystem for generations to come.There's so much to build and discover,and we can only do it together.Thank you.\n"