An Agentic Mixture of Experts for DevOps with Sunil Mallya - 708

The MoE: A Layered Approach to Language Understanding

Flip AI's MoE — a mixture-of-experts LLM system built for DevOps observability rather than general-purpose language use — breaks down into roughly five distinct parts. Everything is abstracted: agents call a single shared LLM interface, and each layer is defined by its own set of prompts and tasks. That one underlying LLM ties together the agents' reasoning about task and semantics, serving as a kind of shared "common sense" across the layers.
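The episode describes this layering only at a high level, so the following is a minimal Python sketch of the pattern rather than Flip AI's actual code: each agent owns its own prompt template, and every layer calls the same shared LLM interface. The names (SharedLLM, Agent, Actor) and the stubbed completion function are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SharedLLM:
    """Single LLM interface that every layer calls into."""
    complete: Callable[[str], str]  # e.g. a wrapper around any hosted or local model


@dataclass
class Agent:
    """Lowest layer: one well-defined task expressed as a prompt template."""
    name: str
    prompt_template: str

    def run(self, llm: SharedLLM, **inputs: str) -> str:
        return llm.complete(self.prompt_template.format(**inputs))


@dataclass
class Actor:
    """Middle layer: composes several agents into one capability."""
    agents: List[Agent]

    def run(self, llm: SharedLLM, **inputs: str) -> List[str]:
        return [agent.run(llm, **inputs) for agent in self.agents]


if __name__ == "__main__":
    # Stub LLM so the sketch runs without any model behind it.
    llm = SharedLLM(complete=lambda prompt: f"[stub completion for: {prompt[:40]}...]")
    summarize = Agent("log_summarizer", "Summarize this log excerpt:\n{log}")
    actor = Actor(agents=[summarize])
    print(actor.run(llm, log="ERROR: connection pool exhausted")[0])
```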

One key aspect of the MoE is how its training has evolved. Initially, the team trained the model from the ground up, which meant feeding it large amounts of general text so it could learn English alongside log data. As open-source models improved, they switched to taking off-the-shelf pre-trained models and layering their own extended pre-training and fine-tuning on top, extending the MoE's capabilities without retraining from scratch.
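The episode does not name the specific base models or tooling involved, but the general workflow — take a pre-trained causal LM and continue training it on domain text — might look roughly like the sketch below, which uses the Hugging Face Transformers stack; the base model name and corpus file are placeholders, not what Flip AI actually uses.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "EleutherAI/pythia-1b"  # placeholder; any small causal LM would do
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical domain corpus: one log / metric / trace snippet per line.
raw = load_dataset("text", data_files={"train": "devops_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-ft", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized["train"],
    # mlm=False gives standard next-token (causal) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```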

This approach decoupled the system from any specific LLM, making it easier to adapt as models change over time. By treating models as a toolbox — a library of capabilities rather than a single monolithic system — the team has built a more flexible and robust approach to language understanding, and that flexibility is key to the MoE's ability to tackle complex tasks and domains.
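One way to picture that decoupling is a narrow backend interface that the rest of the stack depends on, so the underlying model can be swapped without touching anything above it. The sketch below is a hypothetical illustration of that pattern, not Flip AI's implementation; both backends are stubs.

```python
from typing import Protocol


class LLMBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


class InHouseMoEBackend:
    def complete(self, prompt: str) -> str:
        return "routed through the in-house mixture of experts (stub)"


class FutureOffTheShelfBackend:
    def complete(self, prompt: str) -> str:
        return "served by a newer off-the-shelf model (stub)"


def summarize_incident(llm: LLMBackend, logs: str) -> str:
    return llm.complete(f"Explain the likely root cause given these logs:\n{logs}")


# Swapping the model is a one-line change at the call site;
# nothing built on top of the interface needs to know.
print(summarize_incident(InHouseMoEBackend(), "ERROR: redis timeout"))
print(summarize_incident(FutureOffTheShelfBackend(), "ERROR: redis timeout"))
```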

One interesting aspect of the MoE is its use of individual pre-trained models as building blocks. These models may come from existing open-source releases or be developed in-house, but they must be carefully wired together to work as one system, which requires understanding how each model's tokenizer and output format interacts with the others.

Combining these different models into a cohesive system has required some creativity. For example, the output of one model is passed as input to another so that the chain can produce more complex or nuanced responses, letting the team leverage each model's strengths while still producing coherent results.
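Mallya notes that the component models use different tokenizers, so the hand-off between them effectively happens in text space. A rough sketch of that chaining, using two small public models purely as stand-ins for whatever components are actually involved, might look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Two small public models with genuinely different tokenizers (placeholders).
tok_a = AutoTokenizer.from_pretrained("distilgpt2")
model_a = AutoModelForCausalLM.from_pretrained("distilgpt2")
tok_b = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model_b = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

prompt = "Exception: java.sql.SQLTransientConnectionException at"
enc_a = tok_a(prompt, return_tensors="pt")
out_a = model_a.generate(**enc_a, max_new_tokens=30, do_sample=False,
                         pad_token_id=tok_a.eos_token_id)
intermediate_text = tok_a.decode(out_a[0], skip_special_tokens=True)

# The hand-off is done through decoded text because the tokenizers are incompatible.
enc_b = tok_b(f"Summarize this stack trace:\n{intermediate_text}", return_tensors="pt")
out_b = model_b.generate(**enc_b, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok_b.eos_token_id)
print(tok_b.decode(out_b[0], skip_special_tokens=True))
```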

The MoE also includes specialized components, such as a time-series model that surfaces patterns and anomalies in metric data. Because LLMs handle raw numbers poorly, these components interpret the data first and feed their findings to the LLM as text, which then produces more accurate and informative responses.
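As a toy illustration of that flow, the sketch below uses a simple z-score spike detector in place of the far more sophisticated time-series expert described in the episode, turns the finding into a sentence, and folds it into the prompt handed to a (stubbed) LLM. The metric values and function names are made up for the example.

```python
import statistics
from typing import Callable, List, Optional


def describe_spike(name: str, series: List[float], threshold: float = 3.0) -> Optional[str]:
    """Return an English description if the latest value is an outlier, else None."""
    mean = statistics.fmean(series[:-1])
    stdev = statistics.pstdev(series[:-1]) or 1e-9
    z = (series[-1] - mean) / stdev
    if z > threshold:
        return (f"{name} jumped from a baseline of ~{mean:.1f} to {series[-1]:.1f} "
                f"({z:.1f} standard deviations above normal).")
    return None


def ask_llm_about_incident(complete: Callable[[str], str],
                           findings: List[str], log_excerpt: str) -> str:
    prompt = ("Given these metric findings:\n- " + "\n- ".join(findings) +
              f"\n\nand this log excerpt:\n{log_excerpt}\n\nWhat is the likely root cause?")
    return complete(prompt)


db_connections = [52, 55, 50, 53, 51, 54, 240]  # made-up metric values
finding = describe_spike("db_connection_count", db_connections)
print(ask_llm_about_incident(lambda p: f"[stub LLM answer for prompt of {len(p)} chars]",
                             [finding] if finding else [], "ERROR: pool exhausted"))
```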

One key benefit of this approach is the ability to decouple the system's components. By focusing each component on a specific task or domain, rather than building a single all-encompassing model, the team has developed a modular, adaptable system in which individual pieces can be updated or replaced over time without affecting the MoE's overall behavior.

The use of off-the-shelf models also allows the team to take advantage of advances in language technology without having to reinvent the wheel. By building on existing capabilities, rather than starting from scratch, they are able to accelerate their progress and achieve more complex goals. This is particularly important for tasks that require a high level of domain knowledge or specialized expertise.

Overall, the MoE represents an interesting and innovative approach to domain-specific language understanding and generation. By combining individual models and techniques in creative ways, the team has developed a powerful system that can tackle a wide range of DevOps challenges and environments.

"WEBVTTKind: captionsLanguage: enone of the emerging patterns is defining clear roles and boundaries and interfaces that part is what's lacking today in most agentic sort of workflows or orchestration systems and and that's what we said is like no we got to get this right because it can't be like out of 10 runs we S one magic it has to work nine out of 10 times right like or want to get 10 to 10 out of 10 but like that's at least a start we cannot be one out of 10 so it was really that effort that we put in into okay what are the fundamental pieces all right everyone welcome to another episode of the tml AI podcast I am your host Sam charington and today I'm joined by Sunil MAA Sunil is CTO and co-founder of flip AI before we get going be sure to take a moment to hit that subscribe button wherever you're listening to Today's Show Sunil welcome to the podcast thanks Sam uh good to see you again after many years good to see you for sure it has been a while uh there are going to be a few folks listening who were at our twion event back in 2019 uh and they will remember that that you put on an amazing deep racer demo SLC contest for us at the conference uh that was a lot of fun um and you did that because you were on that deep racer team at at AWS um tell us a little bit about what you've been up to since then yeah uh deep Razer was a crazy ride and still seems to be going strong um so it's amazing that they continue to host contests around the world around deep racer and like 120,000 people that apparently last year uh you know participated that's crazy wow wow yeah yeah uh since then um sort of uh uh you know ventured into NLP uh which was sort of gaining a lot of traction uh with all of uh sort of uh you know the fine-tuning uh or building those foundations was sort of emerging with uml fit and you know ber sort of coming into that picture and that sort of got me thinking that hey this is a this is sort of a rocket ship that's that's that's a uh you know building up and and I was so wrong it wasn't it was more like a a voyager uh or you know sort of outer space mission like that you know that that nobody could have predicted so uh it was a sort of a lucky break to be on that uh ship I would say and you did that by switching over from Deep learning from Deep RAC from reinforc compr team at AWS yeah so I took over the comprehend team and and um then eventually laid the foundation for what's now Bedrock um so it was a quite a crazy journey to sort of you know Bert was considered llm in in our world back then the term didn't exist but it it was like you still had to put all that compute to train that and then suddenly uh you know uh you can probably train Bert on your laptop Now with uh you know five minutes of compute time that's awesome and since then you've gone on to found a company co-found a company what what's flip AI up to yeah flip AI uh we are observability AI um so the thesis around this is um you know all majority of our sort of team has been building software at scale maintaining five9 of availability and one of the challenges always was the operations part of like how do you you know run a service that's uh always up and that you know contributes to hair loss uh Lo loss of sleep and many other sort of side effects so uh we uh sort of like hey this is a pain we know really well uh llms are going to you know coding is an obvious sort of uh I I say developers love coding it's what comes after it is what they hate so let's go after that and solve that sort of the Genesis of the 
company I always need to ask when I hear uh AI observability do you think of yourself as primarily observability for AI or AI for observability yeah uh it's it's um you know the whole ml Ops and AI Ops and AI for it's all convoluted like rather you know that that isn't like a standard way to sort of describe but we're we're basically taking the pain from the developers which is uh taking all the absorb data which is like metrics traces logs events and making meaning out of that so when a when you get that page page uh that something is broken we tell you exactly what is broken and why is it broken so uh uh that's the that's the definition I would say I guess maybe another way to ask the question is like who the target developer profile like are you like going after someone who would be using a weights and biases or are you going after someone who would be using a Splunk or a honeycomb yeah uh it's the latter so it's the flying AI to like the traditional devops it observability problem so exactly yes very cool very cool awesome and so uh like what like how do you see your unique approach to that given that there are the Honeycombs and the spunks of the world that are already out there data dogs and who knows very true um I think what's interesting is uh majority of the companies don't use a single Tool uh so your data is spread across different tools so they they end up using all the names you cribed and ultimately when something is broken people have to go look at all of these sources and sort of Stitch that story together as to you know little breadcrumbs all along uh and and that's a pretty tedious process and what we are able to do is sort of be that intelligence layer on top of all of these tools and do the querying for you do the uh data rangling and understanding and reasoning okay this is broken because these two other things are broken and they're putting pressure so um you know it's typically when something breaks it's not necessarily the service that you're sort of debugging is it's something that is Downstream five levels down that is broken and it's extremely hard to find uh even with the existing tools um so that's what we make simple and and on top of that like what we've done is we built our own llm from the ground up uh and trained it on you know 100 billion tokens of like devops data so it's very domain specific um so it's you know it doesn't need to know what Napoleon does or hasn't done uh it's very focused on just devops and and and and really trying to solve that pain um so that's the unique approach and then we're able to deploy uh you know in your VPC on Prem wherever the entire like flip stack deploys uh which is really important for for a customer profile uh uh being Enterprises you want all of the data governance because the this data can have sensitive information so we give our customers all the necessary guard rails in terms of uh controlling the data flow and governance what does training data look like for a devops focused llm are you just like throwing log files at uh an llm and then like what's the value of an llm that can you know predict you know the next madeup number in a log file right uh you know um we actually went after training um we've done a pro different kinds of training uh but one of the I like to say you know machines talk to each other with apis and they express pain in logs uh and that's that's that's essentially what we model that's essentially what we are modeling is is that pain that uh the machines are expressing um so it's a it's it's 
sort of English but it's not really it's its own language um so it's very it's important like you know even with the latest models and releases they'll still fumble they don't understand that they're not necessarily it's sort of like you know Pig Latin sort of like you know it's it's a half language sort of so you you sort of need to fundamentally understand that Lang language uh to be able to interpret that so which is why we train it um you know ground up but it it's also not just logs right like so there's metrics like you're tracking metrics you have Trace which is graph data but it's also got code honestly because uh there are pieces of code in in log as exception stack traces when things blow up correct right like so you you sort of have to understand so uh we sort of Co this term we call it co melt so we just code and like melt data melt there's like metrics and what's that AR metric events events logs and traces traces traces yeah so we add a CO with code uh uh and and so that's what we train ground up so our training data is all of these uh modalities uh so to speak um and a lot of this is available on the internet uh but a lot of uh but it's more generic like uh but we've taken uh sort of an approach of like curating data and I I come from the old school ml like been doing ml for I don't know 15 16 years uh so back when you know uh well not back when I still label data so it's it's been a it's been a practice that I haven't lost touch with um so it's important to curate your data expert labeling make sure that data so sort of use uh whatever is available on the internet as this uh way of um pre-training sort of understanding the domain and then you sort of start specializing by using data sets that are highly curated uh labeled by experts um but sometimes like yeah I want you to dig into that more because my first thought when you said training in llm on this coel data was like I I was trying to think through like how and why that would be valuable because I would think that uh in order to do what I would imagine you'd want to do like in the observability domain would require you know not unsupervised you know learning of the structure of a log file but like supervised like when this happens you know that's correct you know this class of problem correct so talk a little bit about like how that you know that end to end yeah comes together yeah so the the first phase of sort of training you can think about the pre-training is just understanding what sort of uh you know just regular pre-training right like what's the next word or predicting the masked word and so on uh but then um as I said like sort of you know I sort of jokingly said about the pain being in logs but it's actually also in metrics and other places and and you sort of have to look at both of them right like with this graph is showing a certain data uh and then you've got logs that are showing so they have to be uh they're often telling you uh slightly different sides of the story but you got to use both to complete that story and so we use like you know joint training uh of this data uh to be able to uh make continuous meaning out of that um and and another um sort of thing that uh is unique what we do is uh I like to say like you know pre-training is like graduating from high school and then supervised fine-tuning is graduating from college and um but you know you can't put the smartest person who graduated from college and handle production incidents because that you need sort of real life scars and we sort of 
induced that to our llms by putting them in like a training gym and and this is sort of my reinforcement learning background coming in so we have this uh uh training where we actually bring up applications and we use another llm to break these applications so we actually sort of simulate code injection or fault injections into the infrastructure and because we know what we've created we sort of use that to oh did you get it right did you predict like or did you did you predict like was that the issue and so we can then use reinforcement learning to sort of help you uh help guide the model in making the right decisions and and that's super important because uh there's only so much data you're going to find on the internet or you can label you need an automated way to sort of scale your training um so this is sort of uh like our chaos gym that we've built for for the models to get like as good a zero shot as possible um and architecturally uh as well like what we've done is we recognize that each modality needs to be treated differently um is you know say code and logs are say predominantly you know you can still use the same time series llms are really bad at time series like they just they just don't understand numbers uh and when you look at like you know rcas and Report there's a lot of time and well this happened at this time this number went up this number went down gota do a lot of that um so we we sort of came up with okay you know you can't have a single model how do you sort of build this mixture of expert and and we sort of went with this hybrid approach of like well time series is not going to be it has to be its own little component but then attached to the mixture of experts so time series is uh uh not a say a traditional Transformer but the rest of the parts are Transformers so we end up like I don't know I call it like a uh you know single like it's a multi- decoder approach like so you have different decoders of the data or the interpretation uh to be able to uh make the most meaning out of it um so yeah it's a lot of it's taken a lot of experimentation over the last uh two and a half years to get to where we are interesting interesting when you talk about integrating time series and llms are you like there are folks that are trying to do time series with Transformers and you know depending on who you talk to like the reports of success are either high or low but you know that notwithstanding um it sounds like you're like incorporating in more traditional things like that's the impression like ARA that kind of stuff or like what what no not quite ARA like a lot more advanced but like um it's using them as an input to inform the llm so I would say um you're sort of using like a collection of traditional models to uh give more meaningful input to the LM rather than a raw time series so um you know one of the fundamental problems I feel with like let's say llm is doing math like that's a really popular sort of topic and you know you have benchmarks and um you know uh more like bench marketing I would like to say uh and and one one of the things is like no like the llms don't understand numbers together right like you know five is a different number 54 is a different number and as in it's 54 you know it's Collective and and that sort of doesn't quite exist because a tokenization uh of um you know just fundamentally doesn't understand and I I honestly feel like um I have this theory is like I call it the sun cost fallacy of llms where uh we we've LMS are so good and have done so 
and so close to you know breakthroughs with like numbers that we're like Hey we're not going to go back and fundamentally rethink that we need a new tokenizer something that understands numbers fundamentally so we can actually build this the right way uh you know uh with the right building blocks because we feel like we're so close so um everybody is like maybe if I just do this one thing head down push harder right like maybe if I add the period or exclamation at the end of my prompt maybe it gets everything correct right like uh and I think time and again like the paper coming out like I mean even Apple had a recent paper on on on debugging uh GSM 8K with like perturbations right like you know you see like every llm is like oh my God like the results like the variance of the results are just super widespread uh so I think you know you just need to rethink like what are llms good at so uh we sort of take the traditional ones and convert them into uh what lm's good at is Translating that into actual text that are more meaningful like if you actually so it's actually getting close to the dimension that llms can understand and use that in all of that uh so that's one aspect the other aspect is um when you want to deal with number you got to you got to like we have existing things that are can give you a finite answer so why do you why are you going and suddenly changing your stack and adding non-determinism right uh and I think what's great is we you know tool use or function calling so a lot of what we do is um flip ends up generating its own DSL with the llms and the DSL does have things like well here go call this function function or use this tool to do the math do the aspects so that gives us really good results in terms of not screwing up the numbers because you don't want to uh you have a very sort of number heavy output at the end where you know the database connections went up by this much which put pressure on uh you know this tier where you started seeing higher latency and which ultimately caused X to happen so that entire Stitch is a lot of numbers and by doing what I mentioned uh uh two approaches that I mention we end up getting very high accuracy here you know when you were speaking earlier about um the relationship between logs and graphs and telling a story you know you went on to talk about time series data which is like what underlies those graphs but but uh I'm also wondering if you have experimented with using vm's Vision language models to um take the graphs themselves as data and do you see any promise in that yeah we we did that um we experimented uh with that I think we got some decent results but ultimately one of the challenges with WMS uh it's more like the data that how you curate is the resolution of the images what happens is you can get like so suddenly like as you sort of zoom out suddenly there appears to be a peak when in reality you got to take the interpretation of the well it's only going from 0.1 to 0.15 versus it's going from 0.1 to nine right like it can still so that sort of focusing on that little piece of information which is slightly different to me like and and it's very sparse like unlike WMS and text that where you see where there more of a description Etc um uh uh we we we didn't quite get it right I would say uh uh but there's always This Promise of where you get like sort of infinite resolution so to speak with actually having the raw data uh which is much easier and more um uh uh you know reliable in terms of operating on uh so hence like 
we sort of uh you know paused the whole WM approach uh however the WM approach I I do think could be really interesting in finding visual patterns uh that are much harder to find you know because it's again it's a uh it's a much higher level representation of data condensing when you operate with raw data you're not going to have you have way too many dimensions again it acts against you especially when you want to find patterns yeah it's an interesting give and take between um you know using inspiration from what you know how the human would do it and you know doing it the way computers are best at doing it like a human would drown in the Raw data that's why we have the charts and the charts help us drill in on on what's actually happening but you know maybe that's not the right approach for a computer Bingo like that was certainly an intuition in terms of like oh we should be using VMS and what we found was um you know for a small subset they were good but like as you start like actually interpreting the data the scales and uh uh sort of like uh think about like visually well that as you compress data and you decompress like the timeline starts getting a little confusing you mentioned uh the DSL and Tool use as part of the way the the tool operates the you know the the product offer operates when I hear tool use I start to think of like agentic behaviors agentic workflows do you think of the system in an agentic way or as an agentic tool or absolutely and less in this the marketing sense in that everybody's saying as an agent now yeah I I'd like to think uh I I was calling them agentic before again it's a very reinforcement reinforcement learning term for me so to me it's uh it's very goal oriented right like ultimately like what we do is well we gotta find these rcad that's the goal uh we we gotta go uh find the RCA right like uh root cause analysis of um an incident has a root cause analysis so that's the ultimate goal as we' found okay why this happened uh and along the way we're sort of collecting you know observations uh from the observa uh you know data and and uh processing that and then guiding what is the next step we need to take which is very much a reinforcement thing agents uh you know changing states uh through their observations and getting rewards from the environment uh to guide them and and get to uh get to the role right uh sorry goal um so that that to me was sort of uh agentic in nature uh but we didn't necessarily think of this as like agents as as they appear today it was more like um okay we have we have a bunch of tasks that need to be done we got to break the tasks down uh now we we need to make sure the Hops between these tasks or the handh holding between these Tas tasks need to happen uh in a more reliable way so uh so in many ways like we actually came up with this framework uh we call us like agents which are the lowest level of abstraction we have actors and uh actors of this next level of abstraction which are sort of Cooperative agents like so the actors are able to take um a collection of agents and perform a task and then you have the director which is the orchestration layer as to hey um here are I'm going to assemble these actors to do something so in a more practical sense like um we can think of the directors as uh like well you got an aw sorry actors is like an AWS actor a Kafka actor uh so those are experts in in these things and then they have lower level agents like well I'm going to call query spung I'm going to uh uh I'm going to go look at 
uh uh you know some other piece of data and so on so really uh building repeatable patterns right and and well- defined interfaces so uh it's like build like I'd say like we sort of heavily applied software engineering principles into um how to build these uh say uh you know quote unquote agents and uh them to operate together and so is the the director is that more the like the infrastructure oriented piece or is that it's more of the planner like think about like some some yeah it's uh okay so because what what happens is you're going to discover information along the way so you need to backtrack and perhaps like change your hypothesis or or explore a different hypothesis that you have as to what went wrong uh so you need something that AR Straits saying well doesn't look like AWS had an issue uh maybe we should look at redis or Kafka or other piece in your infrastructure I guess where I was going with infrastructure was like you you are describing this like multi collaborative agentic thing and you know part of the question that jumped up for me is like are those things all running you know in one process or in one box or something or are they running like in different places and like what tells what to run where like is that even an interesting and important factor here yeah they do run in different processes because you want to parallelize uh the execution and and they sort of almost run asynchronous because uh you know while you're fetching data you know we're doing some other computation and so on so we sort of um uh a quote unquote like a a dag like system um practically speaking like dag generation and execution is hard uh so uh we we sort of like we sort of flattened the dag in the sense that uh it's okay to sort of uh redo certain things as you go on because if you if you're what heads up is if you're keeping all this memory State like now you introduce all of this concurrency and other you know uh pattern that that that slow you down or can induce more bugs so you want to have like so again like just simplifying the the part by rather than a complete dag execution uh if you need to revisit something you could probably revisit it again all of your stuffs are item potent so just run it again and exactly exactly it's okay that you operated on the data you made a copy of the data think of it as copy on write like okay I I I take that data I if I modify it now now I have another copy Etc if you actually end up keeping or or operating on the same copy and modifying now you got everybody to coordinate uh uh you got so many asynchronous parts uh sort of wanting to access and it would it would work in a great single threaded environment but if you're doing multiprocessing and so many things just introducing more complexity uh that you need in like so we keep the software engineering part simple and leave the complexity of the agent actor orchestration to itself I guess taking a step back to you know this RCA that you're trying to identify um you know historically tools like Splunk and and those before Splunk even like are trying to do like correlation statistical correlation and like it's a very different like way of looking at the problem than like an llm and a reasoning system like how do how do you like connect those ideas yeah um I think um you know statistics that you know sort of give you hints uh but they they don't necessarily give you okay the sort of uh causal nature of things right like so I think what what was important um you know as as we were going and and building this 
was uh mimic the flow like to me I mean I've been fortunate enough to sort of build many different ml systems and the only way I sort of could think was okay how does a human do it replicate that right like so to me it's like the simplest thing that I could come up with in terms of a framework and then maybe the machine can do it differently and better but almost the first approach always is hey what do the humans do so in a way what we did did is well if you think about a water room situation where somebody's solving these incidents you always have bunch of experts everybody's chiming in on this is what I see this is what I don't see this is and then uh you know a core group of people are making determinations on how all of this data stitched together hey okay that's unusual that your service is having this issue I don't think it should and then suddenly you realize well that's a shared component that both of us use and now that becomes the culprit right so that doesn't necessarily show up with like basic statistics unless you sort of have like basic statistics will show every like when an incident everything is broken so everything must be wrong and which is not true so now you need to go into the causal connections of like all right you are inflicting pain on me but it's not you it's somebody else who is this chain of uh you know uh in inflicting pain uh so that's that's sort of why uh uh the approach that we take and and is turning out to be you know Superior to you know traditional approaches continuing with that war room analogy like the individual you know teams are seeing anomalies popping up in their systems or events and you know they're bringing that up to you know kind of this Core group and I think of one of the things that that core group is doing is that they've got like this context in their heads of you know what's essentially like the runbook like how the different things connect and so does your system like does it need to have that visibility into that run book and if so like do I have to Define it in some you know do you have to be a runbook system are we at a point now with like you know for given domains like an AWS app or a kubernetes app like the that's all like you know softwar defined architectures that you can infer all that is that stuff you look at like how does that all come together yeah uh very much the latter where we uh are automatic in terms of generating that runbook um uh because I think these patterns are known and we've been able to generalize and with all the training of course if you have a very nuan system that you know uh isn't necessarily a common pattern then probably we need tuning uh but I would say for vast majority of cases it's uh our de our Our runbook Generation suffices and we're able to do zero shot uh uh for majority of our customers uh well it wasn't the case a year ago uh now it is very much so it's it's uh We've is it zero shot just from observation or is it zero shot including you know slurping up some configuration that's able to be found somewhere yeah uh both actually so um and what helps is a lot of these observ systems have a standard way of defining things um uh so uh you know how to get the service Maps or you know how you you know so you can sort of build on top of existing stuff or AWS you know there's a there's a thousand ways but there's just thousand you know it's a finite thousand ways of doing things uh so uh you're able to sort of uh build on top of that uh yeah and zero shot I guess I was I was mostly trying to get 
at like it is does the system you know either you know need or take advantage of some definition of how services are interconnected to form the the landscape and it sounds like yes yeah yeah and and um you know we're good at reading say cloud formation or terraform Etc which which sort of gives you the definition so you can use that as a base knowledge uh while you operate so um it's good that a lot of things are standardized and and effort is being put but it's just extremely complex for people to keep everything in in their head you kind of alluded to the set of architectures being somewhat standardized nowadays like how uh standardized are things I would expect that there's actually quite a lot of variety in the way people would choose to deploy things but maybe if you just say hey we support kubernetes or we support AWS or you have to have you know honeycomb or something then that's ifies the scope the standardization I was referring to more was the way people Express their infrastructure and and so on like you know your cloud formation or terraform or cdk and you know as referring to that and also like an observ and correct exactly so um that but but then there are variations and of course uh uh in terms of like you choosing you know there are a thousand different Q system you can choose or no SQL databases you can choose so so it sort of starts exploding from that point of view uh but also there's a certain um uh and and um you know there's still like a cache system behavior is consistent across whatever caching you're using uh or a database is fundamentally always a database that needs to do something so there there certain sort of aspects of things that you can take advantage of um uh uh but uh but yeah um purely speaking if you zoom out from a architecture standard practice uh well that that that varies for sure right like we have customers who have good old school Mainframe sort of uh things uh uh versus to all the way to like modern like really proliferated like microservices architecture uh uh those result in different problems and so you have this kind of collection of agents that you're running you know across many customers like um um I can only assume that like you it's kind of groundup built you're not using some off-the-shelf agent framework you know those aren't there yet and you'd want to control all the the pieces like I mean we started building before these Frameworks existed so we sort of gotten used maybe an interesting question to ask would be like if you were to take what you had and build a framework based on what you know like what would you be thinking about and how does that differ from what you see out in the framework landscape we're very specialized for what we do with perhaps like I mean one of the maybe the emerging patterns that that sort of I see is um again tying back to software engineering architectures and uh there two aspects to that one is um defining clear roles and boundaries and interfaces so to me like the way I think about when you build like really scalable systems is API interfaces are well defined uh you know the request response um you know the faulty behaviors so I think that part is what's lacking today in most agentic sort of workflows or orchestration systems and and that's what we said is like no we got to get this right because uh it's not going to be like it can't be like out of 10 runs we S one magic it has to work nine out of 10 times right like or want to get 10 to 10 out of 10 but like that's at least a start we cannot be one 
out of 10 um so it was really that effort that we put in into okay what are the fundamental pieces so we need like agents are very well defined input output structure right they'll always reliably give you the output or uh saying well I failed and then you know how to retry so the system sort of ends up becoming just like you're calling a library function uh Etc so the developer like who's using to build these workflows in our system know exactly the behavior to expect and underneath we did all the hard work in terms of and this is why fine tune tuning is so important and a strong proponent of that is now you get to control like your failure chances and cases go uh uh really low because you fine-tuned the model um so that's like and ju just to make sure understand that is are you saying that I I can imagine you you're saying several different things one is that um part of the reason why you have so many failures when you're calling generic llms is that they don't understand you know sufficiently or like aren't trained to answer the question that you're asking and sometimes it just takes a weird path and so if you find tune sufficiently you get an answer more consistently another interpretation could be that part of the fine-tuning you're talking about is more like controllability steerability rhf style as opposed to you know knowledge training and you are training for um response characteristics more so than knowledge production it's it's got to be all of the above right like when you think about an llm you're asking so first thing is does it know the answer right like does it is it even domain like so the domain knowledge needs to exist so that's the first part the second is now you're prompt you could ask the same thing a million different ways are you going to get it like the right answer and I think the agentic today uh in open source or even like proprietary llm flow that people build uh it's like the prompt Evolution right like you're trying to ask the same thing in a different way until you actually get the answer you want uh and and you take that out of the equation by fine-tuning because I can ask the question in exactly the same way uh and I know that this is this is the question that I'm supposed to ask so you sort of take that Dimension out of the equation and the final part is the response structure like now you don't need to say hey please always like uh please please please give me J format exactly or cats going to die exactly so now now now you're free from uh you know uh Co blackmailing yeah yeah uh the llm because the response structure the llm always know this is what I expect or not the llm but the particular task right which is why the specialization part is is is is what we do this doesn't come easy right like this comes with like does that you said not the llm but the task like you know without the resources to tr to you know or you know the the the time to like fine-tune an LM for a given task like one of the things that I've done is just like retry this 10 times and like hope that you know like are you doing that also or does fine-tuning solve all your problems and you're not doing that we we we do have like it's never perfect like you have to have fallbacks um but we don't need to do 10 tries a couple of fallbacks are sufficient uh uh yeah uh but I would say the good thing is I think our hit rate is probably uh 99% or greater in terms of here you getting it right on the first sort of attempt um yeah that's the that's how like that's a reality right like because 
when you're training hundreds of different tasks and really large llms uh they're always going to you know uh you can't expect like uh them to be always correct and I think it's sort of like you have to take the old distributed like you know say the early two th 2000s where you know there was a shift in thinking about distributed systems and building and saying no no I'm going to assume that everything is going to break all the time uh and I'm going to design for it and and that's how we're you know we have the Googles and Amazon of the world working wonderfully and I think yeah I your mention of chaos lab or chaos something or other earlier gy yeah yeah yeah chaos gym speaking to that chaos monkey type of learning correct uh so you mentioned uh hundreds of tasks like is there another imp application there that you've kind of learned that um a best practice is is to be very specific and fine-tuning and llm like on a a microtask as opposed to you know trying to fine-tune a single LM to do a bunch of different things um well uh we started out with like one llm multiple tasks and and then we sort of like as the mixture of experts uh things sort of started to gain more prominence and also uh different sort of uh models have different strengths uh and and we sort of uh said hey um you know we got to use what's best out there rather than and fine tune on top rather than like sort of uh relying on one thing and the good thing is we took that early call where we said look models are going to get better architectures are going to CH come and go and change so from an agent perspective like the entire system as we swap our mixture of export to even something else tomorrow nothing else needs to change of the stack like all of the things are going to work because the interfaces are well defined when you say mixture of exer experts in your case are you saying that uh you know very specifically like an end to end train Moe architecture or colloquially we've got a bunch of agents and like you know we do things like uh uh put send the task to the best agent for or a bunch of for now it's the farmer it's the farmer on on like an end to end uh end to end Moe uh but um again different tasks go to different parts of theoe uh right so so there is a there's a routing layer that actually uh goes in goes to the best sort of uh place um uh and and what I was what I meant was uh from an agent like let's say let's take the example of a agent um you know summarizing a log that that is seen now that sort of agent doesn't need to uh know the underlying sort of architecture in interface like the interface like the input output is well defined so let's say we takee out and llama 5 comes along and that's like the best thing out there and we can just swap that architecture none of the other parts of our system need to change uh uh because they have no knowledge of anything specific uh in the abstraction layer below assuming llama 5 is good at all the things that thee does I mean the good thing is we have training data right like so uh we we'll we'll fine tune if if if that model the base model is that good we can always fine tune on that and then it gets the knowledge and and more importantly I I I like to say we've curated a really good test set to be able to say what models is good or not should we graduate or not and and this is sort of a a saying that I've had for a good part of my career is like the training set never matters and I think people obsess over training set I'm like no no no you should obsess over the test 
set because if you know the ass Set uh is really good and representative of what you want it to be then you know it works or not the training set is just a matter of I'm not arguing for training set to be bad I'm just saying the obsession should be more on the test Set uh uh because then you can sort of quantify and know am I making a choice on X or Y and and is it informed or not so you've got these tasks specific models kind of baked into thise and you um are running you know these this kind of multi-tiered agentic like pattern or architecture like um what are some of the real world realities of like trying to run a a gentic system at scale like what do you run into yeah um I think uh we had to learn our lessons a lot on patterns of um uh you know how granular do you break the task and you know uh like if you go to granular then you end up with too many calls if you go too broad then you're asking the llm to do five things in a single call and and and and we sort of trial and error in terms of where and and I think the closest analogy I can sort of give to this is uh from The Good Old World of rdbms versus no SQL systems so you can think about like the rdb system where you sort of uh normalize the data uh and and and it has all the knowledge that you want and so all you need to do is issue this query and then it gives you the answer right now yes or you know I don't know I'm I'm I I I used to be good at SQL maybe 100 times in my life and I keep forgetting right like it's and it's analogous to a prompt right like where once you get that right like magically the llm mag answer right right like you get that answer and then uh that's sort of like one approach and then the other extreme is sort of the nosql approach where you denormalize the data you sort of bring the necessary parts and then you have you control the compute layer to stitch and then finally sort of serve the answer and and and and you know it's it's a it's a hard choice and and and I feel like the answer is somewhere in between uh uh and and you need to know which task is and again it comes down to which task is the llm uh inherently capable of doing it at a at a more broader level versus what needs to be what's a hard problem it needs to be chunked enough that you now need to sort of make few calls and then sort of uh see what what it makes sense together uh so I think that's the that's sort of the analogy I would say like uh in a practical sense that I think people have to understand is like hey what is the llm really good at right so I I can use the rdbms pattern and what is where I'm not going to get the answer I know I have to break it down and it needs to be more like a nosql pattern so I think really sort of thinking through uh these two patterns together I think helped us uh um uh because we we did end up going the latter way which is like we went with the whole task breakdown we ended up being two nosql and and what that means is too many calls uh more chances of failure more chances of retry and and the answers being slow and then we're like okay that's not quite the right pattern uh we know it's useful it has a play is we now need to start and uh sort of aggregating uh so I think uh maybe 70 30 sort of split between the two types of uh patterns that I see in the code Base today is there a similar um tradeoff when planning uh for Tool use like you know strikes me to you're you're defining apis they can be you know granular or uh you know you know Co or or broad like do you think about the same things yeah uh uh you 
bang on there like I mean this is again very similar to software engineering where people are like oh don't um what's the right pattern right like some people like say you know your function should not be larger than 50 lines of uh code and you got to break down and then suddenly like now you're like as you're looking through the Cod base you're like clicking your ID through each to know exact then you end up with this you know too much abstraction and and and suddenly is it's unreadable versus having like a thousand line function that does everything so I think the answer is always somewhere in between and I think uh uh with tool use and function call like you just have to be like all right what is the right sort of um uh way and it's again um software engineering practices it's what's most testable uh is this going to be reused in other parts of the code base and so on so that sort of defines how fragmented you go versus like how much you pack into a single sort of function call or API call and you mentioned test set um but it it strikes me that you know it's more than just a test set it's like having a really solid endtoend evaluation system so that you can easily kick those off and like um you know evaluate all of these it sounds like you're evaluating all of these decisions you know relative to one another very much so yeah I used test that in a more uh canonical way but in reality when you have a goal oriented system uh you got a yes the individual pieces are good but they it's again think of it as an integration test right right like you can write all the unit tests you want but it has to integrate uh together for your cicdc to work uh and we take the same approach and we use now we use the chaos gym in terms of more of a integration test environment where we're bringing up applications in different languages different architectures different clouds different orchestration layers you know kubernetes ECS and other you know General VM architectures and then sort of running through okay does the system sort of work end to end uh does it give you what you expect uh so that's sort of our integration pipeline uh by using our chaos gym um uh to sort of certify well all right uh everything works as expected um and I think I think that's more important like if if I sort of zoom out in terms of uh lessons uh to other agentic sort of Frameworks and so on is like really focusing on uh both like the unit test and integration test sort of mentality in terms of yes you need to check like the individual steps but it's really important to see how the goal that that that you're trying to do is uh is sort of affected and I'll bring back my RL hat again and and the the difficulty in a lot of these systems is you don't get sufficient rewards there's a whole notion of sparse rewards in reinforcement learning which leads to not being able to train is like you don't get like enough rewards through your pipeline to sort of improve and I think um you know uh as an agent is trying to book like you know your next flight uh how do you know well the first search results are sufficiently uh you know uh enough for you to uh proceed and so on so I think bringing that sort of discipline and into the system does help uh you sort of focus and uh allows you to uh make an educated call on where in the system is the weakness and does this really work you talked a little bit about the um I'm trying to remember the the three layers there's actor director what's the lowest level agent that's your agent yeah so you talked 
about this three- layer system you talked about the agent you talked about the actor I don't think you spent much time on the director um are for the kinds of tasks that you're targeting are uh llm reasoning based systems sufficient are they you know optimal do you also incorporate more traditional rules based you know ristics decision trees that kind of thing like I think at the at the planning layer I would say uh it's you know reasoning like llms again like you know you have to take uh a more practical approach so I like to say um you know think about like if you sort of distill llm or any agentic workflow think about it as like a really amazing or Labyrinth of IFL statements right like at whatever you can decompose into you can decompose into that it's humanly not possible so the way I see that is like a laborant of opaque FL statements correct like and and and and the llms are sort of taking these pieces and and S like replacing a lot of these statements so they're making the IFL statement sort of go away and and you know finally there will be a time when the IFL sort of uh disappears and I think especially with with with some some of these orchestration and so on um some of the playbooks and so on like you can rely on what humans have done before and these are sort of playbooks and stuff that uh so we are able to train on existing data that that that users have so you can sort of mimic that to a a good extent um what is hard is uh and generally like these are more or less very broad statements it's like if IT director ends up being simplistic in the sense of go look there go look here if uh well if you don't have data traces then do this to sort of get that uh so the uh Direct ends up being uh sort of like this canonical or the more common example of booking a flight right like booking a flight traditionally has this well- defined 10 steps but there are nuances in those steps and those nuances can be sort of learned uh by training on uh uh on data but if you really distill them out then they look almost like pretty common well-known steps right uh so that's what I mean uh by the whole if else Labyrinth and lolms are replacing because the Nuance of this one step of determining well should I be looking at reddis or should I be looking at Kafka or something else that's the hard part and and uh uh just saying I need to look at three places is a is more of an easy thing so the director ends up not being a very complex piece of software uh it's the latter two that actually um uh become like uh the real sort of workhorses in the system it sounds like what you're saying makes the director part work is fine-tuning on a lot of task specific data and just really solid prompt engineering correct yeah yeah it's a patterns of like how would you debug a cash issue debug a database you know locking issue and so on so it's very specific in in that sense um so that uh that plan more or less you know uh can be generated pretty easily uh but it's the other part of like hey is reddis broken or Kafka broken like that's the that's the hard reasoning part that has pushed down to uh the uh actor layer and and just the fine-tuning aspect May or may not Sol that and are you you talked about unit tests versus system tests like are do you unit test the director individually in a sense of like you know you just saw this um you know given what you know how would you resolve that yeah exactly like well generate me a plan to debug database locking issue right like we we just we know what that sort of should look 
One of the emerging patterns is defining clear roles, boundaries, and interfaces. That part is what's lacking today in most agentic workflows or orchestration systems, and that's what we said: we've got to get this right, because it can't be that out of ten runs we see one magic run. It has to work nine out of ten times, or ideally ten out of ten, but that's at least a start; we cannot be one out of ten. So it was really that effort we put in: okay, what are the fundamental pieces?

All right everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington, and today I'm joined by Sunil Mallya. Sunil is CTO and co-founder of Flip AI. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Sunil, welcome to the podcast. Thanks Sam, good to see you again after many years. Good to see you, for sure, it has been a while. There are going to be a few folks listening who were at our TWIMLcon event back in 2019, and they will remember that you put on an amazing DeepRacer demo and contest for us at the conference. That was a lot of fun, and you did that because you were on the DeepRacer team at AWS. Tell us a little bit about what you've been up to since then. DeepRacer was a crazy ride and still seems to be going strong; it's amazing that they continue to host contests around the world, and apparently around 120,000 people participated last year. That's crazy. Since then I ventured into NLP, which was gaining a lot of traction as fine-tuning and foundation building were emerging with ULMFiT and BERT coming into the picture.
That got me thinking, hey, this is a rocket ship that's building up. And I was so wrong: it was more like a Voyager, an outer-space mission that nobody could have predicted. So it was a lucky break to be on that ship, I would say. And you did that by switching over from DeepRacer and reinforcement learning to the Comprehend team at AWS. Yeah, I took over the Comprehend team and then eventually laid the foundation for what's now Bedrock. It was quite a crazy journey. BERT was considered an LLM in our world back then; the term didn't exist, but you still had to put all that compute into training it, and now you can probably train BERT on your laptop with five minutes of compute time.

That's awesome. And since then you've gone on to co-found a company. What's Flip AI up to? Flip AI, we are observability AI. The thesis is that the majority of our team has been building software at scale, maintaining five nines of availability, and one of the challenges was always the operations part: how do you run a service that's always up? That contributes to hair loss, loss of sleep, and many other side effects. So we said, this is a pain we know really well. Coding is the obvious application for LLMs, but I like to say developers love coding; it's what comes after it that they hate. So let's go after that and solve it. That's the genesis of the company.

I always need to ask when I hear AI observability: do you think of yourself as primarily observability for AI, or AI for observability? The whole MLOps and AIOps space is convoluted and there isn't a standard way to describe it, but we're basically taking the pain from the developers, which means taking all the observability data, the metrics, traces, logs, and events, and making meaning out of it. So when you get that page that something is broken, we tell you exactly what is broken and why it is broken. That's the definition I would give. Maybe another way to ask the question is, what's the target developer profile? Are you going after someone who would be using a Weights & Biases, or someone who would be using a Splunk or a Honeycomb? It's the latter, so it's applying AI to the traditional DevOps and IT observability problem. Very cool. And how do you see your unique approach to that, given that the Honeycombs and Splunks and Datadogs of the world are already out there? Very true. What's interesting is that the majority of companies don't use a single tool, so your data is spread across different tools; they end up using all the names you described. Ultimately, when something is broken, people have to go look at all of these sources and stitch that story together from little breadcrumbs, and that's a pretty tedious process. What we are able to do is be the intelligence layer on top of all of these tools: do the querying for you, do the data wrangling and the understanding and reasoning, okay, this is broken because these two other things are broken and they're putting pressure on it.
Typically, when something breaks, it's not necessarily the service you're debugging; it's something five levels downstream that is broken, and that's extremely hard to find even with the existing tools. That's what we make simple. On top of that, we built our own LLM from the ground up and trained it on about 100 billion tokens of DevOps data, so it's very domain specific. It doesn't need to know what Napoleon did or didn't do; it's very focused on just DevOps and really trying to solve that pain. And we're able to deploy in your VPC, on-prem, wherever; the entire Flip stack deploys there, which is really important for our customer profile of enterprises. You want all of the data governance, because this data can contain sensitive information, so we give our customers the necessary guardrails for controlling data flow and governance.

What does training data look like for a DevOps-focused LLM? Are you just throwing log files at an LLM, and what's the value of an LLM that can predict the next made-up number in a log file? We've done different kinds of training, but I like to say machines talk to each other with APIs and they express pain in logs, and that pain is essentially what we model. It's sort of English, but not really; it's its own language. Even the latest models and releases will still fumble on it; it's like Pig Latin, a half language, so you need to fundamentally understand that language to interpret it, which is why we train from the ground up. But it's also not just logs. There are metrics you're tracking, there are traces, which are graph data, and honestly there's code, because there are pieces of code in logs as exception stack traces when things blow up. So we coined the term COMELT: code plus MELT data, that is, metrics, events, logs, and traces. That's what we train on from the ground up; our training data spans all of these modalities, so to speak. A lot of this is available on the internet, but it's more generic, so we've taken an approach of curating data. I come from old-school ML, I've been doing ML for fifteen or sixteen years, and I still label data; it's a practice I haven't lost touch with. It's important to curate your data with expert labeling. So we use whatever is available on the internet for pre-training, for understanding the domain, and then we specialize using datasets that are highly curated and labeled by experts.
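To make the COMELT framing concrete, here is a minimal sketch of what a single multi-modal training record could look like. The field names and example values are invented for illustration and are not Flip AI's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ComeltSample:
    """One hypothetical training record combining the CO(de) + MELT modalities."""
    code: Optional[str] = None                                        # e.g. an exception stack trace
    metrics: list[tuple[float, float]] = field(default_factory=list)  # (timestamp, value) pairs
    events: list[str] = field(default_factory=list)                   # deploy markers, autoscaling events
    logs: list[str] = field(default_factory=list)                     # raw log lines
    traces: list[dict] = field(default_factory=list)                  # spans: {"service", "parent", "latency_ms"}
    label: Optional[str] = None                                       # expert-curated root cause, when available

sample = ComeltSample(
    code="java.sql.SQLTransientConnectionException: connection is not available",
    metrics=[(1714000000.0, 98.0), (1714000060.0, 512.0)],            # connection count spiking
    events=["deploy checkout-service v42"],
    logs=["ERROR HikariPool-1 - Connection is not available, request timed out after 30000ms"],
    traces=[{"service": "checkout", "parent": "gateway", "latency_ms": 3021}],
    label="db_connection_pool_exhaustion",
)
print(sample.label)
```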
I want you to dig into that more, because my first thought when you said training an LLM on this COMELT data was to try to think through how and why that would be valuable. I would think that doing what you'd want to do in the observability domain would require not unsupervised learning of the structure of a log file, but supervised learning: when this happens, that's this class of problem. So talk a little bit about how that comes together end to end. The first phase of training, the pre-training, is just regular pre-training: understanding the domain by predicting the next word or the masked word. But then, I jokingly said the pain is in the logs, but it's actually also in metrics and other places, and you have to look at both: a graph is showing certain data and the logs are showing something else, and they're often telling you slightly different sides of the story, but you've got to use both to complete that story. So we use joint training over this data to make continuous meaning out of it.

Another thing we do that's unique: I like to say pre-training is graduating from high school and supervised fine-tuning is graduating from college, but you can't take the smartest person who just graduated from college and have them handle production incidents; you need real-life scars. We induce that in our LLMs by putting them in a training gym, and this is my reinforcement learning background coming in. We bring up applications and use another LLM to break them: we simulate code injection or fault injection into the infrastructure, and because we know what we've created, we can check whether the model got it right, whether it predicted that that was the issue, and then use reinforcement learning to guide the model toward the right decisions. That's super important because there's only so much data you're going to find on the internet or can label; you need an automated way to scale your training. This is the chaos gym we've built for the models, to get them as close to zero-shot as possible.
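A bare-bones sketch of the chaos-gym loop described above: inject a known fault, ask the model for a root cause, and turn the match into a reward. The fault catalogue, the toy gym, and the toy model are placeholder assumptions rather than the real training code.

```python
import random

FAULTS = [
    "exhaust_db_connection_pool",
    "inject_500s_into_payment_api",
    "fill_disk_on_kafka_broker",
]

class ToyGym:
    """Stand-in for the real environment: injects a fault and emits matching telemetry."""
    def inject(self, fault: str) -> None:
        self.fault = fault
    def collect_telemetry(self) -> str:
        return f"logs+metrics consistent with {self.fault}"
    def rollback(self) -> None:
        self.fault = None

class ToyModel:
    """Stand-in for the LLM being trained; guesses a root cause from telemetry."""
    def diagnose(self, telemetry: str) -> str:
        return random.choice(FAULTS)

def run_episode(model: ToyModel, gym: ToyGym) -> tuple[str, str, float]:
    fault = random.choice(FAULTS)
    gym.inject(fault)                              # ground truth is known by construction
    telemetry = gym.collect_telemetry()
    predicted = model.diagnose(telemetry)          # the model proposes a root cause
    reward = 1.0 if predicted == fault else -0.1   # automatic reward, no human labels needed
    gym.rollback()
    return fault, predicted, reward

print(run_episode(ToyModel(), ToyGym()))
```

A policy-gradient or preference-style update would then consume these (telemetry, prediction, reward) tuples; the point is that the labels come from the gym itself rather than hand annotation.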
Architecturally, we also recognized that each modality needs to be treated differently. Code and logs can largely be handled the same way, but LLMs are really bad at time series; they just don't understand numbers, and when you look at RCAs and reports there's a lot of "this happened at this time, this number went up, this number went down." So we concluded you can't have a single model, and asked how to build a mixture of experts around that. We went with a hybrid approach: time series has to be its own component attached to the mixture of experts, so the time series part is not a traditional Transformer while the rest of the parts are Transformers. We ended up with what I'd call a multi-decoder approach, with different decoders over the data and its interpretations to get the most meaning out of it. It's taken a lot of experimentation over the last two and a half years to get to where we are.

Interesting. When you talk about integrating time series and LLMs: there are folks trying to do time series with Transformers, and depending on who you talk to the reports of success are either high or low, but it sounds like you're incorporating more traditional things, ARIMA and that kind of stuff? Not quite ARIMA, a lot more advanced, but yes, we use them as an input to inform the LLM. So you're using a collection of traditional models to give more meaningful input to the LLM rather than a raw time series. One of the fundamental problems I see with LLMs doing math, which is a really popular topic with lots of benchmarks, or "bench-marketing" as I like to say, is that LLMs don't understand numbers. Five is one number and fifty-four is another, a collective quantity, and that understanding doesn't really exist because the tokenization fundamentally doesn't capture it. I have a theory I call the sunk cost fallacy of LLMs: they've gotten so good and so close to breakthroughs with numbers that we're not willing to go back and fundamentally rethink it, to build a new tokenizer that understands numbers so we can do this the right way with the right building blocks, because we feel we're so close. Everybody just puts their head down and pushes harder: maybe if I add a period or an exclamation mark at the end of my prompt it gets everything correct. And time and again papers come out, even Apple had a recent one perturbing GSM8K, showing that the variance in results for every LLM is super widespread. So you have to rethink what LLMs are good at. We take the traditional models and convert their output into text, which is what LLMs are good at, something much closer to the dimensions an LLM can understand. That's one aspect. The other aspect is that when you deal with numbers, we already have tools that give you a finite answer, so why change your stack and add non-determinism? What's great is tool use, or function calling: a lot of what we do is have Flip generate its own DSL with the LLMs, and the DSL calls functions or tools to do the math. That gives us really good results in terms of not screwing up the numbers, because the final output is very number heavy: the database connections went up by this much, which put pressure on this tier, where you started seeing higher latency, which ultimately caused X to happen. That entire stitched story is a lot of numbers, and with the two approaches I mentioned we end up with very high accuracy there.
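The DSL itself isn't public, so the snippet below only illustrates the pattern Sunil describes: the model emits structured tool calls, and deterministic functions do the arithmetic. The mini-grammar, tool names, and numbers are made up.

```python
import json
import statistics

# Deterministic tools do the math; the LLM never produces raw arithmetic itself.
TOOLS = {
    "pct_change": lambda first, last: round(100.0 * (last - first) / first, 1),
    "mean": lambda *xs: statistics.fmean(xs),
}

# Imagine the fine-tuned model produced this instead of free-form prose with numbers in it:
llm_output = json.dumps([
    {"tool": "pct_change", "args": [98, 512], "bind": "conn_growth_pct"},
    {"tool": "mean", "args": [180, 220, 320], "bind": "avg_latency_ms"},
])

def execute(dsl_text: str) -> dict:
    """Run each tool call and return the bound results for the final report."""
    results = {}
    for step in json.loads(dsl_text):
        results[step["bind"]] = TOOLS[step["tool"]](*step["args"])
    return results

print(execute(llm_output))   # {'conn_growth_pct': 422.4, 'avg_latency_ms': 240.0}
```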
When you were speaking earlier about the relationship between logs and graphs and telling a story, you went on to talk about time series data, which is what underlies those graphs. I'm also wondering whether you have experimented with using VLMs, vision language models, to take the graphs themselves as data, and whether you see any promise in that. We experimented with that and got some decent results, but one of the challenges with VLMs is how you curate the data and the resolution of the images. As you zoom out, there suddenly appears to be a peak, when in reality you have to interpret whether it's only going from 0.1 to 0.15 or from 0.1 to 9. Focusing on that little piece of information is hard, and the signal is very sparse, unlike text, where there's more of a description. We didn't quite get it right. And there's always the promise of effectively infinite resolution when you work with the raw data, which is much easier and more reliable to operate on, so we paused the whole VLM approach. That said, I do think VLMs could be really interesting for finding visual patterns that are much harder to find otherwise, because a chart is a much higher-level, condensed representation of the data; when you operate on raw data you have way too many dimensions, which works against you when you want to find patterns. It's an interesting give and take between taking inspiration from how a human would do it and doing it the way computers do it best. A human would drown in the raw data; that's why we have charts, and charts help us drill in on what's actually happening, but maybe that's not the right approach for a computer. Bingo, that was certainly the intuition behind trying VLMs, and what we found was that for a small subset they were good, but as you start actually interpreting the data, the scales, and the way the timeline compresses and decompresses, it gets confusing.
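One way to read the "infinite resolution of raw data" point is that the series gets summarized into text before it ever reaches a prompt, rather than being rendered as an image. Below is a toy version of that idea, with an invented spike heuristic standing in for the more advanced models mentioned earlier.

```python
def describe_series(name: str, values: list[float], spike_factor: float = 3.0) -> str:
    """Turn a raw metric window into a short textual observation an LLM can reason over."""
    lo, hi = min(values), max(values)
    baseline = sorted(values)[len(values) // 2]          # crude median baseline for the sketch
    if baseline > 0 and hi / baseline >= spike_factor:
        shape = f"spikes to {hi:g} (~{hi / baseline:.1f}x its median of {baseline:g})"
    else:
        shape = f"stays in a narrow band between {lo:g} and {hi:g}"
    return f"{name} {shape} over the window."

print(describe_series("db_connections", [95, 98, 102, 97, 512, 498]))
print(describe_series("cpu_utilization", [0.10, 0.12, 0.15, 0.11]))
```

Note how the second series, which could look like a dramatic peak on a zoomed-out chart, is reported as the narrow 0.1 to 0.15 band the transcript mentions.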
You mentioned the DSL and tool use as part of the way the product operates. When I hear tool use I start to think of agentic behaviors and agentic workflows. Do you think of the system in an agentic way, and less in the marketing sense in which everybody calls everything an agent now? Absolutely, and I was calling these things agentic before; it's very much a reinforcement learning term for me. To me it's very goal oriented: ultimately, what we do is find the RCA, the root cause analysis of an incident. That's the goal, figuring out why this happened, and along the way we're collecting observations from the observability data, processing them, and letting that guide the next step we take. That's very much a reinforcement learning framing: agents changing state through their observations and getting rewards from the environment that guide them to the goal. So it was agentic in nature, but we didn't necessarily think of it as agents as they appear today. It was more: we have a bunch of tasks, we've got to break them down, and the hops between these tasks, the hand-holding between them, has to happen in a reliable way. So we came up with a framework. The lowest level of abstraction we call agents. The next level of abstraction is actors, which are cooperative agents: an actor takes a collection of agents and performs a task. And then you have the director, which is the orchestration layer that says, I'm going to assemble these actors to do something. In a practical sense, you might have an AWS actor or a Kafka actor; those are experts in those areas, and they have lower-level agents, like one that queries Splunk or one that goes and looks at some other piece of data. It's about building repeatable patterns and well-defined interfaces; I'd say we heavily applied software engineering principles to how these quote-unquote agents are built and how they operate together.
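A skeletal rendering of that agent / actor / director layering in plain Python; the class shapes and the Splunk-query agent are illustrative assumptions, not the production framework.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    ok: bool
    output: str = ""
    error: str = ""          # an agent reliably reports failure instead of raising

class Agent:
    """Lowest layer: one narrow capability, e.g. run a single query against one tool."""
    def run(self, request: str) -> AgentResult:
        raise NotImplementedError

class SplunkQueryAgent(Agent):
    def run(self, request: str) -> AgentResult:
        # real code would call the observability tool's API here; stubbed for the sketch
        return AgentResult(ok=True, output=f"results for: {request}")

class Actor:
    """Middle layer: a domain expert (an 'AWS actor', a 'Kafka actor') composing agents."""
    def __init__(self, agents: list[Agent]):
        self.agents = agents
    def perform(self, task: str) -> list[AgentResult]:
        return [agent.run(task) for agent in self.agents]

class Director:
    """Top layer: assembles actors against the goal and revises the plan as evidence arrives."""
    def __init__(self, actors: dict[str, Actor]):
        self.actors = actors
    def investigate(self, hypothesis: str) -> list[AgentResult]:
        return self.actors[hypothesis].perform(f"check {hypothesis}")

director = Director({"kafka": Actor([SplunkQueryAgent()])})
print(director.investigate("kafka"))
```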
And is the director more the infrastructure-oriented piece, or is it more of a planner? It's more of a planner, because you're going to discover information along the way, so you need to backtrack and perhaps change your hypothesis, or explore a different hypothesis about what went wrong. You need something that orchestrates: it doesn't look like AWS had an issue, maybe we should look at Redis or Kafka or some other piece of your infrastructure. I guess where I was going with infrastructure was: you're describing this collaborative multi-agent thing, and part of the question that jumped out for me is whether those things all run in one process or on one box, or in different places, and what tells what to run where. Is that even an interesting and important factor here? They do run in different processes, because you want to parallelize the execution, and they run almost asynchronously: while you're fetching data, we're doing some other computation, and so on. So it's a quote-unquote DAG-like system. Practically speaking, DAG generation and execution is hard, so we flattened the DAG, in the sense that it's okay to redo certain things as you go, because if you keep all of that memory state, you introduce concurrency and other patterns that slow you down or can induce more bugs. So again we simplify: rather than complete DAG execution, if you need to revisit something, you just revisit it. All of your steps are idempotent, so you just run them again. Exactly. It's okay that you operated on the data, because you made a copy of it; think of it as copy-on-write. I take the data, and if I modify it, I now have another copy. If you instead keep operating on and modifying the same copy, now you have to coordinate everybody, with so many asynchronous parts wanting access. That would work in a nice single-threaded environment, but with multiprocessing it just introduces complexity you don't need. So we keep the software engineering part simple and leave the complexity to the agent and actor orchestration itself.
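The redo-it-again philosophy might look roughly like this: idempotent steps over copy-on-write data, so re-running a step after a flattened-DAG backtrack never requires coordinating shared state. Purely a sketch.

```python
from copy import deepcopy

def enrich_with_max_latency(observations: dict) -> dict:
    """Idempotent step: works on its own copy, so re-running it after a backtrack is always safe."""
    local = deepcopy(observations)       # copy-on-write style: never mutate the shared input
    spans = local.get("spans", [])
    local["max_latency_ms"] = max(s["latency_ms"] for s in spans) if spans else None
    return local

shared = {"spans": [{"latency_ms": 180}, {"latency_ms": 3021}]}
first = enrich_with_max_latency(shared)
second = enrich_with_max_latency(shared)          # redoing the step: same input, same answer
assert first == second and "max_latency_ms" not in shared
print(first["max_latency_ms"])                    # 3021
```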
Taking a step back to this RCA you're trying to identify: historically, tools like Splunk, and those before Splunk, try to do statistical correlation, and that's a very different way of looking at the problem than an LLM and a reasoning system. How do you connect those ideas? Statistics give you hints, but they don't give you the causal nature of things. What was important as we built this was to mimic the human flow. I've been fortunate to build many different ML systems, and the simplest framework I can ever come up with is: how does a human do it, replicate that, and then maybe the machine can do it differently and better. Think about a war room where people are solving an incident. You have a bunch of experts chiming in on what they see and what they don't, and a core group of people determining how all of that data stitches together: hey, it's unusual that your service is having this issue, I don't think it should, and then suddenly you realize there's a shared component that both of you use, and that becomes the culprit. That doesn't show up with basic statistics; during an incident, basic statistics will show that everything is broken, so everything must be wrong, which is not true. You need the causal connections: you are inflicting pain on me, but it's not you, it's somebody else further down the chain who is inflicting the pain. That's why our approach is turning out to be superior to the traditional ones.

Continuing with that war room analogy: individual teams see anomalies or events popping up in their systems and bring them to this core group, and one thing that core group has is the context in their heads of what's essentially the runbook, how the different things connect. Does your system need visibility into that runbook, and if so, do I have to define it? Do you have to be a runbook system? Or are we at a point where, for given domains like an AWS app or a Kubernetes app, it's all software-defined architecture that you can infer? Very much the latter: we automatically generate that runbook, because these patterns are known and we've been able to generalize with all the training. If you have a very nuanced system that isn't a common pattern, we probably need tuning, but for the vast majority of cases our runbook generation suffices, and we're able to do it zero shot for the majority of our customers. That wasn't the case a year ago; now it very much is. Is it zero shot just from observation, or does it include slurping up some configuration that can be found somewhere? Both, actually. What helps is that a lot of these observability systems have a standard way of defining things, like how you get the service maps, so you can build on top of existing structure; and on AWS there may be a thousand ways of doing things, but it's a finite thousand ways. I was mostly trying to get at whether the system needs, or takes advantage of, some definition of how services are interconnected to form the landscape, and it sounds like the answer is yes. Yes, and we're good at reading CloudFormation or Terraform, which gives you the definition, so you can use that as base knowledge while you operate. It's good that a lot of things are standardized and that effort is being put in; it's just extremely complex for people to keep it all in their heads.

You alluded to the set of architectures being somewhat standardized nowadays. How standardized are things? I would expect quite a lot of variety in how people choose to deploy, but maybe if you just say we support Kubernetes, or we support AWS, or you have to have Honeycomb, that simplifies the scope. The standardization I was referring to was more the way people express their infrastructure, your CloudFormation or Terraform or CDK, and the observability tooling as well. Exactly. Beyond that there are of course variations: there are a thousand different queue systems or NoSQL databases you can choose, so it starts exploding from that point of view. At the same time, a cache behaves consistently whatever caching you're using, and a database is fundamentally always a database that needs to do certain things, so there are aspects you can take advantage of. But if you zoom out to architectural practice, that varies for sure: we have customers with good old-school mainframe setups all the way to modern, highly proliferated microservices architectures, and those result in different problems.
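As a hypothetical illustration of using infrastructure-as-code as base knowledge, here is a toy inversion of declared dependencies into a "who feels the pain" map. The resources dict stands in for already-parsed Terraform or CloudFormation output; real parsing of HCL, modules, and intrinsic functions is far more involved.

```python
from collections import defaultdict

# Stand-in for parsed infrastructure-as-code: each resource declares what it depends on.
resources = {
    "checkout_service": {"depends_on": ["orders_db", "payment_queue"]},
    "payment_worker":   {"depends_on": ["payment_queue", "orders_db"]},
    "orders_db":        {"depends_on": []},
    "payment_queue":    {"depends_on": []},
}

def build_blast_radius(resources: dict) -> dict[str, set[str]]:
    """Invert the declared dependencies: which services feel pain when X breaks?"""
    impacted = defaultdict(set)
    for service, spec in resources.items():
        for dependency in spec["depends_on"]:
            impacted[dependency].add(service)
    return impacted

print(build_blast_radius(resources)["orders_db"])
# where to look when the database degrades: {'checkout_service', 'payment_worker'}
```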
And so you have this collection of agents that you're running across many customers. I can only assume it's built from the ground up; you're not using some off-the-shelf agent framework, since those weren't there yet and you'd want to control all the pieces. We started building before these frameworks existed, so we got used to doing it ourselves. Maybe the interesting question to ask is: if you were to take what you have and build a framework based on what you now know, what would you be thinking about, and how does that differ from what you see in the framework landscape? We're very specialized for what we do, but one of the emerging patterns I see, again tying back to software engineering architecture, has two aspects. The first is defining clear roles, boundaries, and interfaces. The way I think about building really scalable systems is that the API interfaces are well defined: the request, the response, the faulty behaviors. That part is what's lacking today in most agentic workflows or orchestration systems, and that's what we said we had to get right, because it can't be that out of ten runs we see one magic run. It has to work nine out of ten times, ideally ten out of ten, but nine is at least a start; we cannot be one out of ten. So that's the effort we put into the fundamental pieces. Our agents have a very well-defined input and output structure: they will reliably give you the output, or tell you "I failed" so that you know how to retry. The system ends up feeling like calling a library function, so the developer using it to build workflows knows exactly what behavior to expect, and underneath we did all the hard work. This is why fine-tuning is so important, and why I'm such a strong proponent of it: you get to control your failure rates, and the failure cases go really low because you fine-tuned the model.
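A sketch of that "agents behave like library calls" contract: a declared request and response schema, an explicit failure signal, and a bounded retry. The task, names, and failure rate are invented for the example.

```python
from dataclasses import dataclass
import random

@dataclass
class SummarizeLogRequest:
    log_lines: list[str]

@dataclass
class SummarizeLogResponse:
    ok: bool
    summary: str = ""
    failure_reason: str = ""     # declared, machine-readable failure instead of a stack trace

def summarize_log(request: SummarizeLogRequest) -> SummarizeLogResponse:
    # stand-in for a fine-tuned model call; fails loudly 10% of the time to exercise retries
    if random.random() < 0.1:
        return SummarizeLogResponse(ok=False, failure_reason="malformed_model_output")
    return SummarizeLogResponse(ok=True, summary=f"{len(request.log_lines)} lines, 1 error burst")

def call_with_retry(request: SummarizeLogRequest, attempts: int = 2) -> SummarizeLogResponse:
    """Because failure is part of the contract, the caller's retry policy stays trivial."""
    response = SummarizeLogResponse(ok=False, failure_reason="not_attempted")
    for _ in range(attempts):
        response = summarize_log(request)
        if response.ok:
            break
    return response

print(call_with_retry(SummarizeLogRequest(["ERROR pool exhausted", "WARN retrying"])))
```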
Just to make sure I understand, I can imagine you're saying several different things. One is that part of the reason you see so many failures when calling generic LLMs is that they don't understand sufficiently, or aren't trained to answer the question you're asking, and sometimes just take a weird path, so if you fine-tune sufficiently you get an answer more consistently. Another interpretation is that part of the fine-tuning you're talking about is more about controllability and steerability, RLHF style, as opposed to knowledge training: you're training for response characteristics more than knowledge production. It's got to be all of the above. When you think about an LLM, the first question is whether it even knows the answer: does the domain knowledge exist? The second is the prompt: you could ask the same thing a million different ways; are you going to get the right answer? In the agentic flows people build today, on open-source or even proprietary LLMs, it's prompt evolution: you keep asking the same thing in different ways until you get the answer you want. You take that out of the equation by fine-tuning, because I can ask the question in exactly the same way and know it's the question I'm supposed to ask, so that dimension disappears. The final part is the response structure: you don't need to say "please, please give me JSON format or the cat is going to die." Exactly, you're free from blackmailing the LLM, because the response structure is always known, not by the LLM but by the particular task, which is the specialization we do. And that doesn't come easy.

You said "not the LLM but the task." Without the resources or the time to fine-tune an LLM for a given task, one of the things I've done is just retry ten times and hope. Are you doing that also, or does fine-tuning solve all your problems so you're not doing that? It's never perfect; you have to have fallbacks. But we don't need ten tries; a couple of fallbacks are sufficient, and our hit rate is probably 99% or greater in terms of getting it right on the first attempt. That's just reality: when you're training hundreds of different tasks, even really large LLMs are never going to be always correct. You have to take the old distributed systems mindset from the early 2000s, when there was a shift toward assuming that everything is going to break all the time and designing for it, and that's how we got the Googles and Amazons of the world working wonderfully. Your mention of chaos lab, or chaos something-or-other, earlier... Chaos gym, yeah. ...speaks to that chaos monkey type of learning. Correct.
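A minimal sketch of the couple-of-fallbacks idea, under the assumption that each candidate in the chain either answers or declares failure; the three stand-in callables are not real models.

```python
from typing import Callable, Optional

def finetuned_small_model(prompt: str) -> Optional[str]:
    return "root cause: connection pool exhaustion"      # usually answers on the first attempt

def generalist_model(prompt: str) -> Optional[str]:
    return "root cause: unknown, escalate to on-call"    # slower path, used only as a fallback

def rule_based_path(prompt: str) -> Optional[str]:
    return "collected evidence attached; no automated diagnosis"

FALLBACK_CHAIN: list[Callable[[str], Optional[str]]] = [
    finetuned_small_model,
    generalist_model,
    rule_based_path,
]

def diagnose(prompt: str) -> str:
    for candidate in FALLBACK_CHAIN:
        answer = candidate(prompt)
        if answer:                         # a None/empty answer counts as a declared failure
            return answer
    return "no answer produced"

print(diagnose("db latency spiked at 14:02"))
```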
You mentioned hundreds of tasks. Is there another implication there that you've learned, that a best practice is to be very specific and fine-tune an LLM on a micro-task, as opposed to trying to fine-tune a single LLM to do a bunch of different things? We started out with one LLM and multiple tasks, and then, as mixture of experts started to gain prominence, and because different models have different strengths, we said we've got to use what's best out there and fine-tune on top, rather than relying on one thing. The good thing is we took that early call: models are going to get better, and architectures are going to come and go and change. So even if we swap our mixture of experts for something else tomorrow, nothing else in the stack needs to change, because the interfaces are well defined.

When you say mixture of experts, in your case do you mean very specifically an end-to-end trained MoE architecture, or colloquially that you've got a bunch of agents and you send each task to the best agent? For now it's the former, an end-to-end MoE, but different tasks go to different parts of the MoE; there's a routing layer that sends each task to the best place. What I meant from the agent perspective is this: take the example of an agent summarizing a log it has seen. That agent doesn't need to know the underlying architecture, because the interface, the input and output, is well defined. So if we take the MoE out and Llama 5 comes along and is the best thing out there, we can swap in that architecture and none of the other parts of our system need to change, because they have no knowledge of anything specific in the abstraction layer below. Assuming Llama 5 is good at all the things the MoE does. The good thing is we have the training data, so if that base model is good enough we can always fine-tune on it and it gets the knowledge. More importantly, we've curated a really good test set to be able to say whether a model is good or not and whether it should graduate. A saying I've had for a good part of my career is that the training set never matters. People obsess over the training set; I say the obsession should be on the test set, because if your test set is really good and representative of what you want, then you know whether something works or not. I'm not arguing for the training set to be bad; I'm just saying that if you obsess over the test set, you can quantify things and know whether the choice you're making between X and Y is informed or not.
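The "obsess over the test set" point, rendered as a tiny graduation gate: a candidate base model only replaces the incumbent if it clears the curated evaluation by some margin. Test cases, thresholds, and the stand-in models are invented.

```python
# Toy "graduation" gate: all tasks, labels, and numbers here are invented for illustration.
CURATED_TEST_SET = [
    {"input": "HikariPool timeout burst",    "expected": "db_connection_pool_exhaustion"},
    {"input": "broker disk at 100%",         "expected": "kafka_disk_full"},
    {"input": "p99 latency 8x after deploy", "expected": "bad_deploy_regression"},
]

def accuracy(model, test_set) -> float:
    return sum(model(case["input"]) == case["expected"] for case in test_set) / len(test_set)

def should_graduate(candidate, incumbent, test_set, margin: float = 0.02) -> bool:
    """Swap models only when the candidate clearly beats the incumbent on the held-out set."""
    return accuracy(candidate, test_set) >= accuracy(incumbent, test_set) + margin

def incumbent(text: str) -> str:          # stand-in for the current fine-tuned expert
    return "db_connection_pool_exhaustion"

def candidate(text: str) -> str:          # stand-in for a newly fine-tuned base model
    return "kafka_disk_full" if "disk" in text else "db_connection_pool_exhaustion"

print(should_graduate(candidate, incumbent, CURATED_TEST_SET))   # True: 2/3 beats 1/3 plus margin
```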
So you've got these task-specific models baked into the MoE, and you're running this multi-tiered agentic pattern or architecture. What are some of the real-world realities of trying to run an agentic system at scale? What do you run into? We had to learn a lot of lessons about how granular to break down a task. If you go too granular, you end up with too many calls; if you go too broad, you're asking the LLM to do five things in a single call, and we found the balance by trial and error. The closest analogy I can give is from the good old world of RDBMS versus NoSQL systems. In the RDBMS approach you normalize the data, it holds all the knowledge you want, and all you need to do is issue the right query and it gives you the answer; I've been good at SQL maybe a hundred times in my life and I keep forgetting it, and it's analogous to a prompt: once you get it right, the LLM magically answers. That's one approach. The other extreme is the NoSQL approach, where you denormalize the data, bring in the necessary parts, and control the compute layer to stitch things together and finally serve the answer. It's a hard choice, and I feel the answer is somewhere in between. You need to know, task by task, what the LLM is inherently capable of doing at a broader level, versus what's a hard problem that needs to be chunked so that you make a few calls and then assemble what makes sense. That's the practical lesson: understand what the LLM is really good at, where you can use the RDBMS pattern, and where you're not going to get the answer, so you have to break it down into more of a NoSQL pattern. Thinking through those two patterns together helped us, because we initially went the latter way, the full task breakdown. We ended up being too NoSQL, and what that means is too many calls, more chances of failure, more retries, and slow answers. So we realized that's not quite the right pattern: it's useful and has its place, but we needed to start aggregating. Today it's maybe a 70/30 split between the two types of patterns in the code base.

Is there a similar trade-off when planning for tool use? You're defining APIs that can be granular or broad; do you think about the same things? Bang on. It's again very similar to software engineering, where people argue about the right pattern: some say your function should never be larger than 50 lines and you've got to break it down, and then suddenly, as you read the code base, you're clicking through your IDE into every little piece, there's too much abstraction, and it's unreadable; versus having a thousand-line function that does everything. The answer is always somewhere in between. With tool use and function calling you have to ask what the right decomposition is, and again it's software engineering practice: what's most testable, whether it will be reused in other parts of the code base, and so on. That defines how fragmented you go versus how much you pack into a single function or API call.

You mentioned the test set, but it strikes me that it's more than just a test set; it's having a really solid end-to-end evaluation system, so you can easily kick off evaluations of all of these decisions relative to one another. Very much so. I used "test set" in the canonical way, but in reality, when you have a goal-oriented system, the individual pieces being good isn't enough; think of it as an integration test. You can write all the unit tests you want, but it has to integrate together for your CI/CD to work. We take the same approach, and we now use the chaos gym as an integration test environment: we bring up applications in different languages, different architectures, different clouds, and different orchestration layers, Kubernetes, ECS, and other general VM setups, and then run through whether the system works end to end and gives you what you expect. That's our integration pipeline, using the chaos gym to certify that everything works as expected.
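A sketch of the chaos gym used as an integration-test matrix: enumerate environment permutations, inject a known fault, and check that the end-to-end pipeline lands on it. The matrix values and the pipeline stub are assumptions for illustration.

```python
from itertools import product

LANGUAGES = ["java", "python", "go"]
CLOUDS = ["aws", "gcp"]
ORCHESTRATORS = ["kubernetes", "ecs", "vms"]

def run_pipeline(language: str, cloud: str, orchestrator: str, injected_fault: str) -> str:
    # stand-in for: deploy the sample app, inject the fault, run the full agentic RCA flow
    return injected_fault

def integration_suite(injected_fault: str = "db_connection_pool_exhaustion") -> dict:
    results = {}
    for language, cloud, orchestrator in product(LANGUAGES, CLOUDS, ORCHESTRATORS):
        verdict = run_pipeline(language, cloud, orchestrator, injected_fault)
        results[(language, cloud, orchestrator)] = (verdict == injected_fault)
    return results

report = integration_suite()
print(f"{sum(report.values())}/{len(report)} environment permutations diagnosed correctly")
```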
If I zoom out in terms of lessons for other agentic frameworks, it's really about focusing on both the unit test and the integration test mentality: yes, you need to check the individual steps, but it's really important to see how the goal you're trying to achieve is affected. And I'll bring back my RL hat again: the difficulty in a lot of these systems is that you don't get sufficient reward signal. There's a whole notion of sparse rewards in reinforcement learning that leads to not being able to train, and similarly here you don't get enough feedback through your pipeline to improve. As an agent is trying to book your next flight, how do you know the first search results are sufficient for you to proceed? Bringing that discipline into the system helps you focus and lets you make an educated call on where the weakness is and whether it really works.

You talked a little bit about the three layers; there's the actor, the director, and what's the lowest level? The agent. Right, you talked about the agent and the actor, but I don't think you spent much time on the director. For the kinds of tasks you're targeting, are LLM-reasoning-based systems sufficient, or optimal, at the planning layer, or do you also incorporate more traditional rules-based heuristics, decision trees, that kind of thing? You have to take a practical approach. I like to say that if you distill an LLM, or any agentic workflow, you can think of it as a really amazing labyrinth of if-else statements; whatever it does could in principle be decomposed into that, it's just not humanly possible to write. So the way I see it is as a labyrinth of opaque if-else statements. Correct, and the LLMs are taking those pieces and replacing a lot of those statements, making the if-else go away, and eventually there will be a time when the if-else disappears entirely. Especially with orchestration, you can rely on playbooks, on what humans have done before, so we're able to train on existing data that users have and mimic that to a good extent. And generally those plans are broad statements, so the director ends up being simplistic in the sense of: go look there, go look here; if you don't have trace data, then do this to get it. The director ends up being like the canonical example of booking a flight: booking a flight traditionally has ten well-defined steps, but there are nuances in those steps.
Those nuances can be learned by training on data, but if you really distill them out, they look like pretty common, well-known steps. That's what I mean by the if-else labyrinth that LLMs are replacing: the nuance of the one step of determining whether I should be looking at Redis or Kafka or something else is the hard part, whereas saying I need to look at three places is the easy part. So the director ends up not being a very complex piece of software; it's the other two layers that become the real workhorses in the system. It sounds like what makes the director part work is fine-tuning on a lot of task-specific data and really solid prompt engineering. Correct. It's patterns like how you would debug a cache issue or a database locking issue, so it's very specific in that sense, and that plan can more or less be generated pretty easily. The other part, is Redis broken or is Kafka broken, is the hard reasoning part, and that gets pushed down to the actor layer, where fine-tuning alone may or may not solve it. You talked about unit tests versus system tests; do you unit test the director individually, in the sense of, you just saw this, so given what you know, how would you resolve it? Exactly: generate me a plan to debug a database locking issue. We know what that should look like, so we can check whether you generated it or not. Are those plans difficult to evaluate? Are you doing groundedness testing, looking for n pieces of information to show up in the text that's produced? Yeah, that verification ends up being a fairly standard way of doing it.
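The groundedness-style verification mentioned above can be as simple as checking that the facts you know must appear actually show up in the generated plan; the required-evidence patterns here are made up.

```python
import re

REQUIRED_EVIDENCE = {
    "mentions_lock_waits": r"lock\s*wait",
    "mentions_blocking_query": r"blocking (query|session)",
    "mentions_isolation_or_index": r"(isolation level|missing index)",
}

def groundedness_score(plan_text: str) -> tuple[float, list[str]]:
    """Fraction of required evidence present in the plan, plus whatever is missing."""
    missing = [name for name, pattern in REQUIRED_EVIDENCE.items()
               if not re.search(pattern, plan_text, flags=re.IGNORECASE)]
    return 1 - len(missing) / len(REQUIRED_EVIDENCE), missing

plan = ("1. Inspect lock wait events on the orders table. "
        "2. Identify the blocking query and its transaction. "
        "3. Check isolation level and long-running transactions.")
print(groundedness_score(plan))   # (1.0, [])
```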
In terms of the model, we talked about the MoE. Would you say that's fairly accessible, in terms of it being well defined how to train an MoE, and computationally accessible? Not so much, actually; I didn't get that sense. The fun thing is that 99.9% of the content on the internet is educational and tinkering: you end up finding LoRA and QLoRA and everything in that vein, because people want to tinker, get a feel, and learn. There isn't a lot of material on how you train something like this from the ground up; that knowledge still lives in just a few places, and it pays off to have good friends in different places, including on the PyTorch side, to talk to about what we're trying to do and the best way to do it, plus Discord channels and so on. Those have been helpful in actually being able to train. The content out there is very much "I'm going to tinker with something on my laptop," something basic, not something at scale.

You mentioned a future version of Llama in the context of the MoE discussion. How do you think about model selection? A lot of people I talk to will start with OpenAI, the most capable model, and whittle down to something very specific to make it better, faster, or cheaper in some way. Is that the way you think about that side of things? Not really, because we have a compute profile in mind: ultimately we have to deploy this for our customers, in their environment, and we're really mindful of that. It's still very hard to get A100s, let alone H100s, so you've got to be very mindful of compute profiles, and your customers are not necessarily going to say, well, I'm going to spend a million dollars to stand up an inference cluster to do my dev work. The compute profile is very fixed, and that's a reason we obsess over fine-tuning: we can keep our models small, we can define the compute profile, and we know the throughput we can get. We're very practical in that sense. If Llama 5 ends up being, just speculating, a one-trillion-parameter model, it's probably a non-starter. That's how we think about models. I'm intellectually interested in how architectures evolve, but not in terms of how Flip incorporates them into the product: with all the abstractions we have, the model is a drop-in, plug-and-play part of it, and efficiency is the important part. And so you tend toward smaller as opposed to bigger. Totally, and there's also a speed aspect at scale. You're deploying in a large enterprise with so many debugging sessions happening at the same time; you can't have the profile of an OpenAI o1 or GPT-4 or even a Llama 70B, those become very hard. And we don't run quantized models; that again introduces more chances of failure and finickiness. So we emphasize the robustness of the model, respecting the whole chain of input and output structures, while fitting the cost profile.

Is the MoE all of the layers or just one of the layers? Meaning, is it only the actor layer that's the MoE, or is it everything? How many models are managed in your system? I would say the MoE has about five distinct parts, but everything is abstracted, in the sense that all of the agents call the LLM interface. Think of it as the agents having certain prompts, or sets of prompts, to carry out; the layers are prompts and the way they think about the task and its semantics, and the LLM is the LLM. But it's this one LLM for all of the layers? Correct. Got it. And that thing is trained ground up; no part of it is off the shelf? Actually, we did start off ground up, back in 2022 or so, and what we found is that we needed to incorporate a lot of data for the model to learn English along with the log data, because as you generate summaries and reports, understanding the English part is also important. Then, as the open-source models started getting better, we said, well, we don't need to teach these models English again, so we now take off-the-shelf models and do our training, extended pre-training, fine-tuning, and chaos training, on top of those components.
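A sketch of the "layers are prompts, the LLM is the LLM" idea: every layer goes through one shared interface with its own prompt set, so swapping the backend model touches nothing else. The prompts and the echo backend are invented.

```python
class LLMInterface:
    def __init__(self, backend):
        self.backend = backend               # the MoE today, maybe a different model tomorrow
    def complete(self, system_prompt: str, user_input: str) -> str:
        return self.backend(f"{system_prompt}\n\n{user_input}")

LAYER_PROMPTS = {
    "agent":    "Summarize the given telemetry snippet. Output JSON with a 'summary' field.",
    "actor":    "Given agent summaries for one subsystem, state whether it is healthy and why.",
    "director": "Given actor verdicts, produce an ordered investigation plan.",
}

def call_layer(llm: LLMInterface, layer: str, payload: str) -> str:
    return llm.complete(LAYER_PROMPTS[layer], payload)

echo_backend = lambda prompt: f"<model output for {len(prompt)} prompt chars>"
llm = LLMInterface(echo_backend)             # swap the backend; call_layer never changes
print(call_layer(llm, "director", "checkout actor: unhealthy; kafka actor: healthy"))
```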
What that has done is allow the model not to relearn all of that. So you're trying to have this MoE where the experts have some correlation to task, and that doesn't necessarily mean you can take an off-the-shelf model like a Mixtral 8x7B or whatever, because that's not going to have any correlation to what you're trying to do. So you are taking individual, correct, 3Bs or 8Bs and MoE-ing them together in some way. Yes, exactly. I didn't know you could do that; it was not obvious to me that you can take existing trained standalone models and MoE-train them end to end. We had to do some interesting stuff there, hence the chopping off of layers and wiring things together. They have different tokenizers, so you have to reuse the same tokenizers and so on, and you've got to pass the output of one to the input of the other for certain tasks. Luckily Python is a flexible language, so we've been able to get away with that, but we had to be creative in what we did, because, if you recall, the time series component is a different model, and we had to take its output and feed it into the LLM to be able to reason through: here is a spike that's happening, and this is what it means.

I didn't even realize you would be doing that within the model, as opposed to the time series model surfacing something and you re-injecting it into a prompt. They end up being equivalent, in the sense that if you define a single task, you're just forwarding that information to the next layer. Think of it as: you interpreted something in the time series, and then you send it to a model to spit out the English version of that; you're taking that representation and now the model is producing quote-unquote English. So that means you're deciding when something, say summarizing a log, is significant enough that it warrants being its own flow or model, and then it flows through the underlying layer. It also means you can decouple things easily, because we fundamentally think the models will change, so you cannot be tightly coupled to whatever LLM exists underneath. That makes sense; very cool stuff. It's a practical approach: like I said, we had the compute profile, certain robustness requirements, and so on, so we took all of that and said we're going to treat this more like engineers rather than pure scientists who are thrilled by making one LLM do everything.

Awesome. Well, Sunil, thanks so much for jumping on and catching us up with what you're doing. It's been great to reconnect; it's been too long, and there's a lot of interesting stuff in here. Absolutely, this was delightful. Thanks for the informed questions; I thoroughly enjoyed chatting and reconnecting with you. Thank you.