Keynote - Why You Should Think Twice Before Paying for an Evaluation Tool - Chip Huyen, Voltron Data

The Importance of Understanding System Failures: A Key to Effective Evaluation and Improvement

To evaluate and improve a system effectively, it is crucial to understand where it fails. That understanding helps identify the root cause of a failure and enables more targeted, effective fixes. A common pitfall is introducing evaluation tools too early, before the system's behavior and failure modes are well understood, which often wastes time and resources on tooling that does not deliver the expected benefits.

Evaluation should go beyond detecting failures to understanding their underlying causes, which means identifying where in the system a failure occurs before trying to fix it. In the resume example from the talk, a system that answers "Where has the candidate worked?" with an incomplete list might be failing in the parsing component (the extracted text is missing an employer) or in the generation step (the text is correct but the answer drops one), and each component needs to be checked separately.
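As a rough illustration of this kind of failure localization (a sketch, not something prescribed in the talk), the snippet below checks each stage of a hypothetical two-stage pipeline separately: if an expected fact never appears in the parser's output, the parser is at fault; if it appears in the extracted text but not in the final answer, generation is at fault. The function names and toy data are assumptions made for illustration.

```python
# Minimal sketch: localize a failure by inspecting each pipeline stage separately.
# `parse` and `answer` stand in for whatever parsing and generation components
# the real system uses.

def localize_failure(parse, answer, document, question, expected_facts):
    """Report which stage dropped an expected fact: 'parser', 'generator', or 'ok'."""
    text = parse(document)
    # A fact that never reached the extracted text points at the parser.
    missing = [f for f in expected_facts if f.lower() not in text.lower()]
    if missing:
        return "parser", missing

    response = answer(text, question)
    # A fact present in the text but absent from the answer points at generation.
    missing = [f for f in expected_facts if f.lower() not in response.lower()]
    if missing:
        return "generator", missing
    return "ok", []


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    parse = lambda doc: doc.upper()              # pretend parsing/OCR step
    answer = lambda text, q: "Meta and Google"   # pretend LLM answer
    stage, dropped = localize_failure(
        parse, answer,
        document="Worked at Meta, Google, and Nvidia.",
        question="Where has the candidate worked?",
        expected_facts=["Meta", "Google", "Nvidia"],
    )
    print(stage, dropped)  # -> generator ['Nvidia']: the parser kept the fact, the answer lost it
```

Checking each stage against the same expected facts turns "the system gave a wrong answer" into "this component dropped this fact," which is the information needed to decide what to fix.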

Tools can play a significant role in evaluation, but they must be used judiciously. They can provide valuable insight into system behavior, yet introducing them too early tends to distract from the main problem at hand. Because every tool vendor uses its own taxonomy, prompts, and scoring methodology, adopting a tool also means buying into that vendor's way of framing the problem, and it becomes hard to compare results across tools.

One specific example is faithfulness: in a retrieval-augmented system, the degree to which the generated answer is actually supported by the retrieved context and the original query. Measuring faithfulness with an AI judge requires both a judge model and a prompt, and in practice every tool ships a different prompt and a different scoring scale, so the resulting scores cannot be compared directly across tools.
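As a concrete illustration of why this matters, here is a minimal AI-judge faithfulness check, assuming the OpenAI Python SDK (openai>=1.0) with an API key in the environment; the judge model, the prompt wording, and the 1-to-5 scale are arbitrary choices made for this sketch, not a standard that evaluation tools share.

```python
# Minimal sketch of an LLM-as-judge faithfulness score.
# Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY set in the environment.
# The prompt and the 1-5 scale below are illustrative choices, not a shared standard.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading the faithfulness of an answer.

Context:
{context}

Question:
{question}

Answer:
{answer}

On a scale of 1 (entirely unsupported) to 5 (fully supported by the context),
how faithful is the answer to the context? Reply with a single integer."""


def faithfulness_score(context: str, question: str, answer: str,
                       model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to rate faithfulness; returns an integer from 1 to 5."""
    prompt = JUDGE_PROMPT.format(context=context, question=question, answer=answer)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # A production scorer would parse this more defensively.
    return int(resp.choices[0].message.content.strip())
```

Swap in a different judge model, a different prompt, or a 0-to-1 scale and the numbers change, which is exactly why faithfulness scores reported by two different tools are generally not comparable.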

Moreover, evaluation alone cannot mitigate every failure. The speaker's analogy: when an intern fails at a task, you can blame the hiring process, the intern, or the manager who assigned a task the intern could not do. In the AI setting these correspond, respectively, to the evaluation process, the developers who build the model and application, and the users who do not know how to use the system properly, and each points to a different remedy.

Effectively evaluating and improving a system is therefore a holistic, system-level problem. It involves working with system developers to make the system more robust and more observable so that failures can be traced, designing metrics for the failures that system design alone cannot prevent, and educating users on how to use the system well. Only after the failure modes are understood and the metrics are defined does it make sense to adopt tools to automate them.
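To make metric design concrete: once a failure mode is defined precisely, turning it into a trackable number is usually the easy part. The sketch below, using only the standard library, scores one failure mode mentioned in the talk (responses that are not valid JSON with the expected keys) so the rate can be compared across iterations of the system; the key names and sample data are illustrative.

```python
# Illustrative sketch: turn a defined failure mode ("response is not valid JSON
# with exactly the keys 'title' and 'body'") into a metric tracked per iteration.
import json

def is_well_formed(response: str) -> bool:
    """True if the response parses as JSON and has exactly the expected keys."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(obj) == {"title", "body"}

def format_failure_rate(responses: list[str]) -> float:
    """Fraction of responses that violate the format contract."""
    if not responses:
        return 0.0
    return sum(not is_well_formed(r) for r in responses) / len(responses)

# Example: compare two iterations of the system on the same test prompts.
previous = ['{"title": "a", "body": "b"}', "not json at all"]
current  = ['{"title": "a", "body": "b"}', '{"title": "c", "body": "d"}']
print(format_failure_rate(previous), format_failure_rate(current))  # 0.5 0.0
```

A dashboard of such per-failure-mode metrics is something tools are genuinely good at automating; the hard part that cannot be outsourced is deciding which failure modes matter in the first place.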

The speaker works at Voltron Data, which builds GPU-accelerated data processing. She closed by mentioning a company benchmark showing that moving large data-processing workloads from CPU-based systems such as Spark to Voltron Data's GPU-powered platform makes them substantially faster and cheaper.

Finally, the speaker mentioned a forthcoming book on AI engineering.

"WEBVTTKind: captionsLanguage: enhello everyone uh I'm chip let me see how does this work okay um thank you so much for having me here I be like a big fan of Page for many years so I'm really excited to be here today so uh The Talk today is why you should think twice before adopting an evaluation tool uh true a warning um anyone here is building an evaluation tool okay so I I was worried about this talk a little was like would I offend people and a dried run with like sumith yesterday he was like no go for it so if you upset blame sumith um okay so the I do think just like even though a lot of people might adapt AI for hype I think the vast majority of business decisions when still are still driven by return investment it's very hard for company to deploy an applications unless I can see clear returns so I I I con this like evaluation driven development it's like companies could focus on applications so they can evaluate and measure outcome so it's not a surprise that like the most popular AI use cases today are the US cases where you can actually see very clear returns for example recommend system right like you can recommend and you can tell whether it's working if you see like increased engagement or like higher purchase through rate fraud detection is like if you can stop fraud you can save money and another thing is like coding think coding is definitely like popular for many reasons but one of the reason is that you can actually evaluate based on functional correctness like the AI can Jer code and you can just run it through a lot of unit uh unit test or like integration test very come engineering so I do believe that like why is this like a good approach it can also lead us to situation where you try to like find the key under the lamp post because this is where you can see the light so we are constrained by what applications that we can evaluate and I do believe that like being able to like evaluate better when unlock many many interesting use case however L evaluation is hard so it hard for like many reasons but like I think just go for two of them one is that the more intelligent AI becomes the harder it is to evaluate them so anyone like most people can evaluate whether a first grader math solution is wrong but it's much harder to like look into a PhD level math Solutions I'm not show you saw the recent post by teren style a fuse medalist and he was just like okay it's like the experience is similar to advising a mediocre but not completely incompetent graduate student and I I really love the comment after that it was like if we are already at the point where we need the brightest mind to evaluate AI then we would soon be like out of qualified people to evaluate future AI iterations so another reason is that open open-ended evaluation is like much much harder than like traditionally like classifications so before right now if you have like three options and you choose the other option you know it's wrong but with open-ended evaluation it can go in many many different ways so think about the use case of book summar book summarizations so if a book summary sounds okay like s coherent you don't really know if it's bad or not until like maybe you have to read the entire book to evaluate it or if it ask to like jry some arguments about like why vaccines does not cause autism right you need to like go and verify fact check on the reference that it generates so it become extremely hard um so because this is hard that there's been explosions of like evaluation tools built to help 
company do this so this is some fun project I did so it was like tracking like all the tools like B build for evaluation and you can see clear like jump in evaluation tool just after chbd came out so given that evaluation is really hard and see availability of evaluation tools is very tempting for people to just jump like straight to adapting tools um so it's I do believe that I'm a big fan of tools and I actually like doing building tools in my free time um and I do think that like the right tools can make the life a lot easier I do believe that sometimes introducing tools too early can actually cause more problems than not so for some like before we adapt a tool like evaluations maybe we should take a step back and try to understand like what are we trying to do like what what kind of things we trying to EV evaluate so evaluations is to mitigate risk and uncover opportunities so to mitigate risk we need to understand what the risk are so one one thing very so like uh I have a lot of companies coming to me it's like like okay evaluation is hard um our evaluation system is not good what do we do and I was like why do you say the evaluation is bad because when you say evaluation is bad it usually means that there are certain failures that you should have C caught but you didn't catch so was asking them like so what what kind failur that you wanted to catch but you felt like the system didn't catch and a lot of time just like um I don't know um so one on First Things is like we need to be able to Define like very clearly what failures look like so it's a very interesting case study from LinkedIn and they said that like on their Reflections after a year of building a ation and they said that the hardest thing is to define the criteria for like what good responses are and what bad responses are because a lot of time a response can be good like be correct but entirely like not helpful so for example they build this app uh chatbot to help candidate assess whether they're good fit for a job right and the response is like your terrible fit is a correct response but it's not a good response and need defies that failure mode so another the very common failure modes that I see is like hallucinations um so who here thinks that Hallucination is a big issue for LM in Productions okay about half uh so you know what is coming uh it's a trick questions so who here can tell me what Hallucination is really I'm not going to cor on you I'm just like asking um okay so I see this like one one like very h hand thank you um so that's very common um so there a lot of things we still don't know about hallucinations but also there are certain things that we do know about hallucinations hallucination even though it has captured a lot of imaginations recently is not a new problem so if you come um from like NLP background like in the nlg like natural language Generations Hallucination is a pretty old long-standing problem it's related to natural language inference so the idea is that given a statement and um a given hypothesis in a statement can the statement be derived from the hypothesis so like the ENT we got entailment means that you can derive the statement from the hypothesis uh if it's contradiction means that like it's completely contradictory and the neutral means that no there's no relation that you can't really determine yes or no uh so so there are things that we know about hallucinations right uh so let's try a quick quiz here like so I have like two queries I want to ask the model so when was so the first 
one is when was the International Mathematic Olympiad in 20 uh in 2007 and the other is the same questions but about the Vietnam mathematical Olympiad so who thinks that the model is likely more likely to hallucinate on the IMO question so who so nobody so who thinks that uh the model is more likely to hallucinate on the VMO questions okay so more people right so so um I I wish I could ask people why um so I think just like it's a pretty well study and well research problem just a model is more likely to hallucinate when being ask manich questions like the informations that like unlikely or like very rarely appear in the training data and you can Prett see that for example like yes it did um so it as like a one preview and it did hallucinate on V o but like got it correct on the IMO so it's not entirely like all the way true because propriate nature if you can you can actually like Bully it into like saying the wrong things because if you keep asking it's like really really it would change his answer so actually like a fun Benchmark I want you to do is like how bable in AI it like if you can just like Gaslight it how how often it changes answer so so yeah so we do understand something about hallucination and what what I want to say is that it's very important like if we think Hallucination is an issue we need to like look into the failures like different instance of hallucinations and understand For What types of queries the model is more likely to hallucinate on and build our Benchmark evalu evaluation metrics around those and it's going to be like pain and suffering it's not something that you can just like ultimately Outsource to like a tool um so there are many different ways that A system can fail but in general um like that based purely on the respon I'm not talking about latency cost on a security here that's based purely on the responses um I us to categorize the failures into information based failures it's like for example like when it's factually incorrect informations or the information that could have been factually correct at some point in the past but now it's like outdated uh the other kind very common uh kind failure is going as like Behavior based failures it's like one thing is like one example is when the response is facially correct but just not relevant so example of like um like here an example like if you ask AI why reto is important for uh why ret voice is important for AI applications it might answer but like retal is a technique for search and recess so which is correct but it's not relevant because it doesn't answer why so another very common form of like um Behavior based failure is just a based on uh format like if you ask is to like generate a Json five with two key title and body and it's just like Jared like not possible by Json it's just wrong key so it's not it's like it's a failure mode so you can define a lot of different failure modes and it can help you build a matrix around that for example like maybe the last iterations uh it Jed like 80% correctly formatted responses but this iteration is like 92% so once you define own this Sky failure modes it's a lot easier to define the metric that we can use tools to like uh automate and like build dashboard for uh another things like important to understand is like where the system fell so an example that I actually see quite a lot uh so here I hear your resume but you can also like use like in documents as an example so first example like uh if you the user give the system a resume and a questions like where 
has the candidate worked and the systems output like meta and Google and missing an employee and missing an employer so we need to understand just like where in the system does this fail so you need to look deeper into the system and different components so let's say like this system use like first a resume parsel and then it extract the text and then it us extracted text to generat the response so maybe you want to check first like how accurate the resume parer maybe this the problem is that the resume parsel just like didn't get the correct text or the problem could be with the parser because like the text the extracted text is correct but the answer is incorrect so different kind of um understand of where the the system fails and have you like uh localize the problem and like build like improve the system so I do think that evaluation is um it's just not about like detecting uh failures but also help you like figure out the right solutions for the failures so I do think this like tools can be immensely helpful um however like introducing tools to early before you understand the system can actually distract you from the main problem and sometime I can like introduce more variants because every tun developers has their own own taxonomy and ways of doing things and when you adapt a to you can have to get used to it have to learn it and sometime I can point you in the wrong directions So like um I've seen a lot of evaluation T demo at some point in my life was seeing like two or three every every week and one of them like a lot of them use some variations of like a like LM as a j technique like using AI to like evaluate models and a lot of them have some like similar sting criteria so like one example like uh faithfulness so like if you have the rack system you you retrieve the context and you generate answer and you want to like measure like how faithful the answer is to the context and like and the original query um but like using AI as a judge right you need not just a model the AI model as a judge you also need a prompt so we're trying to look into like how these different tools um evaluate faithful n and just realizes like for every single of those evaluation tools The Prompt is different and even the scoring system is different so like if you just use a tool you kind bought into this whole ways of doing things and it could actually make it harder it can make it harder for you to understand what the core problems are um also um evaluation is to mitigate like Risk and failures but failures might not be swn by evaluation alone so one one example I usually ask people is that like let's say you hire an intern to do some work and that intern fail at their job like who do you blame so one option is just blame the hiring process for not being able to like identify the right intern or you can blame the intern for not trying hard enough or it can blame the manager who just doesn't know like how to assign a task that the intern can do so I do think it's like the first one is like the hiring process is when you like put the responsibilities on evalu ations and if you blame the intern it's like you are blaming the like system developers who build model who people who build the applications or as a manager is when you blame the users for just not knowing how to use the systems right way so evaluation can certainly is one way to have you mitigate failures but you can also do it by other way so you can work with a system developers so make a system more robust your failures or like make it more 
observable so you can understand and Trace that when something fails or I can work with you users and educate them so that they can know how to use a system better so yeah so evaluation is a system problem and I do think the first thing is to figure out what failures look like and understand what failures uh when the system failure uh happen and you can design the system to mitigate failures and then you design Matrix to get failures that the system design cannot solve and only after that that we adapt the right tools um oh sorry I think I have the obligatory I work at Von and we on um data processing on gpus and we have this really cool Benchmark to show that if you move so large amount of data from CPU like spark to our system is a lot faster and a lot cheaper um and I also have a book coming out soonish on engineering but that is pretty much it uh thank you so much everyonehello everyone uh I'm chip let me see how does this work okay um thank you so much for having me here I be like a big fan of Page for many years so I'm really excited to be here today so uh The Talk today is why you should think twice before adopting an evaluation tool uh true a warning um anyone here is building an evaluation tool okay so I I was worried about this talk a little was like would I offend people and a dried run with like sumith yesterday he was like no go for it so if you upset blame sumith um okay so the I do think just like even though a lot of people might adapt AI for hype I think the vast majority of business decisions when still are still driven by return investment it's very hard for company to deploy an applications unless I can see clear returns so I I I con this like evaluation driven development it's like companies could focus on applications so they can evaluate and measure outcome so it's not a surprise that like the most popular AI use cases today are the US cases where you can actually see very clear returns for example recommend system right like you can recommend and you can tell whether it's working if you see like increased engagement or like higher purchase through rate fraud detection is like if you can stop fraud you can save money and another thing is like coding think coding is definitely like popular for many reasons but one of the reason is that you can actually evaluate based on functional correctness like the AI can Jer code and you can just run it through a lot of unit uh unit test or like integration test very come engineering so I do believe that like why is this like a good approach it can also lead us to situation where you try to like find the key under the lamp post because this is where you can see the light so we are constrained by what applications that we can evaluate and I do believe that like being able to like evaluate better when unlock many many interesting use case however L evaluation is hard so it hard for like many reasons but like I think just go for two of them one is that the more intelligent AI becomes the harder it is to evaluate them so anyone like most people can evaluate whether a first grader math solution is wrong but it's much harder to like look into a PhD level math Solutions I'm not show you saw the recent post by teren style a fuse medalist and he was just like okay it's like the experience is similar to advising a mediocre but not completely incompetent graduate student and I I really love the comment after that it was like if we are already at the point where we need the brightest mind to evaluate AI then we would soon be like out of 
qualified people to evaluate future AI iterations so another reason is that open open-ended evaluation is like much much harder than like traditionally like classifications so before right now if you have like three options and you choose the other option you know it's wrong but with open-ended evaluation it can go in many many different ways so think about the use case of book summar book summarizations so if a book summary sounds okay like s coherent you don't really know if it's bad or not until like maybe you have to read the entire book to evaluate it or if it ask to like jry some arguments about like why vaccines does not cause autism right you need to like go and verify fact check on the reference that it generates so it become extremely hard um so because this is hard that there's been explosions of like evaluation tools built to help company do this so this is some fun project I did so it was like tracking like all the tools like B build for evaluation and you can see clear like jump in evaluation tool just after chbd came out so given that evaluation is really hard and see availability of evaluation tools is very tempting for people to just jump like straight to adapting tools um so it's I do believe that I'm a big fan of tools and I actually like doing building tools in my free time um and I do think that like the right tools can make the life a lot easier I do believe that sometimes introducing tools too early can actually cause more problems than not so for some like before we adapt a tool like evaluations maybe we should take a step back and try to understand like what are we trying to do like what what kind of things we trying to EV evaluate so evaluations is to mitigate risk and uncover opportunities so to mitigate risk we need to understand what the risk are so one one thing very so like uh I have a lot of companies coming to me it's like like okay evaluation is hard um our evaluation system is not good what do we do and I was like why do you say the evaluation is bad because when you say evaluation is bad it usually means that there are certain failures that you should have C caught but you didn't catch so was asking them like so what what kind failur that you wanted to catch but you felt like the system didn't catch and a lot of time just like um I don't know um so one on First Things is like we need to be able to Define like very clearly what failures look like so it's a very interesting case study from LinkedIn and they said that like on their Reflections after a year of building a ation and they said that the hardest thing is to define the criteria for like what good responses are and what bad responses are because a lot of time a response can be good like be correct but entirely like not helpful so for example they build this app uh chatbot to help candidate assess whether they're good fit for a job right and the response is like your terrible fit is a correct response but it's not a good response and need defies that failure mode so another the very common failure modes that I see is like hallucinations um so who here thinks that Hallucination is a big issue for LM in Productions okay about half uh so you know what is coming uh it's a trick questions so who here can tell me what Hallucination is really I'm not going to cor on you I'm just like asking um okay so I see this like one one like very h hand thank you um so that's very common um so there a lot of things we still don't know about hallucinations but also there are certain things that we do know about 
hallucinations hallucination even though it has captured a lot of imaginations recently is not a new problem so if you come um from like NLP background like in the nlg like natural language Generations Hallucination is a pretty old long-standing problem it's related to natural language inference so the idea is that given a statement and um a given hypothesis in a statement can the statement be derived from the hypothesis so like the ENT we got entailment means that you can derive the statement from the hypothesis uh if it's contradiction means that like it's completely contradictory and the neutral means that no there's no relation that you can't really determine yes or no uh so so there are things that we know about hallucinations right uh so let's try a quick quiz here like so I have like two queries I want to ask the model so when was so the first one is when was the International Mathematic Olympiad in 20 uh in 2007 and the other is the same questions but about the Vietnam mathematical Olympiad so who thinks that the model is likely more likely to hallucinate on the IMO question so who so nobody so who thinks that uh the model is more likely to hallucinate on the VMO questions okay so more people right so so um I I wish I could ask people why um so I think just like it's a pretty well study and well research problem just a model is more likely to hallucinate when being ask manich questions like the informations that like unlikely or like very rarely appear in the training data and you can Prett see that for example like yes it did um so it as like a one preview and it did hallucinate on V o but like got it correct on the IMO so it's not entirely like all the way true because propriate nature if you can you can actually like Bully it into like saying the wrong things because if you keep asking it's like really really it would change his answer so actually like a fun Benchmark I want you to do is like how bable in AI it like if you can just like Gaslight it how how often it changes answer so so yeah so we do understand something about hallucination and what what I want to say is that it's very important like if we think Hallucination is an issue we need to like look into the failures like different instance of hallucinations and understand For What types of queries the model is more likely to hallucinate on and build our Benchmark evalu evaluation metrics around those and it's going to be like pain and suffering it's not something that you can just like ultimately Outsource to like a tool um so there are many different ways that A system can fail but in general um like that based purely on the respon I'm not talking about latency cost on a security here that's based purely on the responses um I us to categorize the failures into information based failures it's like for example like when it's factually incorrect informations or the information that could have been factually correct at some point in the past but now it's like outdated uh the other kind very common uh kind failure is going as like Behavior based failures it's like one thing is like one example is when the response is facially correct but just not relevant so example of like um like here an example like if you ask AI why reto is important for uh why ret voice is important for AI applications it might answer but like retal is a technique for search and recess so which is correct but it's not relevant because it doesn't answer why so another very common form of like um Behavior based failure is just a based on uh format like 
if you ask is to like generate a Json five with two key title and body and it's just like Jared like not possible by Json it's just wrong key so it's not it's like it's a failure mode so you can define a lot of different failure modes and it can help you build a matrix around that for example like maybe the last iterations uh it Jed like 80% correctly formatted responses but this iteration is like 92% so once you define own this Sky failure modes it's a lot easier to define the metric that we can use tools to like uh automate and like build dashboard for uh another things like important to understand is like where the system fell so an example that I actually see quite a lot uh so here I hear your resume but you can also like use like in documents as an example so first example like uh if you the user give the system a resume and a questions like where has the candidate worked and the systems output like meta and Google and missing an employee and missing an employer so we need to understand just like where in the system does this fail so you need to look deeper into the system and different components so let's say like this system use like first a resume parsel and then it extract the text and then it us extracted text to generat the response so maybe you want to check first like how accurate the resume parer maybe this the problem is that the resume parsel just like didn't get the correct text or the problem could be with the parser because like the text the extracted text is correct but the answer is incorrect so different kind of um understand of where the the system fails and have you like uh localize the problem and like build like improve the system so I do think that evaluation is um it's just not about like detecting uh failures but also help you like figure out the right solutions for the failures so I do think this like tools can be immensely helpful um however like introducing tools to early before you understand the system can actually distract you from the main problem and sometime I can like introduce more variants because every tun developers has their own own taxonomy and ways of doing things and when you adapt a to you can have to get used to it have to learn it and sometime I can point you in the wrong directions So like um I've seen a lot of evaluation T demo at some point in my life was seeing like two or three every every week and one of them like a lot of them use some variations of like a like LM as a j technique like using AI to like evaluate models and a lot of them have some like similar sting criteria so like one example like uh faithfulness so like if you have the rack system you you retrieve the context and you generate answer and you want to like measure like how faithful the answer is to the context and like and the original query um but like using AI as a judge right you need not just a model the AI model as a judge you also need a prompt so we're trying to look into like how these different tools um evaluate faithful n and just realizes like for every single of those evaluation tools The Prompt is different and even the scoring system is different so like if you just use a tool you kind bought into this whole ways of doing things and it could actually make it harder it can make it harder for you to understand what the core problems are um also um evaluation is to mitigate like Risk and failures but failures might not be swn by evaluation alone so one one example I usually ask people is that like let's say you hire an intern to do some work and that intern 
fail at their job like who do you blame so one option is just blame the hiring process for not being able to like identify the right intern or you can blame the intern for not trying hard enough or it can blame the manager who just doesn't know like how to assign a task that the intern can do so I do think it's like the first one is like the hiring process is when you like put the responsibilities on evalu ations and if you blame the intern it's like you are blaming the like system developers who build model who people who build the applications or as a manager is when you blame the users for just not knowing how to use the systems right way so evaluation can certainly is one way to have you mitigate failures but you can also do it by other way so you can work with a system developers so make a system more robust your failures or like make it more observable so you can understand and Trace that when something fails or I can work with you users and educate them so that they can know how to use a system better so yeah so evaluation is a system problem and I do think the first thing is to figure out what failures look like and understand what failures uh when the system failure uh happen and you can design the system to mitigate failures and then you design Matrix to get failures that the system design cannot solve and only after that that we adapt the right tools um oh sorry I think I have the obligatory I work at Von and we on um data processing on gpus and we have this really cool Benchmark to show that if you move so large amount of data from CPU like spark to our system is a lot faster and a lot cheaper um and I also have a book coming out soonish on engineering but that is pretty much it uh thank you so much everyone\n"