Making Algorithms Trustworthy with David Spiegelhalter - TWiML Talk #212

The Power of Data Sheets and Statistical Transparency in AI

Borrowing ideas from adjacent fields like electrical engineering, some in the AI community have proposed data sheets or model cards as ways of documenting the characteristics and biases of AI datasets and systems. The concept is part of a growing trend of applying the statistical methods associated with clinical trials and medical research to the way we communicate about AI systems and machine learning models.
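
As a loose illustration of the datasheet idea, a model card can be as simple as a structured record that travels with the model. The sketch below uses a hypothetical schema and hypothetical field values, not any published standard:

```python
# Minimal sketch of a "model card" as a structured record shipped alongside
# a model. The schema and field values are illustrative, not a published
# standard such as Model Cards or Datasheets for Datasets.
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    name: str
    intended_use: str
    training_data: str            # provenance of the training data
    evaluation: str               # how, and on whom, it was validated
    known_limitations: list[str] = field(default_factory=list)

card = ModelCard(
    name="recidivism-risk-v2",    # hypothetical model name
    intended_use="Decision support only; never a sole basis for bail.",
    training_data="Court records, 2010-2018, one jurisdiction (hypothetical).",
    evaluation="Calibration and error rates reported per demographic subgroup.",
    known_limitations=["Accuracy outside the source jurisdiction is unknown."],
)
print(card)
```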

This approach draws on medicine and the pharmaceutical industry, where building, evaluating, and implementing prognostic systems has long been routine. Statisticians are known for their meticulous attention to detail, particularly when it comes to probabilities. In medical research, a stated probability must be meaningful, with a clear account of its uncertainty and calibration: if a system claims a 70% probability, then across all the occasions it makes that claim, the event should actually occur about 70% of the time.
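
To make the calibration requirement concrete, here is a minimal sketch (synthetic data, illustrative names) that checks predicted probabilities against observed frequencies using scikit-learn:

```python
# Sketch of a calibration check: events predicted at probability p should
# occur about a fraction p of the time. Data here is synthetic and, by
# construction, perfectly calibrated; real models usually are not.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=5000)                         # predicted risks
y_true = (rng.uniform(0, 1, size=5000) < y_prob).astype(int)  # outcomes

frac_observed, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for p, f in zip(mean_predicted, frac_observed):
    print(f"predicted ~{p:.2f} -> observed {f:.2f}")
# Well-calibrated output shows matching pairs; a model that says 0.99 while
# the event occurs 80% of the time is overconfident, however accurate its
# point predictions.
```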

Statisticians in the pharmaceutical field have long been obsessed with evaluating and improving the accuracy of their systems, often to the point of pedantry. That attention to detail is crucial in medical research, where the stakes are high: a single misstep can have devastating consequences, so researchers must be transparent about their methods and findings.

In contrast, the way AI systems are communicated to the public often lacks this discipline. Lengthy terms-of-service agreements, like the rapid-fire disclosures in pharmaceutical advertising, technically convey information but leave users confused and uncertain. As David notes, this kind of disclosure satisfies a legal requirement while being a sham as communication: it prioritizes formal compliance over clarity.

Proprietary systems used in courts to assess recidivism risk or set bail are particularly egregious examples of this lack of transparency. These systems are opaque, with no disclosure of the data they use or the methods by which they reach their conclusions. That secrecy undermines trust and perpetuates decisions that can be biased and unfair.

Key Takeaways from David's Talk

David's talk highlighted several points that underscore the importance of statistical transparency in AI systems. One major takeaway is the need for interdisciplinary approaches, combining insights from statistics, philosophy, psychology, and other fields to develop a more comprehensive understanding of machine learning models. By drawing on these diverse perspectives, researchers can build more robust and trustworthy AI systems.

Another crucial theme is the emphasis on fairness and transparency. As David noted, simply being transparent is not enough: transparency is a means to an end, not an end in itself. Disclosure only builds trust when the information provided is accessible, intelligible, usable, and assessable.

The work of philosophers like Onora O'Neill highlights the need for concrete criteria for transparency, her notion of intelligent openness, and shows how they can be applied in various contexts. By learning from these insights, researchers can develop more effective strategies for communicating about AI systems clearly and trustworthily.

Conclusion

David's talk emphasized the critical need for statistical transparency in AI systems, highlighting the importance of interdisciplinary approaches and fairness in our understanding of machine learning models. By prioritizing trustworthiness, rigorous evaluation, and well-calibrated probabilities, we can create AI systems that better serve the public interest.

"WEBVTTKind: captionsLanguage: enhello and welcome to another episode of twimble talk the podcast why interview interesting people doing interesting things in machine learning and artificial intelligence I'm your host Sam Charrington in this the second episode of our new rip series were joined by David Spiegel halter chair of the Winton Center for a risk and evidence communication at Cambridge University and president of the royal statistical society David who is an invited speaker at nerves presented on making algorithms trustworthy what can statistical science contribute to transparency explanation and validation in our conversation David and I explore the nuance difference between being trusted and being trustworthy and its implications for those building AI systems we also dig into how we can evaluate trustworthiness which David breaks into four phases the inspiration for which he drew from British philosopher o'nora O'Neill's ideas around intelligent transparency enjoy all right everyone I am here in Montreal for the nerves conference and I've got the pleasure of being seated with David Spiegel halter David is chair of the Winton Center for risk and evidence communication at Cambridge as well as president of the Royal statistical Society and he was one of the invited speakers here at nerves talking on making algorithms trustworthy David welcome to this week in machine learning in a I know thank you for having me it's our pleasure before we jump into the topic of your talk please share a little bit of your background and how you got involved in statistics machine learning and kind of the confluence of the two exactly well I I'm a statistician as you can tell and I was around at one of the last summers of AI in the 1980s and I was very interested in computer aided diagnosis such as it was then and interested in statistical approaches to that using simple Bayesian methods or logistic regressions the standard stuff and then and that was an exciting time and I got very interested in this new idea of Bayesian networks and and graphical models and so in the 1980s I really worked and developed this thing called the louse and speaker halter algorithm that was for exact propagation in Bayesian networks we did a lot of work in there and then I went into Bayesian graphical modeling developing the bug software for Bayesian Monteleone Markov chain Monte Carlo analysis and and so on and you know worked all the time in this sort of intersection of Michael machine learning in AI and statistics for the last ten years I've been much more to do with communication and I've got a job that involves communicating statistics and a risk and evidence and now we've got a center this strange Center in the math department at Cambridge with a great gang of psychologists and communication specialists X BBC people web designers I'm very interested in producing trustworthy material that communicates numbers and statistics and risks and predictions and so on okay oh that's really interesting I was wondering what the meaning of risk in evidence communication was almost anything to do with numbers it's better than public communication of statistics right right right okay fantastic and so you're here at nurbs talking about making algorithms trustworthy what does that mean yeah the issue of Trust is very important I've been very influenced by this wonderful philosopher in the UK Nora Anil who studied count and has come up with this very important idea which is have been very influential that organizations and developers a 
Sam: Why is that nuance, between being trusted and being trustworthy, important?

David: Because trust is something you want, but other people can only offer it up to you. Being trustworthy is something within your control. And then that means really analyzing what it means to be trustworthy.

Sam: And so what does that mean from a statistical perspective? How can statistics inform trustworthiness?

David: In the talk I break the trustworthiness of an algorithm, or any sort of system, into two components. First, the system itself should be trustworthy: the claims it makes should be trustworthy, you should be able to rely on them, or, if you can't rely on them, it should tell you how confident it is. The other thing, which is very important, is that the claims made about the system, by the developers, by the commercial entity or whatever, are trustworthy. So you've got to be able not only to believe the system but to believe what's said about the system. What that leads you into very quickly is the importance of evaluation. In my talk I draw an analogy with the highly developed evaluation phases that are used in drug development in pharmaceuticals, where statisticians have worked for decades. Very briefly, there are four phases. Phase one is safety, done on a few healthy people. Phase two is proof of concept, done on some selected people to try to optimize the dosage. Phase three is the big controlled trials, in which you actually compare the drug with a comparator, and that allows you to market it. And phase four is post-marketing surveillance. What I did was draw an analogy with developing algorithms that are going to go into practice. Phase one is just the digital testing that people do: accuracy on a set of test cases. Phase two is laboratory tests, where you actually compare it with, say, doctors if you've got a medical system, and do the user-centred design for the interface. Phase three is the field tests, where it actually goes out there and you evaluate what its impact is, which might be beneficial but could be harmful; you never know what side effects you might have. And phase four is the monitoring once the thing is out there, to make sure it's not degrading and not making mistakes. So I suppose what I'm saying is that, on the whole, when I read about evaluations, they rarely get past phase one, just accuracy on test cases. Some are moving into phase two, comparisons with medical experts for diagnostic systems and things like that. There's almost nothing about phase three: what actually is the benefit, the impact, when these things are put into practice in society and properly evaluated. In order for claims about a system to be trustworthy, you need a much more rigorous evaluation.
Sam: My sense is that we're very far from that today.

David: Exactly, that's what I'm saying. This field has developed so wonderfully, and the stuff at the conference is so amazing, but for all that fantastic technical capacity it's still at a very early stage, because when these things start moving into society, you find people saying, hang on a minute, it's not immediately obvious that this is going to be a good thing in all areas. And so this area needs, I think, to mature into something which does rigorous evaluations.

Sam: It's interesting: one of the controversies at last year's NeurIPS, then NIPS, was a call for increased theoretical rigor around deep learning in particular and our current approaches to AI in general. This is a call for rigor also, but a very different one, from a statistical perspective: a very rigorous test of what it means to actually implement this.

David: Really both, because you need the rigorous internal analysis in order to demonstrate that what the system says is trustworthy. Part of the trustworthiness, and this is where we get to explanation, is being able to say why it's come up with its conclusion, to be able to justify that conclusion. The other statistical perspective I take very strongly is that statisticians are obsessed with uncertainty, with getting the error bars right; we're as concerned with the uncertainties as we are with the point estimate. That's what we bring. If a claim is going to be made, especially when it's made with some uncertainty or lack of confidence, you've got to understand what that means; you've got to be able to rely on the claimed confidence in what an algorithm comes up with.

Sam: In your talk, did you provide examples of the kinds of claims you envision this being applied to, and what you'd expect to see in passing a claim through these filters?

David: In the talk I give various examples at different phases of how some statistical ideas can come in. At the early phase, when you're comparing algorithms on your database to decide which is the best one, I talk about ranking algorithms, and how, using some bootstrap methods on the test set, you can get a probability that any algorithm is actually the best, rather than just producing a simple league table. There's been a lot of statistical work on league tables, essentially taking them apart, because the fact that something happens to rank best on one particular set of data does not mean it's the best algorithm. Just as in football: just because a team is top of the league doesn't mean it's the best team, because there's always luck involved, and we're rather good at trying to put numbers on luck. So there's that aspect.
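
As a rough sketch of the bootstrap approach David describes (the helper function and the models below are invented for illustration):

```python
# Sketch of the bootstrap idea: resample the test set to estimate the
# probability that each model is truly the best, instead of reporting a
# single league table. Model names and per-example results are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def prob_best(correct: dict[str, np.ndarray], n_boot: int = 2000) -> dict[str, float]:
    """correct maps model name -> 0/1 array of per-example correctness on a
    shared test set; returns the bootstrap probability each model ranks first."""
    names = list(correct)
    scores = np.stack([correct[m] for m in names])   # shape (models, n)
    n = scores.shape[1]
    wins = np.zeros(len(names))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample test cases
        wins[np.argmax(scores[:, idx].mean(axis=1))] += 1
    return dict(zip(names, wins / n_boot))

# Three models with true accuracies 0.86, 0.85, 0.80 on 500 shared test cases:
n = 500
results = {m: (rng.random(n) < p).astype(int)
           for m, p in [("A", 0.86), ("B", 0.85), ("C", 0.80)]}
print(prob_best(results))  # A tops the table, but B retains a real chance of being best
```
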
David: Then there's phase two. There's been a recent critique of systems that have made comparisons with doctors, say diagnostic systems, which have been pulled apart because of their lack of statistical rigor. It's very good that they got to that stage, but they're not doing the comparisons very well; they're not up to the standard of rigor one would expect. And for phase three I talked about an old trial I was involved in, where the diagnostic system was a terrible system, but it actually helped when it was put into practice: not because of what the system was computing, but because it changed the culture of data collection and encouraged people to make early diagnoses and be more confident about their work. So there are all sorts of unintended ways in which systems might benefit, but also unintended ways in which they might harm. I went through those applications, and then I went on to this idea of transparency, which is an element of trustworthiness. The philosopher Onora O'Neill has some great things to say about transparency. She thinks transparency can be really dangerous; it's not an end in itself, especially in the sense of disclosure: you can be very "transparent" and yet nobody can understand what's going on. You could release the code or something like that, which looks transparent, but for most people it's hopeless. She's really pulled transparency apart, and she makes this appeal for intelligent openness, which is a really good checklist. Any information you give should be accessible: people have got to be able to get at it. It's got to be intelligible: they've got to understand it. It's got to be usable: it's got to meet their needs. And it's got to be assessable, by which we mean somebody needs to be able to check the working. Not everybody, but somebody out there needs to be able to check the working if necessary.

Sam: When you're using deep learning methods, that's quite a hard challenge to counter.

David: I did mention what I thought was some very nice work being done by Google DeepMind in London with Moorfields Hospital on analyzing eye scans, in which they've deliberately trained the network to provide intermediate steps, a segmentation map, so that it doesn't just go straight to a diagnosis. It's got a probabilistic diagnosis, but it's actually putting in the intermediate steps, which seems a really cool thing. And that's because the project seems to be very strongly influenced by the clinicians themselves, who want that; it's the way they're used to thinking about the problem, and they wanted the system to map their way of thinking. Many people are now claiming that you don't necessarily have to make a trade-off in performance in order to get a much more interpretable model: there are vast numbers of models and vast numbers of options that give very similar performance, especially as a lot of the differences in performance are largely illusory, which is what I talked about earlier. So among the great space of models that fit, you can choose one that enables a much more transparent, much better explanation, and that makes it more trustworthy, because people can see the reasoning.

Sam: You ran off several of the qualities that Onora O'Neill outlined, but they're all very subjective, and they seem in some ways at odds with this statistical rigor.

David: Exactly, but that's why I spend all my time working with psychologists now. I've been very influenced by this. I'm not even going to try to define exactly what "explainable" or "interpretable" means, but you can be quite rigorous about your evaluation of some of these aspects. For example, in the interfaces for the systems we build, we evaluate three ways in which they have an impact on people: the cognitive, do they understand it; the behavioral, what does it do to their behavior and their intentions; and the affective, how does it affect their emotions. We want to measure all of those, and they can be completely different, so it's very important to get that sense of what people take from it.
Sam: And you measure each of them, for example, through surveys?

David: Yes, and the psychologists do actual direct face-to-face interviews with people. These are the phase two evaluations, within a laboratory: we get patients in, get them to talk through a system, actually do some eye tracking as well to see how they're using it, and evaluate on these dimensions. The metrics are quite difficult: the satisfaction with which a decision has been made is quite a tricky thing to evaluate, but you need to try to do it. These are rather nebulous things, but it is worth the effort of trying to measure them as accurately as possible. As an application I used a system we've put a front end on, called Predict, which is for women newly diagnosed with breast cancer who have to decide what other therapies to have apart from surgery. It's based on a fairly basic statistical analysis of a database of four thousand cases, and it produces survival curves up to fifteen years, personalized to various attributes of the tumor and the woman, and then looks at how that survival would change if you take particular therapies. The effects of the therapies are based on causal data from randomized clinical trials, and that's fine. Our idea is that the system, which is currently used by doctors, and will be used by doctors talking to patients (it already is), and even by patients themselves and support groups, is exactly the same system for all these different groups. That means very, very good explanation facilities, both for the terms used and for ways of portraying the risks to patients. And this is serious stuff: this is the chance people are going to be alive in ten years' time. But with very careful use of wording and imagery, and even color, we can do it. The point about explanation like that is twofold. One size does not fit all: different people have different needs and different levels of understanding about numbers and graphics. So you need multi-layered explanation, from a very simple level at the top through to a much deeper level where we put the maths in; if you want, you can see a PDF with all the maths, and you can get the code if you really want it. So you've got all those layers of explanation vertically, but also horizontally: when we're explaining fifteen-year survival we can provide bar charts and survival curves and icon arrays and tables and text, et cetera, all as options depending on what people prefer to see. You've got both vertical and horizontal explanation choices. There's no single correct way to do it, but you can try to evaluate all of it. Only some people want to see the stuff at the bottom, but they should be able to see it, because that's part of the assessable openness.
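
For a flavor of the survival-curve outputs described above, here is a minimal sketch using the lifelines library on invented cohorts; it is not the Predict tool's actual methodology, and every number is toy data:

```python
# Sketch of the kind of survival-curve output such a tool presents, using
# Kaplan-Meier estimates from the lifelines library on invented cohorts.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
t_base = rng.exponential(12, 300).clip(max=15)   # years; follow-up capped at 15
t_tx = rng.exponential(16, 300).clip(max=15)     # hypothetical added therapy
obs_base = t_base < 15                           # True = death observed
obs_tx = t_tx < 15                               # False = censored at 15 years

kmf = KaplanMeierFitter()
kmf.fit(t_base, obs_base, label="surgery only")
print("10-year survival, surgery only:", round(kmf.predict(10), 2))
kmf.fit(t_tx, obs_tx, label="surgery + therapy")
print("10-year survival, with therapy:", round(kmf.predict(10), 2))
```
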
Sam: That example is a compelling one. I find that often, when we're dealing with physicians, there's a presumption of trust or trustworthiness that may work for a lot of people, but sometimes you want a little bit more data, and they're not always prepared for that.

David: Not everybody. Off the top of my head, about half of people are still quite prepared to go along with a very paternalistic "thank you very much, tell me what to do; I don't want to know anything else." But an increasing proportion are asking questions and actually wanting to exercise some of their own rights. I've got friends who have used the system we've been working on to challenge their doctors, saying, okay, it's a tiny benefit, I know I'm going to get terrible side effects, I'm not going to have it. They've used it to challenge, and it's empowered them, which I think is very valuable. Not only that, but in the UK now there are much stronger legal strictures on what must be explained to people in order to get informed consent for treatment; in other words, the risks should be explained to people. So what we're providing is actually some of the tools that allow doctors to carry out their work better.

Sam: There's a thread in the AI community around taking ideas from adjacent fields like electrical engineering: the idea of data sheets, or model cards as some folks have called them, basically different ways of documenting the characteristics or biases of different AI datasets or systems. It sounds like part of what you're doing is a similar idea, but applying ideas from clinical trials, and the statistical methods associated with the medical and pharmaceutical fields, to the way we talk about and communicate around AI systems and machine learning models.

David: Yes, and not just the pharmaceutical area. I have been involved for years in building prognostic systems for people, and in both evaluating them and putting them into practice. One of the crucial things about the evaluation, which people would get really obsessed about in the sort of pedantic way that statisticians tend to operate, is that the probabilities must be meaningful. If you say 70 percent probability for something, or a 70-out-of-100 chance, it's got to mean that: when you add up all the times you say it, the thing should happen 70 percent of the time. They wanted calibrated probabilities. In other words, the accuracy of the uncertainty is as important as the accuracy of the main number. That's a very statistical idea, and I think it's very important, because you always get these grossly overconfident claims: "I'm 99 percent sure that this is the diagnosis." That's grossly misleading; it really is terrible. So that's another very important thing that can be brought from statistics: we've worked a lot on how to evaluate the calibration of probabilities, what sort of test statistics to use and so on, in order to check that element of the trustworthiness of the claim.

Sam: That calls to mind, at least in the US, and I don't know if it's the same elsewhere, the way pharmaceutical advertising works: you have your ad and then the long list of disclosures you have to include. That's just a regulation about what we should know, but at the same time it's kind of a summary of a datasheet.

David: Yeah, but that's not trustworthy communication. That's like having to sign sixteen pages of terms and conditions when you're getting some software. That is not intelligent openness in any way. Is it accessible, usable, comprehensible, assessable? It breaks every rule. It's a terrible sort of communication. It satisfies a law, but it is a complete sham in terms of good communication.
Sam: I agree. At the same time, it is one step better than...

David: Yes, than "the computer says yes." And of course the worst systems of all are the proprietary systems that are used in courts to decide about recidivism risk or bail. It's absolutely shocking: because they're proprietary, they're totally non-transparent, and you've got no idea what information is being used in them. They break every rule; everything I'm talking about is broken by that kind of system.

Sam: So, key takeaways from your talk?

David: Well, I suppose that basic statistical ideas and experience have a lot to offer. But I'm not just taking ideas from statistics; I'm taking ideas from philosophy and psychology and empirical testing, things that this maturing, unbelievably important discipline could, I think, take a lot more account of.

Sam: That's very much in line with some of the key themes I'm hearing at this year's NeurIPS, in fact two of them: one is the importance of fairness, transparency, et cetera, and the other is the importance of interdisciplinary approaches.

David: There's some wonderful work going on. This morning's session featured social impact statements, which I liked partly because they do not treat transparency as an objective in itself; they've learned that transparency is just a means to an end. There's no good in just being transparent unless you follow Onora O'Neill's ideas of what transparency means.

Sam: We'll definitely provide a pointer to Onora O'Neill's work, and to yours as well. Of course, the talk is up on Facebook too.

David: Fantastic.

Sam: Well, David, thank you so much for taking the time.

David: A real pleasure. Thank you very much for asking me.

Sam: All right everyone, that's our show for today. For more information on David or any of the topics covered in this episode, visit twimlai.com/talk/212. You can also follow along with our NeurIPS series at twimlai.com/neurips2018. As always, thanks so much for listening, and catch you next time.