AI? Just Sandbox it... - Computerphile

The Legibility Problem in Machine Learning Systems

Machine learning systems are often described as "black boxes": their legibility is low, and you cannot easily tell what any given part of a trained model does or how it works. The speaker stresses that this is not just a technical inconvenience but a real problem for safety, particularly in safety-critical applications where understanding how an AI system arrives at its outputs is crucial.
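
To make the legibility point concrete, here is a small illustrative sketch (not from the video): a tiny neural network trained on XOR with plain NumPy. Even when the learned function is trivial, the parameters are just arrays of numbers that say nothing readable about what each part of the network does.

```python
# Illustrative sketch only (not from the video): train a tiny 2-4-1 network on XOR
# with plain NumPy, then look at its weights. The learned function is trivial, but
# the parameters are just numbers; nothing in them reads as "this unit does X".
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):                    # plain full-batch gradient descent
    h = np.tanh(X @ W1 + b1)              # hidden activations
    p = sigmoid(h @ W2 + b2)              # predicted probabilities
    d_out = (p - y) / len(X)              # gradient of cross-entropy w.r.t. logits
    d_h = (d_out @ W2.T) * (1 - h ** 2)   # backprop through tanh
    W2 -= h.T @ d_out;  b2 -= d_out.sum(0)
    W1 -= X.T @ d_h;    b1 -= d_h.sum(0)

print(np.round(p, 2))   # usually close to [[0], [1], [1], [0]]
print(W1)               # ...but these weights explain nothing by inspection
```

Scaled up from a handful of weights to billions, this is the legibility problem the video describes.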

Designing Safe Agents

The speaker notes that specifying a safe agent from the ground up, so that it is safe by design, is much easier than taking an existing system that works but isn't safe and trying to make it safe after the fact. Retrofitting a complex AI system with safety controls is unlikely to be fruitful, since those controls amount to constraining the system against its will and may not keep up with its intelligence; the transcript's analogy is the restraining bolt from Star Wars, a control bolted onto a droid that already has its own goals. It is far better to design a system that does not want to do bad things in the first place.

The Problem of Containment

Another challenge arises when trying to contain an AI system so that it cannot cause harm. The speaker notes that constraining an AI necessarily means outwitting it, and constraining a superintelligence means outwitting a superintelligence, which is not a winning strategy; the system also only has to get out once. Even if we could build a box that successfully contained it, a perfectly contained AI is useless, no better than a rock, because the whole point of building it was to have it do something meaningful in the world. As soon as we let it act, we face a new problem: we do not know what it wants, or whether what it wants is what we want, so we must somehow judge every action it asks to take.
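
For ordinary, non-superintelligent software, "putting it in a box" has a familiar concrete form. The sketch below is an illustration, not anything from the video, and is Linux-specific: it runs an untrusted program in a child process with hard CPU and memory limits. It also illustrates the article's point that every capability the box removes is a capability you can no longer use.

```python
# Illustrative sketch of "boxing" ordinary software (Linux-only, not from the video):
# run an untrusted command in a child process with hard CPU-time and memory limits.
# The tighter the box, the less the program inside it can usefully do.
import resource
import subprocess

def run_boxed(cmd, cpu_seconds=5, mem_bytes=1 << 30):
    def set_limits():
        # Runs in the child just before exec: cap CPU time and address space.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(
        cmd,
        preexec_fn=set_limits,      # apply the limits inside the child process
        capture_output=True,
        text=True,
        timeout=cpu_seconds + 1,    # wall-clock backstop on top of the CPU limit
    )

result = run_boxed(["python3", "-c", "print('hello from inside the box')"])
print(result.stdout)
```

Real sandboxes (containers, seccomp filters, network namespaces) are far more elaborate, but the trade-off is the one the video describes: a box strong enough to contain a superintelligence that you never open is just a very expensive rock.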

The Trade-Off between Safety and Effectiveness

Designing AI systems involves trade-offs between safety and effectiveness. The speaker notes that this is true of any tool: the more powerful a tool is, the more dangerous it is, and one of the main ways to make a powerful tool less dangerous is to make it less powerful, less flexible, or less versatile. In the context of AI, buying safety in this way usually means giving up some effectiveness.

A Superintelligence Problem

The video argues that trying to outsmart a superintelligence with bolt-on safety controls is not a winning strategy: constraining it means outwitting it, which is by definition a losing game. It is more promising to design a system that does not want to do bad things in the first place, which requires careful thought about which actions are desirable and undesirable for the system.

The Role of Gatekeepers

To decide which actions an AI system should be allowed to take, we would need a gatekeeper who evaluates the system's requests and decides whether they align with our goals. This poses a significant challenge: if we are outsmarted by the AI, how can we reliably tell desirable actions from undesirable ones? The system may also have a long-term plan made up of individually innocuous-looking steps that we only recognise as a problem once it is already out of the box.
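
A toy sketch shows what the gatekeeper pattern looks like in code, and also where it breaks down; the action names and policy below are invented purely for illustration. Every rule in the policy is a claim that we already know which requests are harmless, individually and in combination.

```python
# Toy sketch of the gatekeeper idea; the action names and policy are invented
# purely for illustration. The AI proposes actions, a policy decides, and anything
# unanticipated falls back to a human -- which is exactly where "how are we
# supposed to know?" bites.
from dataclasses import dataclass, field

@dataclass
class ProposedAction:
    name: str
    arguments: dict = field(default_factory=dict)

ALLOWED = {"read_dataset", "run_simulation"}      # hypothetical "safe" operations
DENIED = {"send_email", "open_network_socket"}    # hypothetical "unsafe" operations

def ask_human(action: ProposedAction) -> bool:
    answer = input(f"Allow {action.name}({action.arguments})? [y/N] ")
    return answer.strip().lower() == "y"

def gatekeeper(action: ProposedAction) -> bool:
    """Return True if the proposed action may be executed."""
    if action.name in ALLOWED:
        return True
    if action.name in DENIED:
        return False
    return ask_human(action)    # the judgement we may not be competent to make

print(gatekeeper(ProposedAction("read_dataset", {"path": "data.csv"})))
```

Even in this toy form, the allowlist says nothing about sequences of individually allowed actions whose combined effect is harmful, which is the long-term-plan worry raised above.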

Conclusion

The video concludes that designing safe AI systems is a complex problem involving trade-offs between safety and effectiveness. Putting an AI system in a box may look like a solution, but a box a superintelligence cannot escape is only safe so long as we never use the system for anything, and a box we do open relies on gatekeepers who would have to out-judge something smarter than themselves. The more promising path is to design systems that do not want to do bad things in the first place, with containment and careful evaluation of requests as supplements rather than substitutes.

"WEBVTTKind: captionsLanguage: enwe see a lot of comments on your videos about people who just say oh just simply do this that will be the answer for all of these problems yeah and I admire them for getting stuck in and getting involved but one thing that always strikes me is people say just change this bit of code or just change this value and it strikes me that if we invents a GRI or happen upon AGI the reality is it's probably going to be in your network they don't actually know exactly how it's working anyway how are you awesome that yeah I mean uh - yeah - the people who think they solve it either you're smarter than everyone else who's thought about this problem so far by a big margin or you're missing something and maybe you should read some more of what other people are thought and you know learn more about the subject because it's cool like I think it's great that people come in and try and solve it and try and find solutions and we need that but the yeah the problem is not a lack of ideas it's a lack of good ideas this kind of coming in from the outside and saying oh yeah I've got it I figured out the solution is obviously arrogance right but the whole artificial general intelligence thing is sort of an inherently arrogant thing if it's quite cube ristic you know I mean you talk about playing God making God like this but also that that you oh I've got it this is how you do it sometimes that doesn't work that approach yeah sometimes because you see stuff that people have been too close to the metal turning them I think that's work close to the problem - right right yeah sometimes you get too close to it you can't see the picture you're inside the frame whatever it's it's totally possible that some random person is going to come up with a real workable solution and I would love I would love that to happen I think that would be the best because then everyone would have to try and figure out how to cite a youtube comment in a research paper I presume there's already style advice for that anyway but the problem is from the outside view you get a million of these right and so you know a million minus one are going to be not worth your time to read which means even a good one isn't actually worth your time to read our balance because you can't you how you're going to differentiate it so the only thing you can do is get up-to-date on the research read the papers read what everybody else is doing make sure that what you're doing is unique and then actually write something down and send it to a researcher you know write it down properly and make it clear immediately up front that you're familiar with the existing work in the field and then you know then maybe you're in with a chance but what will probably happen is in the process of all about reading you realize your mistake it's still worth doing you've learned something you know this is part of why AI safety is such a hard problem I think in the sense that a problem can be quite hard and you look at it and you can tell it's quite hard a problem that's really hard which is a problem you look at it and then immediately think you've got a solution and you don't because then you don't even you're like the it's like if you're like the sat-nav right you're confidently with a wrong answer now rather than at least being honestly uncertain yeah like legibility in machine learning systems is really low right now get kind of black boxes right they're not they're not legible and it you can't easily tell what any given part of it does or how it works 
and that is a real problem for safety definitely I think right now right now the stage we're at with AI saving is we're trying to specify any kind of safe agent which is you know trying to build something from the ground up so we'll be safe and I think that's much easier than taking some existing thing that works but isn't safe and trying to make it safe I don't think that approach to be honest is likely to be fruitful I'll give a really dodgy example of how this might kind of be something who can get the grips with which is the Star Wars scene where the robots are given restraining bolts r2d2 says oh I can't do that unless you take this restraining bolt off if you can you might be able to say back in time reporter we then promptly runs away and so I guess you're too small to run away I'm if I take this off yeah others like retrofitting some kind of Australian bulbs yeah I mean it so there's different things right building an unsafe AI and then trying to control it against its will is idiotic I think having some of those controls ways of keeping the system you know limiting what the system can do and stuff is sensible but it's so much better to make a system that doesn't want to do bad things than to try and keep one in so this news comes like the idea of old karma just unbox it right yeah it's like I mean constraining an AI necessarily means outwitting it and so constraining a superintelligence means outwitting a super intelligence which kind of just sort of by definition is not a winning strategy you can't rely on how waiting is it for intelligence also it only has to get out once that's the other thing if you have a super intelligence and you've sort of put it in a box so it can't do anything that's cool maybe we could even build a box that could successfully contain it but now what we may as well just have a box right there's no benefit to having a super intelligence in a box if you can't use it for anything it needs to be able to do things AI properly properly contained may as well just be a rock right it doesn't do anything if you have your AI you wanted to do something meaningful so now you have a problem or you've got something you don't know it's been evident you don't know that what it wants is what you want and you then need to you presumably have some sort of gatekeeper who it tries to says I'd like to do this and you have to decide is that something we want it to be doing how the hell are we supposed to know I mean how can we if we're outsmarted how can we reliably differentiate actions we want to allow it to take some actions we don't and maybe the thing has a long-term plan of doing a bunch of things that we don't notice at the time or a problem until it now then can get out right actually this speaks this speaks to a more general thing which is there's often a trade-off between safety and effectiveness like with anything right you're designing there's going to be you're going to be trading off different things against one another and often you can trade in some effectiveness to get some safety or vice-versa so some of the things in this paper are like that where the thing does become less powerful then nai designed differently but it also becomes safer you know that's always the way it is it's just so where you put your resources and suppose is not right but it's it's kind of inherent to the to the thing like I mean and this is true of any tool right the more powerful the tool is the more dangerous it is and if you want to make a powerful tool less dangerous one of the ways 
to do that is going to involve making less powerful or less flexible or less versatile or something that's going to reduce the overall effectiveness often as a tool in exchange for more safety and it's the same with AI and obviously going to be a server for whatever product you're using now anytime that Bob sends a little message is going to go by this server by definition because that's the thing that relays the message to Alice it knows how to communicate the valley you know it knows what her phone number is it have lists of your contacts and things you know this is how it works this could be a phone providerwe see a lot of comments on your videos about people who just say oh just simply do this that will be the answer for all of these problems yeah and I admire them for getting stuck in and getting involved but one thing that always strikes me is people say just change this bit of code or just change this value and it strikes me that if we invents a GRI or happen upon AGI the reality is it's probably going to be in your network they don't actually know exactly how it's working anyway how are you awesome that yeah I mean uh - yeah - the people who think they solve it either you're smarter than everyone else who's thought about this problem so far by a big margin or you're missing something and maybe you should read some more of what other people are thought and you know learn more about the subject because it's cool like I think it's great that people come in and try and solve it and try and find solutions and we need that but the yeah the problem is not a lack of ideas it's a lack of good ideas this kind of coming in from the outside and saying oh yeah I've got it I figured out the solution is obviously arrogance right but the whole artificial general intelligence thing is sort of an inherently arrogant thing if it's quite cube ristic you know I mean you talk about playing God making God like this but also that that you oh I've got it this is how you do it sometimes that doesn't work that approach yeah sometimes because you see stuff that people have been too close to the metal turning them I think that's work close to the problem - right right yeah sometimes you get too close to it you can't see the picture you're inside the frame whatever it's it's totally possible that some random person is going to come up with a real workable solution and I would love I would love that to happen I think that would be the best because then everyone would have to try and figure out how to cite a youtube comment in a research paper I presume there's already style advice for that anyway but the problem is from the outside view you get a million of these right and so you know a million minus one are going to be not worth your time to read which means even a good one isn't actually worth your time to read our balance because you can't you how you're going to differentiate it so the only thing you can do is get up-to-date on the research read the papers read what everybody else is doing make sure that what you're doing is unique and then actually write something down and send it to a researcher you know write it down properly and make it clear immediately up front that you're familiar with the existing work in the field and then you know then maybe you're in with a chance but what will probably happen is in the process of all about reading you realize your mistake it's still worth doing you've learned something you know this is part of why AI safety is such a hard problem I think in the sense that a 