The Power of Alpha Zero: A Step Towards Creating a Universal AI Agent
What's the interesting aspect there that was required to make that happen? I think the remarkable observation we made with AlphaZero was that, without modifying the algorithm at all, it was able to crack some of AI's greatest previous challenges. In particular, we dropped it into the game of chess and, unlike previous systems such as Deep Blue, which had been worked on for years and years, we were able to beat the world's strongest computer chess program convincingly, using a system whose play was fully discovered from scratch, by its own principles. In fact, one of the nice things we found was that we achieved the same result in Japanese chess (shogi), a variant of chess where you get to capture pieces and then place them back down on your own side as extra pieces, so a much more complicated variant, and there too we beat the world's strongest programs and reached superhuman performance. The very first time we ever ran the system on that game was the version we published in the AlphaZero paper. It just worked out of the box: literally no touching it, we didn't have to do anything, and there it was, superhuman performance. No tweaking, no twiddling.
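To make the idea of "dropping the same algorithm into a new game" concrete, here is a minimal, hypothetical sketch of an AlphaZero-style self-play loop. The `Game` interface, `network.update`, and `mcts_policy` names are illustrative assumptions, not the actual DeepMind code; the point is that the rules object is the only game-specific ingredient, while the learning loop never changes.

```python
# Hypothetical sketch of a game-agnostic, AlphaZero-style training loop.
# The Game interface is the ONLY place the rules of chess, shogi, or Go
# enter; the self-play loop itself is identical across games.

import random

class Game:
    """Abstract rules interface; one implementation per game (chess, shogi, Go)."""
    def initial_state(self): ...
    def legal_moves(self, state): ...      # used inside the search
    def apply(self, state, move): ...
    def outcome(self, state): ...          # +1 / 0 / -1 when the game is over, else None

def self_play_episode(game, network, mcts_policy):
    """Play one game against itself, recording (state, search policy) pairs."""
    state, history = game.initial_state(), []
    while game.outcome(state) is None:
        policy = mcts_policy(game, network, state)     # search guided by the network
        history.append((state, policy))
        move = random.choices(list(policy), weights=list(policy.values()))[0]
        state = game.apply(state, move)
    return history, game.outcome(state)

def train(game, network, mcts_policy, num_episodes):
    for _ in range(num_episodes):
        history, result = self_play_episode(game, network, mcts_policy)
        # Update the network toward the search policies and the final result.
        network.update(history, result)
```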
I think there's something beautiful about that principle: you can take an algorithm and, without twiddling anything, it just works. Now, to go beyond AlphaZero, what's required? AlphaZero is just a step, and there's a long way to go beyond it to really crack the deep problems of AI. One of the important steps is to acknowledge that the world is a really messy place. It's this rich, complex, beautiful but messy environment that we live in, and no one gives us the rules. No one knows the rules of the world; at best we understand that it operates according to Newtonian or quantum mechanics at the micro level, or according to relativity at the macro level, but that's not a model that's useful for us as people operating in it. Somehow the agent needs to understand the world for itself, in a way where no one tells it the rules of the game, and yet it can still figure out what to do in that world: deal with the stream of observations coming in, the rich sensory input, the actions going out, in a way that allows it to reason as AlphaGo or AlphaZero can reason, as these Go- and chess-playing programs can reason, but also to take actions in that messy world to achieve its goals.
This led us to the most recent step in the story of AlphaGo, a system called MuZero. MuZero is a system that learns for itself even when the rules are not given to it; it can be dropped into an environment with messy perceptual inputs. We actually tried it on some Atari games, the canonical domains that have been used for reinforcement learning, and the system learned to build a model of these Atari games that was sufficiently rich and useful for it to plan successfully. In fact, that system not only went on to beat the state of the art in Atari, but the same system, without modification, was able to reach the same level of superhuman performance in Go, chess, and shogi that we'd seen in AlphaZero, showing that even without the rules the system can learn for itself, just by trial and error. You play this game of Go, no one tells you what the rules are, but you get to the end and someone says win or loss; you play this game of chess and someone says win or loss; you play a game of Breakout in Atari and someone just tells you your score at the end. The system figures out for itself, essentially, the rules of the game, the dynamics of the world, how the world works, and not in any explicit way, but implicitly, with enough understanding to be able to plan in that environment in order to achieve its goals.
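A rough sketch of the structure being described may help, with assumed names (the MuZero paper usually writes the three learned functions as h, g, and f). Nothing in this sketch reconstructs the true rules or predicts raw pixels; the hidden state only has to be rich enough for planning, entirely in latent space, to work.

```python
# Minimal sketch of a MuZero-style learned model, under assumed names/shapes.
#   h (represent): observation history -> abstract hidden state
#   g (dynamics):  (hidden state, action) -> (next hidden state, reward)
#   f (predict):   hidden state -> (policy, value)

from dataclasses import dataclass

@dataclass
class ModelOutput:
    hidden_state: object
    reward: float
    policy: dict     # action -> prior probability
    value: float

class MuZeroModel:
    def represent(self, observations):
        """h: encode the raw observation history into an abstract hidden state."""
        raise NotImplementedError

    def dynamics(self, hidden_state, action):
        """g: predict the next hidden state and immediate reward for an action."""
        raise NotImplementedError

    def predict(self, hidden_state):
        """f: predict a policy and a value from a hidden state."""
        raise NotImplementedError

def rollout(model, observations, actions):
    """Unroll an imagined trajectory purely in latent space; such rollouts drive
    both planning (tree search) and training against observed outcomes."""
    hidden = model.represent(observations)
    outputs = []
    for action in actions:
        hidden, reward = model.dynamics(hidden, action)
        policy, value = model.predict(hidden)
        outputs.append(ModelOutput(hidden, reward, policy, value))
    return outputs
```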
And that's the fundamental process you have to go through when you're facing any uncertain kind of environment, as you would in the real world: figuring out the basic rules of the game. That's right. And that allows it to be applicable to basically any domain that can be digitized in the way it needs to be in order to be consumable, in order for the reinforcement learning framework to sense the environment, to act in it, and so on. The full reinforcement learning problem needs to deal with worlds that are unknown and complex, and the agent needs to learn for itself how to deal with that, and so MuZero was a further step in that direction.
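The "full reinforcement learning problem" referred to here reduces to a very small interface: any domain that can be digitized into observations, actions, and a reward signal fits it. A minimal sketch, with assumed method names (`reset`, `step`, `act`, `learn` are illustrative, not a specific library's API):

```python
# Minimal sketch of the agent-environment interface assumed by the full RL
# problem: the agent only ever sees observations and rewards, never the rules.

def run_episode(env, agent, max_steps=10_000):
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)               # act from sensory input alone
        observation, reward, done = env.step(action)  # environment dynamics stay hidden
        agent.learn(observation, reward, done)        # trial-and-error update
        total_reward += reward
        if done:
            break
    return total_reward
```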
"WEBVTTKind: captionsLanguage: enso the next incredible step right really the profound step is probably alphago zero I mean it's arguable I kind of see them all as the same place but really and perhaps you were already thinking that alphago zeros the natural it was always going to be the next step but it's removing the reliance on human expert games for pre-training as you mentioned so how big of an intellectual leap was this that that self play could achieve superhuman level performance in its own and maybe could you also say what is self play I kind of mentioned if you tell us but so let me start with self play so the idea of self play is something which is really about systems learning for themselves but in the situation where there's more than one agent and so if you're in a game the game is a played between two players then self play is really about understanding that game just by playing games against yourself rather than against any actual real opponent and so it's a way to kind of um discover strategies without having to actually need to go out and play against any particular human player for example the main idea of alpha zero was really to you know try and step back from any of the knowledge that we'd put into the system and ask the question is it possible to come up with a single elegant principle by which a system can learn for itself all of the knowledge which it requires to play to play a game such as go importantly by taking knowledge out you not only make the system less brittle in the sense that perhaps the knowledge you were putting in was was just getting in the way and maybe stopping the system learning for itself but also you make it more general the more knowledge you put in the harder it is for a system to actually be placed taken out of the system in which it's kind of been designed and placed in some other system that maybe would need a completely different knowledge base to to understand and perform well and so the real goal here is to strip out all of the knowledge that we put in to the point that we can just plug it into something totally different and that to me is really you know the the promise of AI is that we can have systems such as that which you know no matter what the goal is no matter what goal we set to the system we can come up with we have an algorithm which can be placed into that world into that environment and can succeed in achieving that goal and then that that's to me is almost the the essence of intelligence if we can achieve that and so alpha zero is a step towards that and it's a step that was taken in the context of two-player perfect information games like go and chess we also applied it to Japanese chess so just to clarify the first step was alphago zero the first step was to try and take all of the knowledge out of alphago in such a way that it could play in a in a fully self discovered way purely from self play and to me the the motivation for that was always that we could then plug it into other domains but we saved that bat until later well in in fact I mean just for fun I could tell you exactly the moment where where the idea for alpha zero occurred to me because I think there's maybe a lesson there for for researchers who kind of too deeply embedded in their in their research and you know working 24/7 to try and come up with the next idea which is actually occurred to me on honeymoon like it's my most fully relaxed state really enjoying myself and and just being this like the algorithm for alpha zero just appeared I come and in in 
its full form and this was actually before we played against Lisa Dahl but we we just didn't I think we were so busy trying to make sure we could beat the the world champion that it was only later that we had the opportunity to step back and start examining that that sort of deeper scientific question of whether this could really work so nevertheless so self play is probably one of the most sort of profound ideas that it represents to me at least artificial intelligence but the fact that you could use that kind of mechanism to again be more class players that's very surprising so we kind of to be it feels like you have to train in a large number of experts so was it surprising to you what was the intuition can you sort of think not necessarily at that time even now what's your intuition why this thing works so well I was able to learn from scratch well let me first say why we tried it so we tried it both because I feel that it was the deeper scientific question to to be asking to make progress towards AI and also because in general in my research I don't like to do research on questions for which we already know the likely outcome I don't see much value in running an experiment where you're 95% confident that that you will succeed and so we could have tried you know maybe to to take alphago and do something which we we knew for sure it would succeed on but much more interesting to me was to try try it on the things which we weren't sure about and one of the big questions on our minds back then was you know could you really do this with self play alone how far could that go would it be as strong and honestly we weren't sure yeah it was 50/50 I think you know we I really if you'd asked me I wasn't confident that it could reach the same level as these systems but it felt like the right question to ask and even if even if it had not achieved the same level I felt that that was an important direction to be studying and so then lo and behold it actually ended up our performing the version of of alphago and indeed was able to beat it by 100 games to zero so what's the intuition as to as to why I think that the intuition to me is clear that whenever you have errors in a in a system as we did in alphago alphago suffered from these delusions occasionally it would misunderstand what was going on in a position and Miss evaluate it how can how can you remove all of these these errors errors arise from many sources for us they were arising both from you know it started from the human data but also from the from the nature of the search and the nature of the algorithm itself but the only way to address them in any complex system is to give the system the ability to correct its own errors it must be able to correct them it must be able to learn for itself when it's doing something wrong and correct for it and so it seemed to me that the way to correct delusions was indeed to have more iterations of reinforcement learning that you know no matter where you start you should be able to correct those errors until it gets to play that out and understand oh well I thought that I was going to win in this situation but then I ended up losing that suggests that I was miss evaluating something and there's a hole in my knowledge and now now the system can correct for itself and and understand how to do better now if you take that same idea and trace it back all the way to the beginning it should be able to take you from no knowledge from completely random starting point all the way to the highest levels of knowledge 
that you can achieve in it in a domain and the principle is the same that if you give if you bestow a system with the ability to correct its own errors then it can take you from random to something slightly better than random because it sees the stupid things that the random is doing and it can correct them and then it can take you from that slightly better system and understand what what's that doing wrong and it takes you on to the next level and the next level and and this progress it can go on indefinitely and indeed you know what would have happened if we'd carried on training alphago zero for longer we saw no sign of it slowing down it's in improvements or at least it was certainly carrying to improve and presumably if you had the computational resources this this could lead to better and better systems that discover more and more so your intuition is fundamentally there's not a ceiling to this process the one of the surprising things just like you said is the process of patching errors it's intuitively makes sense they this is a reinforcement learning should be part of that process but what is surprising is in the process of patching your own lack of knowledge you don't open up other patches you go you keep sort of like there's a monotonic decrease of your weaknesses well let me let me back this up you know I think science always should make falsifiable hypotheses yes so let me let me back out this claim with a falsifiable hypothesis which is that if someone was to in the future take alpha zero as an algorithm and run it on with greater computational resources that we had available today then I predict that they would be able to beat the previous system 100 games to zero and that if they were then to do the same thing a couple of years later that that would be that previous system hundred games to zero and that that process would continue indefinitely throughout at least my human lifetime presumably the game of girl would set the ceiling I mean the game of go would set the ceiling but the game of grow has 10 to the hundred and seventy states in it so he so the ceiling isn't unreachable by any computational device that can be built out of the you know 10 to the 80 atoms in the universe you asked a really good question which is you know do you not open up other errors when you when you correct your previous ones and the answer is is yes you do and so so it's a remarkable fact about about this class of two-player game and also true of single agent games that essentially progress will always lead you to if you have sufficient representational resource like imagine you had could represent every state in a big table of the game then we we know for sure that a progress of self-improvement will lead all the way in the single agent case to the optimal possible behavior and in the two-player case to the minimax optimal behavior that is that the best way that I can play knowing that you're playing perfectly against me and so so for those cases we know that even if you do open up some new error that in some sense you've made progress you've you're progressing towards the the best that can be done so alphago was initially trained on expertise with some self play alphago zero removed the need to be trained and experts and then another incredible step for me because I just love chess is to generalize that further to be in alpha zero to be able to play the game of go beating alphago zero and alphago and then also being able to play the check at the game of chess and others so what was that step 