David Silver - AlphaGo, AlphaZero, and Deep Reinforcement Learning | Lex Fridman Podcast #86
The Goal of Maximizing Entropy: A Conversation with David Silver
Well, if the goal is to maximize entropy, how is that done by a particular system? Maybe evolution is something that the universe discovered in order to dissipate energy as efficiently as possible. And by the way, I'm borrowing from Max Tegmark, the physicist, for some of these metaphors. But if you can think of evolution as a mechanism for dispersing energy, then evolution, you might say, becomes a goal in itself: if evolution disperses energy by reproducing as efficiently as possible, then evolution has now got its own goal within that, which is to actually reproduce as effectively as possible. Now, how is reproduction made as effective as possible? Well, you need entities within that which can survive and reproduce as effectively as possible, and so it's natural that, in order to achieve that high-level goal, those individual organisms discover brains, intelligences, which enable them to support the goals of evolution.
And those brains, what do they do? Well, perhaps the early brains were controlling things at some direct level. Maybe they were the equivalent of pre-programmed systems, which were directly controlling what was going on and setting certain things in order to achieve these particular goals. But that led to another level of discovery, which was learning systems: parts of the brain which were able to learn for themselves and learn how to program themselves to achieve any goal. And presumably there are parts of the brain where goals are set to parts of that system, and that provides this very flexible notion of intelligence that we as humans presumably have, which is the reason we feel that we can achieve any goal.
So it's a very long-winded answer to say that I think there are many perspectives and many levels at which intelligence can be understood, and at each of those levels you can take multiple perspectives. You can view the system as something which is optimizing for a goal, which is understanding it at a level by which we can maybe implement it and understand it as AI researchers or computer scientists. Or you can understand it at the level of the mechanistic thing which is going on: that there are these atoms bouncing around in the brain, and they lead to the outcome of that system. That is not in contradiction with the fact that it's also a decision-making system that's optimizing for some goal.
And I've never heard the description of the meaning of life structured so beautifully in layers. But you did miss one layer, which is the next step, which you're responsible for: creating the artificial intelligence layer on top of that. And I can't wait to see, well, I may not be around, but I can't wait to see what the next layer beyond that will be.
Well, let's just take that argument and pursue it to its natural conclusion. So the next level indeed is: how can our learning brain achieve its goals most effectively? Well, maybe it does so by us, as learning beings, building a system which is able to solve for those goals more effectively than we can. And so when we build a system to play the game of Go, when I said that I wanted to build a system that can play Go better than I can, I've enabled myself to achieve that goal of playing Go better than I could by directly playing it and learning it myself.
And so now a new layer has been created, which is systems which are able to achieve goals for themselves. And ultimately there may be layers beyond that, where they set sub-goals to parts of their own system in order to achieve those, and so forth. So the story of intelligence, I think, is a multi-layered one and a multi-perspective one.
We live in an incredible universe. David, thank you so much for dreaming of using learning to solve Go and building intelligent systems, and for actually making it happen, and for inspiring millions of people in the process. It's truly an honor.
Thank you so much for talking today.
Okay, thank you.
Thanks for listening to this conversation with David Silver, and thank you to our sponsors, MasterClass and Cash App. Please consider supporting the podcast by signing up to MasterClass at masterclass.com and downloading Cash App and using code LexPodcast. If you enjoy this podcast, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at @lexfridman. And now let me leave you with some words from David Silver: "My personal belief is that we've seen something of a turning point where we're starting to understand that many abilities like intuition and creativity that we've previously thought were in the domain only of the human mind are actually accessible to machine intelligence as well, and I think that's a really exciting moment in history."
The following is a conversation with David Silver, who leads the reinforcement learning research group at DeepMind and was the lead researcher on AlphaGo and AlphaZero, and co-led the AlphaStar and MuZero efforts and a lot of important work in reinforcement learning in general. I believe AlphaZero is one of the most important accomplishments in the history of artificial intelligence, and David is one of the key humans who brought AlphaZero to life, together with a lot of other great researchers at DeepMind. He's humble, kind, and brilliant. We were both jet-lagged but didn't care and made it happen. It was a pleasure and truly an honor to talk with David.
This conversation was recorded before the outbreak of the pandemic. For everyone feeling the medical, psychological, and financial burden of this crisis, I'm sending love your way. Stay strong. We're in this together. We'll beat this thing.
This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at @lexfridman, spelled F-R-I-D-M-A-N. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience.
Quick summary of the ads: two sponsors, MasterClass and Cash App. Please consider supporting the podcast by signing up to MasterClass at masterclass.com/lex and downloading Cash App and using code LexPodcast.
This show is presented by Cash App, the number one finance app in the App Store. When you get it, use code LexPodcast. Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with as little as one dollar. Since Cash App allows you to buy Bitcoin, let me mention that cryptocurrency in the context of the history of money is fascinating. I recommend The Ascent of Money as a great book on this history. Debits and credits on ledgers started around
30,000 years ago, the US dollar was created over two hundred years ago, and Bitcoin, the first decentralized cryptocurrency, was released just over ten years ago. So given that history, cryptocurrency is still very much in its early days of development, but it's still aiming to, and just might, redefine the nature of money. So again, if you get Cash App from the App Store or Google Play and use the code LexPodcast, you get ten dollars, and Cash App will also donate ten dollars to FIRST, an organization that is helping to advance robotics and STEM education for young people around the world.
This show is sponsored by MasterClass. Sign up at masterclass.com/lex to get a discount and to support this podcast. In fact, for a limited time now, if you sign up for an all-access pass for a year, you get another all-access pass to share with a friend. Buy one, get one free. When I first heard about MasterClass, I thought it was too good to be true. For one hundred eighty dollars a year, you get an all-access pass to watch courses from, to list some of my favorites: Chris Hadfield on space exploration, Neil deGrasse Tyson on scientific thinking and communication, Will Wright, the creator of SimCity and The Sims, on game design, Jane Goodall on conservation, Carlos Santana on guitar (his song "Europa" could be the most beautiful guitar song ever written), Garry Kasparov on chess, Daniel Negreanu on poker, and many, many more. Chris Hadfield explaining how rockets work and the experience of being launched into space alone is worth the money. For me, the key is to not be overwhelmed by the abundance of choice. Pick three courses you want to complete and watch each of them all the way through. It's not that long, but it's an experience that will stick with you for a long time, I promise. It's easily worth the money. You can watch it on basically any device. Once again, sign up at masterclass.com/lex to get a discount and to support this podcast. And now, here's my conversation with David Silver.
What was the first program you've ever written, and
what programming language? Do you remember?
I remember very clearly. My parents brought home this BBC Model B microcomputer. It was just this fascinating thing to me. I was about seven years old and couldn't resist just playing around with it. So I think my first program ever was writing my name out in different colors and getting it to loop and repeat that. And there was something magical about that, which just led to more and more.
How did you think about computers back then? Like the magical aspect of it, that you can write a program and there's this thing that you just gave birth to that's able to create visual elements and live on its own? Or did you not think of it in those romantic notions? Was it more like, oh, that's cool, I can solve some puzzles?
It was always more than solving puzzles. It was something where there were these limitless possibilities. Once you have a computer in front of you, you can do anything with it. I used to play with Lego with the same feeling. You can make anything you want out of Lego, but even more so with a computer; you're not constrained by the amount of kit you've got. And so I was fascinated by it and started pulling out the user guide and the advanced user guide and then learning. So I started in BASIC and then later 6502. My father also became interested in this machine and gave up his career to go back to school and study for a master's degree in artificial intelligence, funnily enough, at Essex University, when I was seven. So I was exposed to those things at an early age. He showed me how to program in Prolog and do things like querying your family tree, and those are some of my earliest memories of trying to figure things out on a computer.
Those are the early steps in computer science and programming, but when did you first fall in love with artificial intelligence, or with the ideas, the dreams of AI?
I think it was really when I
went to study at university. So I was an undergrad at Cambridge, studying computer science, and I really started to question: what really are the goals? Where do we want to go with computer science? And it seemed to me that the only step of major significance to take was to try and recreate something akin to human intelligence. If we could do that, that would be a major leap forward. And that idea, I certainly wasn't the first to have it, but it nestled within me somewhere and became like a bug. I really wanted to crack that problem.
So you had a notion that this is something that human beings can do, that it is possible to create an intelligent machine?
Well, I mean, unless you believe in something metaphysical, then what are our brains doing? Well, at some level they're information processing systems, which are able to take whatever information is in there, transform it through some form of program, and produce some kind of output, which enables that human being to do all the amazing things that they can do in this incredible world.
So then, do you remember the first time you've written a program that, because you also had an interest in games, do you remember the first time you wrote a program that beat you in a game, or beat you at anything, sort of achieved super-David-Silver-level performance?
So I used to work in the games industry. For five years I programmed games for my first job. It was an amazing opportunity to get involved in a startup company, and so I was involved in building AI at that time. And so for sure there was a sense of building handcrafted, what people used to call AI in the games industry, which I think is not really what we might think of as AI in its fullest sense, but something which is able to take actions in a way which makes things interesting and challenging for the human player. And at that time I was able to build
these handcrafted agents, which in certain limited cases could do things better than me, but mostly in these kind of twitch-like scenarios where they were able to do things faster, or because they had some pattern which they were able to exploit repeatedly. I think if we're talking about real AI, the first experience for me came after that, when I realized that this path I was on wasn't taking me towards it. It wasn't dealing with that bug which I still had inside me to really understand intelligence and try and solve it. Everything people were doing in games was short-term fixes rather than long-term vision. And so I went back to study for my PhD, which was, funnily enough, trying to apply reinforcement learning to the game of Go. And I built my first Go program using reinforcement learning, a system which would by trial and error play against itself and was able to learn which patterns were actually helpful to predict whether it was going to win or lose the game, and then choose the moves that led to the combination of patterns that would mean that you're more likely to win. And that system beat me.
How did that make you feel?
It made me feel good.
Was there sort of the, yeah, is it a mix of a sort of excitement, and was there a tinge of almost a fearful awe? You know, it's like in 2001: A Space Odyssey, kind of realizing that you've created something that has achieved human-level intelligence in this one particular little task. And in that case, I suppose neural networks weren't involved?
There were no neural networks in those days. This was pre-deep-learning revolution. But it was a principled self-learning system based on a lot of the principles which people are still using in deep reinforcement learning. How did I feel? I think I found it immensely satisfying that a system which was able to learn from first principles for itself was able to reach
the point that it was understanding this domain better than I could, and was able to outwit me. I don't think it was a sense of awe. It was a sense of satisfaction that something I felt should work had worked.
So to me, AlphaGo, and I don't know how else to put it, but to me AlphaGo and AlphaGo Zero mastering the game of Go is, again to me, the most profound and inspiring moment in the history of artificial intelligence. So you're one of the key people behind this achievement, and I'm Russian, so I really felt the first sort of seminal achievement, when Deep Blue beat Garry Kasparov in 1997. So as far as I know, the AI community at that point largely saw the game of Go as unbeatable in AI using the state-of-the-art brute-force methods, search methods. Even if you consider, at least the way I saw it, even if you consider arbitrary exponential scaling of compute, Go would still not be solvable, hence why it was thought to be impossible. So given that the game of Go was impossible to master, when was the dream for you? You just mentioned your PhD thesis of building the system that plays Go. When was the dream for you that you could actually build a computer program that achieves world-class, not necessarily beats the world champion, but achieves that kind of level of playing Go?
First of all, thank you, those are very kind words. And funnily enough, I just came from a panel where I was actually in a conversation with Garry Kasparov and Murray Campbell, who was the author of Deep Blue, and it was their first meeting together since the match. That was yesterday, so I'm literally fresh from that experience. So these are amazing moments when they happen. But where did it all start? Well, for me it started when I became fascinated in the game of Go. So I've grown up playing games. I've always had a fascination in board games. I played chess as a kid, I played Scrabble as a kid. When I was at university I discovered the game of Go, and to me it just
blew all of those other games out of the water. It was just so deep and profound in its complexity, with endless levels to it. What I discovered was that I could devote endless hours to this game, and I knew in my heart of hearts that no matter how many hours I would devote to it, I would never become a grandmaster. Or there was another path, and the other path was to try and understand how you could get some other intelligence to play this game better than I would be able to. And so even in those days I had this idea: what if it was possible to build a program that could crack this? And as I started to explore the domain, I discovered that this was really the domain where people felt deeply that if progress could be made in Go, it would really mean a giant leap forward for AI. It was the challenge where all other approaches had failed. This is coming out of the era you mentioned, which was in some sense the golden era for the classical methods of AI, like heuristic search. In the 90s they all fell one after another, not just chess with Deep Blue, but checkers, backgammon, Othello. There were numerous cases where systems built on top of heuristic search methods, these high-performance systems, had been able to defeat the human world champion in each of those domains. And yet in that same time period there was a million-dollar prize available for the game of Go, for the first system to beat a human professional player. And at the end of that time period, in the year 2000 when the prize expired, the strongest Go program in the world was defeated by a nine-year-old child, when that nine-year-old child was giving nine free moves to the computer at the start of the game to try and even things up. And a computer Go expert beat that same strongest program with 29 handicap stones, 29 free moves. So that's what the state of affairs was when I became interested in this problem in around
2003, when I started working on computer Go. There was very, very little in the way of progress towards meaningful performance, anything approaching human level. And it wasn't through lack of effort; people had tried many, many things. And so there was a strong sense that something different would be required for Go than had been needed for all of these other domains where AI had been successful. And maybe the single clearest example is that Go, unlike those other domains, had this kind of intuitive property: a Go player would look at a position and say, hey, here's this mess of black and white stones, but from this mess I can predict that this part of the board has become my territory, this part of the board has become your territory, and I've got this overall sense that I'm going to win, and that this is about the right move to play. And that intuitive sense of judgment, of being able to evaluate what's going on in a position, was pivotal to humans being able to play this game, and something that people had no idea how to put into computers. So this question of how to evaluate a position, how to come up with these intuitive judgments, was the key reason why Go was so hard, in addition to its enormous search space, and the reason why methods which had succeeded so well elsewhere failed in Go. And so people really felt deep down that in order to crack Go we would need to get something akin to human intuition, and if we got something akin to human intuition, we'd be able to solve many, many more problems in AI. So to me that was the moment where it's like, okay, this is not just about playing the game of Go, this is about something profound. And it was back to that bug which had been itching me all those years. This is the opportunity to do something meaningful and transformative. And I guess a dream was born.
That's a really interesting way
to put it. Almost this realization that you need to formulate Go as kind of a prediction problem versus a search problem was the intuition. I mean, maybe that's the wrong, crude term, but to give it the ability to kind of intuit things about the positional structure of the board. Okay, but what about the learning part of it? Did you have a sense that learning has to be part of the system? Again, something that hasn't really, as far as I think, except with TD-Gammon in the 90s, with RL a little bit, hasn't been part of those state-of-the-art game-playing systems.
So I strongly felt that learning would be necessary, and that's why my PhD topic back then was trying to apply reinforcement learning to the game of Go. And not just learning of any type, but I felt that the only way to really have a system progress beyond human levels of performance wouldn't just be to mimic how humans do it, but to understand for itself. And how else can a machine hope to understand what's going on except through learning? If you're not learning, what else are you doing? Well, you're putting all the knowledge into the system, and that just feels like something which decades of AI have told us is maybe not a dead end, but certainly has a ceiling to the capabilities. It's known as the knowledge acquisition bottleneck: the more you try to put into something, the more brittle the system becomes. And so you just have to have learning. That's the only way you're going to be able to get a system which has sufficient knowledge in it, millions and millions of pieces of knowledge, billions, trillions, in a form that it can actually apply for itself, and understand how those billions and trillions of pieces of knowledge can be leveraged in a way which will actually lead it towards its goal without conflict or other issues.
Yeah, I mean, if I put myself back in that time, I just wouldn't think like that. Without a
good demonstration of RL, I would think more in the symbolic AI way, not learning, but sort of a simulation of a knowledge base, like a growing knowledge base, but it would still be sort of pattern-based, like basically having little rules that you kind of assemble together into a large knowledge base.
Well, in a sense that was the state of the art back then. So if you look at the Go programs which had been competing for this prize I mentioned, they were an assembly of different specialized systems, some of which used huge amounts of human knowledge to describe how you should play the opening, all the different patterns that were required to play well in the game of Go, endgame theory, combinatorial game theory, combined with more principled search-based methods which were trying to solve for particular sub-parts of the game, like life and death, connecting groups together, all these amazing subproblems that just emerge in the game of Go. There were different pieces all put together into this collage, which together would try and play against a human. And although not all of the pieces were handcrafted, the overall effect was nevertheless still brittle, and it was hard to make all these pieces work well together. And so really what I was pressing for, and the main innovation of the approach I took, was to go back to first principles and say, well, let's back off that and try and find a principled approach where the system can learn for itself, just from the outcome. Learn for itself: if you try something, did that help or did it not help? And only through that procedure can you arrive at knowledge which is verified. The system has to verify it for itself, not relying on any other third party to say this is right or this is wrong. So that principle was already very important in those days, but unfortunately we were missing some important pieces back then.
So before we dive into maybe discussing the beauty of reinforcement learning, let's step back. We kind of skipped it a bit, but the rules of the game of Go: what are the elements of it, perhaps contrasting to chess, that you really enjoyed as a human being, and also that make it really difficult as an AI machine learning problem?
So the game of Go has remarkably simple rules, in fact so simple that people have speculated that if we were to meet alien life at some point, we wouldn't be able to communicate with them, but we would be able to play Go with them; they'd probably have discovered the same rule set. So the game is played on a 19 by 19 grid, and you play on the intersections of the grid, and the players take turns. And the aim of the game is very simple: it's to surround as much territory as you can, as many of these intersections, with your stones, and to surround more than your opponent does. And the only nuance to the game is that if you fully surround your opponent's piece, then you get to capture it and remove it from the board, and it counts as your own territory. Now from those very simple rules, immense complexity arises. There are kind of profound strategies in how to surround territory, how to trade off between making solid territory yourself now compared to building up influence that will help you acquire territory later in the game, how to connect groups together, how to keep your own groups alive, which patterns of stones are most useful compared to others. There's just immense knowledge. The game was discovered thousands of years ago, and human Go players have built up this immense knowledge base over the years. It's studied very deeply and played by something like 50 million players across the world, mostly in China, Japan, and Korea, where it's an important part of the culture, so much so that it's considered one of the four ancient arts that was required of Chinese scholars. So there's a deep history there.
But there are interesting qualities. So if I compare it to chess: chess, in the same way as Go is in Chinese culture, chess in Russia is also considered one of the sacred arts. So if we contrast Go with chess, there are interesting qualities about Go. Maybe you can correct me if I'm wrong, but the evaluation of a particular static board is not as reliable. In chess you can kind of assign points to the different units, and it's a pretty good measure of who's winning and who's losing. In Go it's not so clear.
Yeah, so in the game of Go you find yourself in a situation where both players have played the same number of stones. Actually, captures at a strong level of play happen very rarely, which means that at any moment in the game you've got the same number of white stones and black stones, and the only thing which differentiates how well you're doing is this intuitive sense of where the territories are ultimately going to form on this board. And if you look at the complexity of a real Go position, it's mind-boggling, that kind of question of what will happen in 300 moves from now, when you see just a scattering of twenty white and black stones intermingled. And so that challenge is the reason why position evaluation is so hard in Go compared to other games. In addition to that, it has an enormous search space. There are around 10 to the 170 positions in the game of Go. That's an astronomical number, and that search space is so great that traditional heuristic search methods that were so successful in things like Deep Blue and chess programs just kind of fall over in Go.
So at which point did reinforcement learning enter your life, your research life, your way of thinking? We just talked about learning, but reinforcement learning is a very particular kind of learning, one that's both philosophically sort of profound but also one that's pretty difficult to get to work, if we look back in
at least the early days. So when did that enter your life, and how did that work progress?
So I had just finished working in the games industry at this startup company, and I took a year out to discover for myself exactly which path I wanted to take. I knew I wanted to study intelligence, but I wasn't sure what that meant at that stage. I really didn't feel I had the tools to decide on exactly which path I wanted to follow. So during that year I read a lot, and one of the things I read was Sutton and Barto, the sort of seminal textbook, an introduction to reinforcement learning. And when I read that textbook, I just had this resonating feeling that this is what I understood intelligence to be, and this was the path that I felt would be necessary to go down to make progress in AI. So I got in touch with Rich Sutton and asked him if he would be interested in supervising me on a PhD thesis in computer Go, and he basically said that if he's still alive, he'd be happy to. Unfortunately, he'd been struggling with very serious cancer for some years, and he really wasn't confident at that stage that he'd even be around to see the end of it. But fortunately that part of the story worked out very happily, and I found myself out there in Alberta. They've got a great games group out there with a history of fantastic work in board games, as well as Rich Sutton, the father of RL. So it was the natural place for me to go, in some sense, to study this question. And the more I looked into it, the more strongly I felt that this wasn't just the path to progress in computer Go, but really this was the thing I'd been looking for. This was really an opportunity to frame what intelligence means, like what are the goals of AI, in a single clear problem definition, such that if we're able to solve that single clear problem definition, in some sense we've cracked the problem of AI.
So to you, reinforcement learning ideas, at least sort
of echoes of it, would be at the core of intelligence. It is at the core of intelligence, and if we ever create a human-level intelligence system, it would be at the core of that kind of system?
Let me say it this way. I think it's helpful to separate out the problem from the solution. So I see the problem of intelligence, I would say it can be formalized as the reinforcement learning problem, and that formalization is enough to capture most, if not all, of the things that we mean by intelligence, that they can all be brought within this framework, and it gives us a way to access them in a meaningful way that allows us as scientists to understand intelligence and us as computer scientists to build them. And so in that sense I feel that it gives us a path, maybe not the only path, but a path towards AI. And so do I think that any system in the future that's solved AI would have to have RL within it? Well, I think if you ask that, you're asking about the solution methods. I would say that if we have such a thing, it would be a solution to the RL problem. Now, what particular methods have been used to get there? Well, we should keep an open mind about the best approaches to actually solve any problem. And the things we have right now for reinforcement learning, I believe they've got a lot of legs, but maybe we're missing some things. Maybe there are going to be better ideas. I think we should remain modest. We're at the early days of this field, and there are many amazing discoveries ahead of us.
For sure. The specifics, especially of the different kinds of RL approaches currently, there could be other things that fall under the very large umbrella of RL. But if it's okay, can we take a step back and kind of ask the basic question of what is, to you, reinforcement learning?
So reinforcement learning is the study and the science and the problem of intelligence in the form of an
agent that interacts with an environment. So the problem it's trying to solve is represented by some environment, like the world in which that agent is situated. And the goal of RL is clear: the agent gets to take actions, those actions have some effect on the environment, and the environment gives back an observation to the agent, saying, you know, this is what you see or sense. And one special thing which it gives back is called the reward signal: how well it's doing in the environment. And the reinforcement learning problem is to simply take actions over time so as to maximize that reward signal.
So a couple of basic questions. What types of RL approaches are there? I don't know if there's a nice, brief, in-words way to paint the picture of sort of value-based, model-based, policy-based reinforcement learning.
Yeah. So now if we think about, okay, there's this ambitious problem definition of RL. It's truly ambitious. It's trying to capture and encircle all of the things in which an agent interacts with an environment and say, well, how can we formalize and understand what it means to crack that? Now let's think about the solution method. Well, how do you solve a really hard problem like that? One approach you can take is to decompose that very hard problem into pieces that work together to solve that hard problem. And so you can kind of look at the decomposition that's inside the agent's head, if you like, and ask, well, what form does that decomposition take? And some of the most common pieces that people use when they're putting this solution method together are: whether or not that solution has a value function, which means, is it explicitly trying to predict how much reward it will get in the future? Does it have a representation of a policy, which means something which is deciding how to pick actions? Is that decision-making process explicitly represented? And
is there a model in the system? Is there something which is explicitly trying to predict what will happen in the environment? And so those three pieces are, to me, some of the most common building blocks, and I understand the different choices in RL as choices of whether or not to use those building blocks when you're trying to decompose the solution. Should I have a value function represented? Should I have a policy represented? Should I have a model represented? And there are combinations of those pieces, and of course other things that you could add into the picture as well, but those three fundamental choices give rise to some of the branches of RL with which we're very familiar.
And so, as you mentioned, there is the choice of what's specified or modeled explicitly, and the idea is that all of these are somehow implicitly learned within the system. So it's almost a choice of how you approach a problem. Do you see those as fundamental differences, or are these almost like small specifics, like the details of how you solve the problem, but they're not fundamentally different from each other?
I think the fundamental idea is maybe at the higher level. The first step of the decomposition is really to say, well, how are we really going to solve any kind of problem where you're trying to figure out how to take actions just from a stream of observations? You've got some agent situated in its sensorimotor stream, getting all these observations and getting to take these actions, and what should it do? How can you even broach that problem? The complexity of the world is so great that you can't even imagine how to build a system that would understand how to deal with that. And so the first step of this decomposition is to say, well, you have to learn. The system has to learn for itself. And note that the reinforcement learning problem doesn't actually stipulate that you have to learn. You could maximize
your rewards without learning, it just wouldn't do a very good job of it.
Yes.
So learning is required because it's the only way to achieve good performance in any sufficiently large and complex environment. So that's the first step, and that step gives commonality to all of the other pieces, because now you might ask, well, what should you be learning? What does learning even mean in this sense? Learning might mean, well, you're trying to update the parameters of some system, which is then the thing that actually picks the actions. And those parameters could be representing anything. They could be parameterizing a value function or a model or a policy. And so in that sense there's a lot of commonality, in that whatever is being represented there is the thing which is being learned, and it's being learned with the ultimate goal of maximizing rewards. But the way in which you decompose the problem is really what gives the semantics to the whole system. Are you trying to learn something to predict well, like a value function or a model? Are you learning something to perform well, like a policy? The form of that objective is kind of giving the semantics to the system. And so it really is, at the next level down, a fundamental choice, and we have to make those fundamental choices as system designers, or enable our algorithms to be able to learn how to make those choices for themselves.
So then the next step. You mentioned the very first thing you have to deal with is, can you even take in this huge stream of observations and do anything with it? So the natural next basic question is: what is deep reinforcement learning, and what is this idea of using neural networks to deal with this huge incoming stream?
So amongst all the approaches for reinforcement learning, deep reinforcement learning is one family of solution methods that tries to utilize the powerful representations that are offered by neural networks to represent any
of these different components of the solution, of the agent, whether it's the value function or the model or the policy. The idea of deep learning is to say, well, here's a powerful toolkit that's so powerful that it's universal, in the sense that it can represent any function and it can learn any function. And so if we can leverage that universality, that means that whatever we need to represent for our policy, or for our value function, or for a model, deep learning can do it. So deep learning is one approach that offers us a toolkit that has no ceiling to its performance. As we start to put more resources into the system, more memory and more computation and more data, more experience, more interactions with the environment, these are systems that can just get better and better and better at doing whatever the job is we've asked them to do. Whatever we've asked that function to represent, it can learn a function that does a better and better job of representing that knowledge, whether that knowledge be estimating how well you're going to do in the world, the value function; whether it's going to be choosing what to do in the world, a policy; or whether it's understanding the world itself, what's going to happen next, the model.
Nevertheless, the fact that neural networks are able to learn incredibly complex representations that allow you to do the policy, the model, or the value function is, at least to my mind, exceptionally beautiful and surprising. Was it surprising to you? Can you still believe it works as well as it does? Do you have good intuition about why it works at all, and works as well as it does?
Let me take two parts to that question. I think it's not surprising to me that the idea of reinforcement learning works, because in some sense I feel it's the only thing which can, ultimately. And so I feel we have to address it, and success must be possible, because we have
examples of intelligence, and it must at some level be possible to acquire experience and use that experience to do better, in a way which is meaningful to environments of the complexity that humans can deal with. It must be. Am I surprised that our current systems can do as well as they can do? I think one of the big surprises for me and a lot of the community is really the fact that deep learning can continue to perform so well despite the fact that these neural networks have these incredibly nonlinear, kind of bumpy surfaces, which, to our kind of low-dimensional intuitions, make it feel like surely you're just going to get stuck, and learning will get stuck because you won't be able to make any further progress. And yet the big surprise is that learning continues, and these what appear to be local optima turn out not to be, because in high dimensions, when we make really big neural nets, there's always a way out, and there's a way to go even lower. And then you're still not in a local optimum, because there's some other pathway that will take you out and take you lower still. And so no matter where you are, learning can proceed and do better and better and better without bound. And so that is a surprising and beautiful property of neural nets, which I find elegant and somewhat shocking that it turns out to be the case.
As you said, which I really like: to our low-dimensional intuitions, that's surprising.
Yeah, we're very tuned to working within a three-dimensional environment, and so to start to visualize what a billion-dimensional neural network surface that you're trying to optimize over even looks like is very hard for us. And so I think that really, if you try to account for essentially the AI winter, where people gave up on neural networks, I think it's really down to that lack of ability to generalize from low dimensions to high dimensions, because back then we were
in the low-dimensional case. People could only build neural nets with, you know, 50 nodes in them or something, and to imagine that it might be possible to build a billion-dimensional neural net, and that it might have a completely, qualitatively different property, was very hard to anticipate. And I think even now we're starting to build the theory to support that, and it's incomplete at the moment, but all of the theory seems to be pointing in the direction that indeed this is an approach which truly is universal, both in its representational capacity, which was known, but also in its learning ability, which is surprising.
And it makes one wonder what else we're missing due to our low-dimensional intuitions that will seem obvious once it's discovered.
I often wonder, when we one day do have AIs which are superhuman in their abilities to understand the world, what will they think of the algorithms that we developed back now? Will we look back at these days and feel that these algorithms were naive first steps, or will they still be the fundamental ideas which are used even in 100, 1,000, or 10,000 years?
Yeah, and they'll watch back to this conversation with a smile, maybe a little bit of a laugh. I mean, my sense is, just like when we used to think that the Sun revolved around the Earth, they'll see our systems of today in reinforcement learning as too complicated, that the answer was simple all along. There's something, just like you said in the game of Go, I mean, I love those systems like cellular automata, where there are simple rules from which incredible complexity emerges. So it feels like there might be some very simple approaches, just like Rich Sutton says, right? These simple methods, with compute, over time seem to prove to be the most effective.
I 100% agree. I think that if we try to anticipate what will generalize well
into the future, I think it's likely to be the case that it's the simple, clear ideas which will have the longest legs and which will carry us farthest into the future. Nevertheless, we're in a situation where we need to make things work today, and sometimes that requires putting together more complex systems where we don't have the full answers yet as to what those minimal ingredients might be.
So speaking of which, if we could take a step back to Go: what was MoGo, and what was the key idea behind this system?
So back during my PhD on computer Go, around about that time, there was a major new development which actually happened in the context of computer Go, and it was really a revolution in the way that heuristic search was done. The idea was essentially that a position could be evaluated, or a state in general could be evaluated, not by humans saying whether that position is good or not, or even humans providing rules as to how you might evaluate it, but instead by allowing the system to randomly play out the game until the end multiple times, and taking the average of those outcomes as the prediction of what will happen. So for example, if you're in the game of Go, the intuition is that you take a position and you get the system to kind of play random moves against itself all the way to the end of the game, and you see who wins. And if black ends up winning more of those random games than white, well, you say, hey, this is a position that favors black. And if white ends up winning more of those random games than black, then it favors white. So that idea was known as Monte Carlo search, and a particular form of Monte Carlo search that became very effective, and was developed in computer Go first by Rémi Coulom in 2006 and then taken further by others, was something called Monte Carlo tree search, which basically takes that same idea and uses that insight to evaluate every node of a search tree: each node is evaluated by the average of the random playouts from that
node onwards. And this idea was very powerful and suddenly led to huge leaps forward in the strength of computer Go playing programs. And among those, the strongest of the Go playing programs in those days was a program called MoGo, which was the first program to actually reach human master level on small boards, nine-by-nine boards. This was a program by someone called Sylvain Gelly, who's a good colleague of mine, and I worked with him a little bit in those days as part of my PhD thesis. And MoGo was a first step towards the later successes we saw in computer Go, but it was still missing a key ingredient. MoGo was evaluating purely by random rollouts against itself, and in a way it's truly remarkable that random play gives you anything at all.
Yeah.
Like, why, in this perfectly deterministic game that's very precise and involves these very exact sequences, why is it that randomization is helpful? And so the intuition is that randomization captures something about the nature of the search tree, that from a position you're understanding the nature of the search tree from that node onwards by using randomization. And this was a very powerful idea.
And I've seen this in other spaces, talked to Richard Karp and so on: randomized algorithms somehow magically are able to do exceptionally well, and simplifying the problem somehow makes you wonder about the fundamental nature of randomness in our universe. It seems to be a useful thing. But so from that moment, can you maybe tell the origin story and the journey of AlphaGo?
Yeah. So programs based on Monte Carlo tree search were a first revolution, in the sense that they led suddenly to programs that could play the game to a reasonable level. But they plateaued. It seemed that no matter how much effort people put into these techniques, they couldn't exceed the level of amateur dan-level Go players. So strong players, but not anywhere near the level of professionals,
never mind the world champion. And so that brings us to the birth of AlphaGo, which happened in the context of a startup company known as DeepMind, where a project was born. The project was really a scientific investigation, where myself and Aja Huang and an intern, Chris Maddison, were exploring a scientific question. And that scientific question was really: is there another, fundamentally different approach to this key question of Go, the key challenge of how can you build that intuition? How can you just have a system that could look at a position and understand what move to play, or how well you're doing in that position, who's going to win? The deep learning revolution had just begun. Competitions like ImageNet had suddenly been won by deep learning techniques back in 2012, and following that it was natural to ask, well, if deep learning is able to scale up so effectively with images, to understand them enough to classify them, well, why not Go? Why not take the black and white stones of the Go board and build a system which can understand for itself what that means in terms of what move to pick, or who's going to win the game, black or white? And so that was our scientific question, which we were probing and trying to understand. And as we started to look at it, we discovered that we could build such a system. In fact, our very first paper on AlphaGo was actually a pure deep learning system which was trying to answer this question, and we showed that a pure deep learning system with no search at all was actually able to reach human dan level, master level, at the full game of Go, on 19-by-19 boards. And so without any search at all, suddenly we had systems which were playing at the level of the best Monte Carlo tree search systems, the ones with randomized rollouts.
So first, I'm sorry to interrupt, but that's kind of a groundbreaking notion. That's basically a definitive step away from a couple of
decades of essentially search dominating AI.
Yeah.
So how did that make you feel? Was that surprising, from a scientific perspective in general? How did it make you feel?
I found this to be profoundly surprising. In fact, it was so surprising that we had a bet back then, and like many good projects, bets are quite motivating. Our bet was whether it was possible for a system based purely on deep learning, with no search at all, to beat a dan-level human player. And so we had someone who joined our team who was a dan-level player. He came in and we had this first match against him.
Which side of the bet were you on, by the way, the losing or the winning side?
I tend to be an optimist about the power of deep learning and reinforcement learning. So the system won, and we were able to beat this human dan-level player. And for me that was the moment where it's like, okay, something special is afoot here. We have a system which, without search, is able to already just look at this position and understand things as well as a strong human player. And from that point onwards I really felt that reaching the top levels of human play, professional level, world champion level, was actually an inevitability. And if it was an inevitable outcome, I was rather keen that it would be us that achieved it. So we scaled up. I had lots of conversations back then with Demis Hassabis, the head of DeepMind, who was extremely excited, and we made the decision to scale up the project and brought more people on board. And so AlphaGo became something where we had a clear goal, which was to try and crack this outstanding challenge of AI, to see if we could beat the world's best players. And this led, within the space of not so many months, to playing against the European champion, Fan Hui, in a match which became memorable in history as the first time a Go
program had ever beaten a professional player. And at that time we had to make a judgment as to when and whether we should go and challenge the world champion. This was a difficult decision to make. Again, we were basing our predictions on our own progress, and had to estimate, based on the rapidity of our own progress, when we thought we would exceed the level of the human world champion. And we tried to make an estimate and set up a match, and that became the AlphaGo versus Lee Sedol match in 2016.
And we should say, spoiler alert, that AlphaGo was able to defeat Lee Sedol.
That's right, yeah.
So maybe we could take even a broader view. AlphaGo involves both learning from expert games and, as far as I remember, a self-play component, where it learns by playing against itself. But in your sense, what was the role of learning from experts there? And in terms of your self-evaluation of whether you can take on the world champion, what was the thing that you were trying to do more of? Sort of train more on expert games, or was there another... I'm asking so many poorly phrased questions, but did you have a hope, a dream, that self-play would be the key component at that moment?
Yes. So in the early days of AlphaGo, we used human data to explore the science of what deep learning can achieve. And so when we had our first paper that showed that it was possible to predict the winner of the game, that it was possible to suggest moves, that was done using human data.
Solely human data?
Yes. And so the reason that we did it that way was that at that time we were exploring separately the deep learning aspect from the reinforcement learning aspect. That was the part which was new and unknown to me at that time: how far could that be stretched? Once we had that, it then became natural to try and use that same representation and see if we could learn for ourselves using that same representation. And so right from the beginning, actually, our goal had been to build a
system using self-play, and to us the human data, right from the beginning, was an expedient step to help us, for pragmatic reasons, to go faster towards the goals of the project than we might be able to starting solely from self-play. And so in those days we were very aware that we were choosing to use human data, and that that might not be the long-term holy grail of AI, but that it was something which was extremely useful to us. It helped us to understand the system. It helped us to build deep learning representations which were clear and simple and easy to use. And so really I would say it served a purpose not just as part of the algorithm, but as something which I continue to use in our research today, which is trying to break down a very hard challenge into pieces which are easier to understand for us as researchers and developers. So if you use a component based on human data, it can help you to understand the system, such that then you can build the more principled version later that does it for itself.
So as I said, the AlphaGo victory, and I don't think I'm romanticizing this notion, I think it's one of the greatest moments in the history of AI. So were you cognizant of the magnitude of the accomplishment at the time? I mean, are you cognizant of it even now? Because to me, I feel like, as we mentioned, when the AGI systems of the future look back, I think they'll look back at the AlphaGo victory as like, holy crap, they figured it out. This is where it started.
Well, thank you again. I mean, it's funny, because I guess I've been working on computer Go for a long time. At the time of the AlphaGo match, I'd been working on computer Go for more than a decade, and throughout that decade I'd had this dream of what it would be like to actually be able to build a system that could play against the world champion. And I imagined that that would be an interesting moment, that maybe, you
know, some people might care about that, and that this might be a nice achievement. But I think when I arrived in Seoul and discovered the legions of journalists that were following us around, and the 100 million people that were watching the match online live, I realized that I had been off in my estimation of how significant this moment was by several orders of magnitude. And so there was definitely an adjustment process to realize that this was something which the world really cared about, and which was a watershed moment. And I think there was that moment of realization. It was also a little bit scary, because if you go into something thinking it's going to be maybe of interest, and then discover that 100 million people are watching, it suddenly makes you worry about whether some of the decisions you've made were really the best ones, or the wisest, or were going to lead to the best outcome. And we knew for sure that there were still imperfections in AlphaGo which were going to be exposed to the whole world watching. And so, yeah, it was, I think, a great experience, and I feel privileged to have been part of it, privileged to have led that amazing team. I feel privileged to have been in a moment of history, like you say, but also lucky that in a sense I was insulated from the knowledge of it. I think it would have been harder to focus on the research if the full reality of what was going to come to pass had been known to me and the team. We were in our bubble, and we were working on research, and we were trying to answer the scientific questions, and then, bam, the public sees it. And I think it was better that way, in retrospect.
Were you confident? I guess, what were the chances that you could get the win? So just like you said, I'm a little bit more familiar with another accomplishment that we may not even get a chance to talk about. I talked to Oriol Vinyals about AlphaStar, which is
another incredible accomplishment. But with AlphaStar and beating StarCraft, there was already a track record. With AlphaGo, this is like the really first time you get to see reinforcement learning face the best human in the world. So what was your confidence like? What were the odds?
Well, we actually... was there a bet? Funnily enough, there was. So just before the match, we weren't betting on anything concrete, but we all held out a hand. Everyone in the team held out their hand at the beginning of the match, and the number of fingers that they had out on that hand was supposed to represent how many games they thought we would win against Lee Sedol. And there was an amazing spread in the team's predictions. But I have to say, I predicted 4-1, and the reason was based purely on data. I'm a scientist first and foremost, and one of the things which we had established was that AlphaGo, in around one in five games, would develop something which we called a delusion, which was a kind of hole in its knowledge, where it wasn't able to fully understand everything about the position. And that hole in its knowledge would persist for tens of moves throughout the game. And we knew two things. We knew that if there were no delusions, AlphaGo seemed to be playing at a level that was far beyond any human capabilities. But we also knew that if there were delusions, the opposite was true. And in fact, that's what came to pass. We saw all of those outcomes. And Lee Sedol, in one of the games, played a really beautiful sequence that AlphaGo just hadn't predicted, and after that it led it into this situation where it was unable to really understand the position fully, and found itself in one of these delusions. So indeed, yeah, 4-1 was the outcome.
So yeah, can you maybe speak to it a little bit more? What were the five games like? What happened? Are there interesting things that
come to memory in terms of the play of the human and the machine?
So I remember all of these games vividly, of course. Moments like these don't come too often in the lifetime of a scientist. The first game was magical because it was the first time that a computer program had defeated a world champion in this grand challenge of Go. And there was a moment where AlphaGo invaded Lee Sedol's territory towards the end of the game, and that's quite an audacious thing to do. It's like saying, hey, you thought this was going to be your territory in the game, but I'm going to stick a stone right in the middle of it and prove to you that I can break it up. And Lee Sedol's face just dropped. He wasn't expecting a computer to do something that audacious. The second game became famous for a move known as move 37. This was a move played by AlphaGo that broke all of the conventions of Go. The Go players were so shocked by this that they thought maybe the operator had made a mistake. They thought that there was something crazy going on, and it just broke every rule that Go players are taught from a very young age. They're just taught that this kind of move, called the shoulder hit, you can only play on the third line or the fourth line, and AlphaGo played it on the fifth line. And it turned out to be a brilliant move, and made this beautiful pattern in the middle of the board that ended up winning the game. And so this really was a clear instance where we could say computers exhibited creativity, that this was really a move that humans hadn't known about, hadn't anticipated, and computers discovered this idea. They were the ones to say, actually, here's a new idea, something new, not in the domain of human knowledge of the game. And now the humans think this is a reasonable thing to do, and it's part of Go knowledge now. The third game: something special happens when you play
against a human world champion, which again I hadn't anticipated before going there, which is, these players are amazing. Lee Sedol was a true champion, an 18-time world champion, and had this amazing ability to probe AlphaGo for weaknesses of any kind. And in the third game he was losing, and we felt we were sailing comfortably to victory, but he managed to, from nothing, stir up this fight and build what's called a double ko, these kind of repetitive positions. And he knew that historically no computer Go program had ever been able to deal correctly with double ko positions, and he managed to summon one out of nothing. And so for us this was a real challenge. Would AlphaGo be able to deal with this, or would it just kind of crumble in the face of this situation? And fortunately it dealt with it perfectly. The fourth game was amazing in that Lee Sedol appeared to be losing this game. AlphaGo thought it was winning. And then Lee Sedol did something which I think only a true world champion can do, which is he found a brilliant sequence in the middle of the game, a brilliant sequence that led him to really just transform the position. It's just a piece of genius, really. And after that, AlphaGo's evaluation just tumbled. It thought it was winning this game, and all of a sudden it tumbled and said, oh, now I've got no chance, and it started to behave rather oddly at that point. In the final game, for some reason, we as a team were convinced, having seen AlphaGo in the previous game suffer from delusions, that it was suffering from another delusion. We were convinced that it was misevaluating the position and that something was going terribly wrong. And it was only in the last few moves of the game that we realized that actually, although it had been predicting it was going to win all the way through, it really was. And so somehow it just taught us yet again that you have
to have faith in your systems when they exceed your own level of ability and your own judgment. You have to trust in them to know better than you, the designer. Once you've bestowed in them the ability to judge better than you can, then trust the system to do so.
So just like in the case of Deep Blue beating Garry Kasparov, so Garry, I think that was the first time he'd ever lost, actually, to anybody. And I mean, there's a similar situation with Lee Sedol. It's a tragic loss for humans, but a beautiful one. I think from the tragedy sort of emerges over time the kind of inspiring story. But Lee Sedol recently announced his retirement. I don't know if we can look too deeply into it, but he did say that "even if I become number one, there's an entity that cannot be defeated." So what do you think about these words? What do you think about his retirement from the game of Go?
Well, let me take you back first of all to the first part of your comment, about Garry Kasparov, because actually at the panel yesterday he specifically said that when he first lost to Deep Blue, he viewed it as a failure. He viewed that this had been a failure of his. But later on in his career, he said he'd come to realize that actually it was a success. It was a success for everyone, because this marked a transformational moment for AI. And so even for Garry Kasparov, he came to realize that that moment was pivotal, and actually meant something much more than his personal loss in that moment. Lee Sedol, I think, was much more cognizant of that even at the time. So in his closing remarks to the match, he really felt very strongly that what had happened in the AlphaGo match was not only meaningful for AI but for humans as well. And he felt as a Go player that it had opened his horizons and meant that he could start exploring new things. It brought his joy back for the game of Go, because it had broken all of the conventions and barriers and
meant that, you know, suddenly anything was possible again. And so, you know, I was sad to hear that he'd retired, but he's been a great world champion over many, many years, and I think he'll be remembered for that evermore. He'll be remembered as the last person to beat AlphaGo. I mean, after that we increased the power of the system, and the next version of AlphaGo beat the other strong human players 60 games to nil. So, you know, what a great moment for him, and something to be remembered for.
It's interesting that you spent time at AAAI on a panel with Garry Kasparov. I mean, I'm almost just curious to learn about the conversations you've had with Garry, because he's also now written a book about artificial intelligence. He's thinking about AI, he has kind of a view of it, and he talks about AlphaGo a lot. What's your sense? Arguably, and I'm not just being Russian, I think Garry is the greatest chess player of all time, probably one of the greatest game players of all time, and you were sort of at the center of creating a system that beats one of the greatest players of all time. So what's that conversation like? Is there anything, any interesting digs, any bets, any funny things, any profound things?
So Garry Kasparov has an incredible respect for what we did with AlphaGo, and, you know, it's an amazing tribute coming from him of all people, that he really appreciates and respects what we've done. And I think he feels that the progress which has happened in computer chess, where later, after AlphaGo, we built the AlphaZero system, which defeated the world's strongest chess programs; to Garry Kasparov, that moment in computer chess was more profound than Deep Blue. And the reason he believes it mattered more was because it was done with learning, and a system which was able to discover for itself new principles, new ideas, which were able to play the
game in a way which he hadn't always known about, or anyone. And in fact, one of the things I discovered at this panel was that the current world champion, Magnus Carlsen, apparently recently commented on his improvement in performance, and he attributes it to AlphaZero: he's been studying the games of AlphaZero, he's changed his style to play more like AlphaZero, and it's led to him actually increasing his rating to a new peak.
Yeah, I guess to me, just like to Garry, the inspiring thing is that, just like you said, reinforcement learning, deep learning, machine learning feels like what intelligence is. And, you know, you could attribute it to sort of a bitter viewpoint, from Garry's perspective, from us humans' perspective, saying that the sort of pure search that IBM's Deep Blue was doing is not really intelligence, but somehow it didn't feel like it. And so that's the magical thing. I'm not sure what it is about learning that feels like intelligence, but it does.
So I think we should not demean the achievements of what was done in previous eras of AI. I think that Deep Blue was an amazing achievement in itself, and that heuristic search of the kind that was used by Deep Blue had some powerful ideas in there, but it also missed some things. The fact that the evaluation function, the way that the chess position was understood, was created by humans and not by the machine is a limitation, which means that there's a ceiling on how well it can do. But maybe more importantly, it means the same idea cannot be applied in other domains where we don't have access to the kind of human grandmasters, and that ability to kind of encode exactly their knowledge into an evaluation function. And the reality is that the story of AI is that, you know, most domains turn out to be of the second type, where knowledge is messy, it's hard to extract from experts, or it isn't even available. And so we need to solve problems in a different way.
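To make the point about human-authored evaluation functions concrete, here is a minimal sketch in Python. It is not Deep Blue's evaluation function, which the conversation does not describe in detail. The piece values, the `material_eval` name, and the use of a FEN piece-placement string as input are all assumptions chosen for illustration. What it shows is that every number in such a function is knowledge a person typed in, rather than something the system learned.

```python
# Hypothetical hand-written evaluation: material balance only.
# The piece values below are a common human convention, entered by hand.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}


def material_eval(placement: str) -> int:
    """Return white's material minus black's material.

    `placement` is the piece-placement field of a FEN string:
    uppercase letters are white pieces, lowercase are black,
    digits are runs of empty squares, '/' separates ranks.
    """
    score = 0
    for ch in placement:
        value = PIECE_VALUES.get(ch.lower())
        if value is None:
            continue  # skip digits (empty squares) and '/' separators
        score += value if ch.isupper() else -value
    return score


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
BLACK_QUEEN_MISSING = "rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"

if __name__ == "__main__":
    print(material_eval(START))
    print(material_eval(BLACK_QUEEN_MISSING))
```

A function like this only works because chess experts can say what a queen is worth. In a domain with no such experts, there is nothing to write in the table, which is the limitation described above.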
And I think AlphaGo is a step towards solving things in a way which puts learning as a first-class citizen and says systems need to understand for themselves how to understand the world, how to judge the value of any action that they might take within that world, in any state they might find themselves in. And in order to do that, we make progress towards AI.
Yeah, so one of the nice things about taking a learning approach to the game of Go, to game playing, is that the things you learn, the things you figure out, are actually going to be applicable to other problems, to real-world problems. I mean, there's two really interesting things about AlphaGo. One is the science of it, just the science of learning, the science of intelligence. And then the other is that you're actually figuring out how to build systems that would be potentially applicable in other applications: medical, autonomous vehicles, robotics. I mean, it just opens the door to all kinds of applications. So the next incredible step, really the profound step, is probably AlphaGo Zero. I mean, it's arguable, I kind of see them all as the same place, and perhaps you were already thinking that AlphaGo Zero was always going to be the natural next step, but it's removing the reliance on human expert games for pre-training, as you mentioned. So how big of an intellectual leap was this, that self-play could achieve superhuman-level performance on its own? And maybe could you also say what is self-play? We've kind of mentioned it a few times.
So let me start with self-play. The idea of self-play is something which is really about systems learning for themselves, but in the situation where there's more than one agent. And so if you're in a game, and a game is played between two players, then self-play is really about understanding that game just by playing games against yourself rather than against any actual real opponent. And so it's a way to kind
of discover strategies without having to actually go out and play against any particular human player, for example. The main idea of AlphaZero was really to, you know, try and step back from any of the knowledge that we'd put into the system and ask the question: is it possible to come up with a single elegant principle by which a system can learn for itself all of the knowledge which it requires to play a game such as Go? Importantly, by taking knowledge out, you not only make the system less brittle, in the sense that perhaps the knowledge you were putting in was just getting in the way and maybe stopping the system learning for itself, but you also make it more general. The more knowledge you put in, the harder it is for a system to actually be taken out of the setting in which it's been designed and placed in some other setting that maybe would need a completely different knowledge base to understand and perform well. And so the real goal here is to strip out all of the knowledge that we put in, to the point that we can just plug it into something totally different. And that, to me, is really the promise of AI: that we can have systems such as that, where no matter what the goal is, no matter what goal we set to the system, we have an algorithm which can be placed into that world and can succeed in achieving that goal. And that, to me, is almost the essence of intelligence, if we can achieve that. And so AlphaZero is a step towards that, and it's a step that was taken in the context of two-player perfect-information games like Go and chess. We also applied it to Japanese chess.
So just to clarify, the first step was AlphaGo Zero.
The first step was to try and take all of the knowledge out of AlphaGo in such a way that it could play in a fully self-discovered way, purely from self-play. And to me, the motivation for that was always that we could then plug it into other
domains, but we saved that until later. Well, in fact, I mean, just for fun, I could tell you exactly the moment where the idea for AlphaZero occurred to me, because I think there's maybe a lesson there for researchers who are kind of too deeply embedded in their research and, you know, working 24/7 to try and come up with the next idea, which is that it actually occurred to me on honeymoon. I was in my most fully relaxed state, really enjoying myself, and the algorithm for AlphaZero just appeared in its full form. And this was actually before we played against Lee Sedol, but I think we were so busy trying to make sure we could beat the world champion that it was only later that we had the opportunity to step back and start examining that sort of deeper scientific question of whether this could really work.
So, nevertheless, self-play is probably one of the most profound ideas that represents, to me at least, artificial intelligence. But the fact that you could use that kind of mechanism to again beat world-class players, that's very surprising. To me it feels like you have to train on a large number of expert games. So was it surprising to you? What was the intuition? Can you sort of think, not necessarily at that time, even now, what's your intuition why this thing works so well, why it was able to learn from scratch?
Well, let me first say why we tried it. We tried it both because I feel that it was the deeper scientific question to be asking to make progress towards AI, and also because, in general, in my research I don't like to do research on questions for which we already know the likely outcome. I don't see much value in running an experiment where you're 95% confident that you will succeed. And so we could have tried, you know, maybe to take AlphaGo and do something which we knew for sure it would succeed on, but much more interesting to me was to try it on the things which we
weren't sure about. And one of the big questions on our minds back then was, you know, could you really do this with self-play alone? How far could that go? Would it be as strong? And honestly, we weren't sure. It was 50/50, I think. If you'd asked me, I wasn't confident that it could reach the same level as these systems, but it felt like the right question to ask. And even if it had not achieved the same level, I felt that that was an important direction to be studying. And so then, lo and behold, it actually ended up outperforming the previous version of AlphaGo, and indeed was able to beat it by 100 games to zero.
So what's the intuition as to why? I think that the intuition to me is clear: whenever you have errors in a system, as we did in AlphaGo (AlphaGo suffered from these delusions; occasionally it would misunderstand what was going on in a position and misevaluate it), how can you remove all of these errors? Errors arise from many sources. For us, they were arising both from the human data we started from and also from the nature of the search and the nature of the algorithm itself. But the only way to address them in any complex system is to give the system the ability to correct its own errors. It must be able to correct them. It must be able to learn for itself when it's doing something wrong and correct for it. And so it seemed to me that the way to correct delusions was indeed to have more iterations of reinforcement learning, so that no matter where you start, you should be able to correct those errors, until it gets to play that out and understand: oh, well, I thought that I was going to win in this situation, but then I ended up losing. That suggests that I was misevaluating something, there's a hole in my knowledge, and now the system can correct for itself and understand how to do better. Now, if you take that same idea and trace it back all the way to the beginning, it should be able to
take you from no knowledge, from a completely random starting point, all the way to the highest levels of knowledge that you can achieve in a domain. And the principle is the same: if you bestow a system with the ability to correct its own errors, then it can take you from random to something slightly better than random, because it sees the stupid things that the random player is doing and it can correct them. And then it can take you from that slightly better system and understand what that's doing wrong, and it takes you on to the next level, and the next level, and this progress can go on indefinitely. And indeed, you know, what would have happened if we'd carried on training AlphaGo Zero for longer? We saw no sign of it slowing down its improvements, or at least it was certainly carrying on improving. And presumably, if you had the computational resources, this could lead to better and better systems that discover more and more.
So your intuition is, fundamentally, there's not a ceiling to this process. One of the surprising things, just like you said, is the process of patching errors. It intuitively makes sense that reinforcement learning should be part of that process, but what is surprising is that in the process of patching your own lack of knowledge, you don't open up other patches. It's almost like there's a monotonic decrease of your weaknesses.
Well, let me back this up. You know, I think science always should make falsifiable hypotheses. So let me back up this claim with a falsifiable hypothesis, which is that if someone was to, in the future, take AlphaZero as an algorithm and run it with greater computational resources than we had available today, then I predict that they would be able to beat the previous system 100 games to zero. And that if they were then to do the same thing a couple of years later, that would beat that previous system 100 games to zero, and that that process would continue
indefinitely, throughout at least my human lifetime.
Presumably the game of Go would set the ceiling.
I mean, the game of Go would set the ceiling, but the game of Go has 10^170 states in it, so the ceiling is unreachable by any computational device that can be built out of the, you know, 10^80 atoms in the universe. You asked a really good question, which is, you know, do you not open up other errors when you correct your previous ones? And the answer is yes, you do. And so it's a remarkable fact about this class of two-player game, and also true of single-agent games, that essentially progress will always lead you to, if you have sufficient representational resource, like imagine you could represent every state of the game in a big table, then we know for sure that a process of self-improvement will lead all the way, in the single-agent case, to the optimal possible behavior, and in the two-player case, to the minimax optimal behavior. That is, the best way that I can play knowing that you're playing perfectly against me. And so for those cases, we know that even if you do open up some new error, in some sense you've made progress. You're progressing towards the best that can be done.
So AlphaGo was initially trained on expert games with some self-play. AlphaGo Zero removed the need to be trained on expert games. And then another incredible step, for me, because I just love chess, is to generalize that further in AlphaZero, to be able to play the game of Go, beating AlphaGo Zero and AlphaGo, and then also being able to play the game of chess and others. So what was that step like? What are the interesting aspects there that were required to make that happen?
I think the remarkable observation which we saw with AlphaZero was that actually, without modifying the algorithm at all, it was able to play and crack some of AI's greatest previous challenges. In particular, we dropped it into the game of chess and
unlike the previous systems like Deep Blue, which had been worked on for, you know, years and years, we were able to beat the world's strongest computer chess program convincingly, using a system that was fully discovered on its own, from scratch, with its own principles. And in fact, one of the nice things that we found was that we also achieved the same result in Japanese chess, a variant of chess where you get to capture pieces and then place them back down on your own side as an extra piece, so a much more complicated variant of chess. And we also beat the world's strongest programs and reached superhuman performance in that game too. And the very first time that we'd ever run the system on that particular game was the version that we published in the paper on AlphaZero. It just worked out of the box, literally no touching it. We didn't have to do anything, and there it was: superhuman performance, no tweaking, no twiddling. And so I think there's something beautiful about that principle, that you can take an algorithm, and without twiddling anything, it just works.
Now, to go beyond AlphaZero, what's required? AlphaZero is just a step, and there's a long way to go beyond that to really crack the deep problems of AI. But one of the important steps is to acknowledge that the world is a really messy place. You know, it's this rich, complex, beautiful, but messy environment that we live in, and no one gives us the rules. Like, no one knows the rules of the world. At least, maybe we understand that it operates according to Newtonian or quantum mechanics at the micro level, or according to relativity at the macro level, but that's not a model that's useful for us as people to operate in it. Somehow the agent needs to understand the world for itself, in a way where no one tells it the rules of the game, and yet it can still figure out what to do in that world, deal with this stream of observations coming in, rich sensory input
coming in, actions going out, in a way that allows it to reason in the way that AlphaGo or AlphaZero can reason, in the way that these Go and chess-playing programs can reason, but in a way that allows it to take actions in that messy world to achieve its goals. And so this led us to the most recent step in the story of AlphaGo, which was a system called MuZero. MuZero is a system which learns for itself even when the rules are not given to it. It actually can be dropped into a system with messy perceptual inputs. We actually tried it in some Atari games, the canonical domains of Atari that have been used for reinforcement learning, and this system learned to build a model of these Atari games that was sufficiently rich and useful for it to be able to plan successfully. And in fact, that system not only went on to beat the state of the art in Atari, but the same system, without modification, was able to reach the same level of superhuman performance in Go, chess, and shogi that we'd seen in AlphaZero, showing that even without the rules, the system can learn for itself, just by trial and error, just by playing this game of Go. No one tells you what the rules are, but you just get to the end and someone says, you know, win or loss. Or you play a game of Breakout in Atari, and someone just tells you your score at the end. And the system for itself figures out essentially the rules of the system, the dynamics of the world, how the world works, and not in any explicit way, but just implicitly, enough understanding for it to be able to plan in that system in order to achieve its goals.
And that's the fundamental process you have to go through when you're facing any uncertain kind of environment, like you would in the real world: figuring out sort of the basic rules of the game.
That's right.
So, I mean, that allows it to be applicable to basically
any domain that could be digitized in the way that it needs to be in order to be consumable, sort of, in order for the reinforcement learning framework to be able to sense the environment, to be able to act in it, and so on.
The full reinforcement learning problem needs to deal with worlds that are unknown and complex, and the agent needs to learn for itself how to deal with that. And so MuZero was, I felt, a step in that direction.
One of the things that inspired the general public, just in conversations I have, like with my parents or something, with my mom, who just loves what was done, is at least the notion that there was some display of creativity, some new strategies, new behaviors that were created. That again has echoes of intelligence. So is there something that stands out? Do you see it the same way, that there's creativity, and are there some behaviors, patterns you saw that AlphaZero was able to display, that are truly creative?
So let me start by saying that I think we should ask what creativity really means. To me, creativity means discovering something which wasn't known before, something unexpected, something outside of our norms. And so in that sense, the process of reinforcement learning, or the self-play approach that was used by AlphaZero, is the essence of creativity. It's really saying: at every stage, you're playing according to your current norms, and you try something, and if it works out, you say, hey, here's something great, I'm going to start using that. And then that process is like a micro-discovery that happens millions and millions of times over the course of the algorithm's life, where it just discovers some new idea: oh, this pattern, this pattern's working really well for me, I'm going to start using that. Oh, now here's this other thing I can do. I can start to connect these stones together in this way, or I can start to, you know, sacrifice stones, or give up on pieces, or play shoulder hits on the fifth line, or
whatever it is. The system is discovering things like this for itself, continually, repeatedly, all the time. And so it should come as no surprise to us, then, that if you leave these systems going, they discover things that are not known to humans, that by human norms are considered creative. And we've seen this several times. In fact, in AlphaGo Zero we saw this beautiful timeline of discovery, where what we saw was that there are these opening patterns that humans play, called joseki. These are like the patterns that humans learn to play in the corners, and they've been developed and refined over literally thousands of years in the game of Go. And what we saw was, in the course of the training of AlphaGo Zero, over the course of the 40 days that we trained this system, it started to discover exactly these patterns that human players play. And over time we found that all of the joseki that humans played were discovered by the system through this process of self-play and this sort of essential notion of creativity. But what was really interesting was that over time, it then started to discard some of these in favor of its own joseki that humans didn't know about. It starts to say, oh, well, you thought that the knight's move pincer joseki was a great idea, but here's something different you can do there, which makes some new variation that the humans didn't know about. And actually, now the human Go players study the joseki that AlphaGo played, and they become the new norms that are used in today's top-level Go competitions.
That never gets old. Even just the first part, to me, maybe just makes me feel good as a human being: that a self-play mechanism that knows nothing about us humans discovers patterns that we humans do. It's just like an affirmation that we're doing okay as humans. In this domain and in other domains, we figured it out. It's like the Churchill quote about democracy: it's the, you know, it sucks, but it's the best one we've tried. So in general, taking a step
outside of Go, and you have like a million accomplishments that I have no time to talk about, with AlphaStar and so on, and the current work, but in general, this self-play mechanism that you've inspired the world with by beating the world champion Go player: do you see that being applied in other domains? Do you have sort of dreams and hopes that it's applied in both the simulated environments and the constrained environments of games? Constrained, I mean, AlphaStar really demonstrates that you can remove a lot of the constraints, but nevertheless it's in a digital, simulated environment. Do you have a hope, a dream, that it starts being applied in the robotics environment, and maybe even in domains that are a little safety-critical and so on, and has a real impact in the real world, like autonomous vehicles, for example? It seems like a very far-out dream at this point.
So I absolutely do hope and imagine that we will get to the point where ideas just like these are used in all kinds of different domains. In fact, one of the most satisfying things as a researcher is when you start to see other people use your algorithms in unexpected ways. So in the last couple of years there have been, you know, a couple of Nature papers where different teams, unbeknownst to us, took AlphaZero and applied exactly those same algorithms and ideas to real-world problems of huge meaning to society. So one of them was the problem of chemical synthesis, and they were able to beat the state of the art in finding pathways of how to actually synthesize chemicals, retrosynthesis. And the second paper, which actually just came out a couple of weeks ago in Nature, showed that in quantum computation, you know, one of the big questions is how to understand the nature of the function in quantum computation, and a system based on AlphaZero beat the state of the art by quite some distance there again. So these are just examples, and I think, you know, the
lesson which we've seen elsewhere in machine learning, time and time again, is that if you make something general, it will be used in all kinds of ways. You know, you provide really powerful tools to society, and those tools can be used in amazing ways. And so I think we're just at the beginning, and for sure I hope that we see all kinds of outcomes.
So the other side of the question of the reinforcement learning framework is, you know, you usually want to specify a reward function, an objective function. What do you think about sort of ideas of intrinsic rewards? If we take, you know, human beings as an existence proof, we don't seem to be operating according to a single reward. Do you think that there are interesting ideas for when you don't know how to truly specify the reward, you know, that there's some flexibility for discovering it intrinsically, or so on, in the context of reinforcement learning?
So I think, you know, when we think about intelligence, it's really important to be clear about the problem of intelligence. And I think it's clearest to understand that problem in terms of some ultimate goal that we want the system to try and solve for. After all, if we don't understand the ultimate purpose of the system, do we really even have a clearly defined problem that we are solving at all? Now, within that, as with your example for humans, the system may choose to create its own motivations and subgoals that help the system to achieve its ultimate goal, and that may indeed be a hugely important mechanism to achieve those ultimate goals. But there is still some ultimate goal, I think, that the system needs to be measured and evaluated against. And even for humans, I mean, humans are incredibly flexible. We feel that, you know, any goal that we're given, we feel we can master to some degree. But if we think of those goals, really, you know, like the goal of being able to pick up an object, or the goal
of being able to communicate, or to influence people to do things in a particular way, or whatever those goals are, really they're subgoals that we set ourselves. You know, we choose to pick up the object, we choose to communicate, we choose to influence someone else, and we choose those because we think it will lead us to something later on. We think that that's helpful to us to achieve some ultimate goal. Now, I don't want to speculate whether or not humans as a system necessarily have a singular overall goal of survival or whatever it is, but I think the principle for understanding and implementing intelligence has to be that if we're trying to understand intelligence or implement our own, there has to be a well-defined problem. Otherwise, if it's not, I think it's like an admission of defeat. For there to be hope for understanding or implementing intelligence, we have to know what we're doing. We have to know what we're asking the system to do. Otherwise, if you don't have a clearly defined purpose, you're not going to get a clearly defined answer.
The ridiculous big question that has to naturally follow, because I have to pin you down on this thing: nevertheless, one of the big silly, or big real, questions before us humans is the meaning of life, which is us trying to figure out our own reward function. And you just kind of mentioned that if you want to build intelligent systems and you know what you're doing, you should be at least cognizant to some degree of what the reward function is. So the natural question is: what do you think is the reward function of human life, the meaning of life for us humans, the meaning of our existence?
I think, you know, I'd be speculating beyond my own expertise, but just for fun, let me do that.
Yes, please.
And say, I think that there are many levels at which you can understand a system, and you can understand something as optimizing for a goal at many levels. And so you can
understand the, you know, let's start with the universe. Like, does the universe have a purpose? Well, it feels like, at one level, it's just following certain mechanical laws of physics, and that's led to the development of the universe. But at another level, you can view it as, actually, there's the second law of thermodynamics, which says that this is increasing in entropy over time, forever. And now there's a view that's been developed by certain people at MIT that you can think of this as almost like a goal of the universe, that the purpose of the universe is to maximize entropy. So there are multiple levels at which you can understand a system.
The next level down, you might say, well, if the goal is to maximize entropy, well, how can that be done by a particular system? And maybe evolution is something that the universe discovered in order to kind of dissipate energy as efficiently as possible. And by the way, I'm borrowing from Max Tegmark for some of these metaphors, yes, the physicist. But if you can think of evolution as a mechanism for dispersing energy, then evolution, you might say, then becomes a goal, which is: if evolution disperses energy by reproducing as efficiently as possible, what's evolution then? Well, it's now got its own goal within that, which is to actually reproduce as effectively as possible. And now, how is reproduction made as effective as possible? Well, you need entities within that that can survive and reproduce as effectively as possible. And so it's natural that, in order to achieve that high-level goal, those individual organisms discover brains, intelligences, which enable them to support the goals of evolution.
And those brains, what do they do? Well, perhaps the early brains, maybe they were controlling things at some direct level. You know, maybe they were the equivalent of pre-programmed systems which were directly controlling what was going on and setting certain, you know, things in order to achieve
these particular goals. But that led to another level of discovery, which was learning systems, you know, parts of the brain which were able to learn for themselves and learn how to program themselves to achieve any goal. And presumably there are parts of the brain where goals are set to parts of that system, and this provides this very flexible notion of intelligence that we as humans presumably have, which is kind of the reason we feel that we can achieve any goal. So it's a very long-winded answer to say that, you know, I think there are many perspectives and many levels at which intelligence can be understood, and at each of those levels you can take multiple perspectives. You can view the system as something which is optimizing for a goal, which is understanding it at a level by which we can maybe implement it and understand it as AI researchers or computer scientists. Or you can understand it at the level of the mechanistic thing which is going on, that there are these, you know, atoms bouncing around in the brain, and they lead to the outcome of that system. That is not in contradiction with the fact that it's also a decision-making system that's optimizing for some goal and purpose.
I've never heard the description of the meaning of life structured so beautifully in layers. But you did miss one layer, which is the next step, which you're responsible for, which is creating the artificial intelligence layer on top of that. And I can't wait to see, well, I may not be around, but I can't wait to see what the next layer beyond that is.
Well, let's just take that argument, you know, and pursue it to its natural conclusion. So the next level indeed is: how can our learning brain achieve its goals most effectively? Well, maybe it does so by us, as learning beings, building a system which is able to solve for those goals more effectively than we can. And so when we
build a system to play the game of Go, you know, when I said that I wanted to build a system that can play Go better than I can, I've enabled myself to achieve that goal of playing Go better than I could by directly playing it and learning it myself. And so now a new layer has been created, which is systems which are able to achieve goals for themselves. And ultimately there may be layers beyond that, where they set sub-goals to parts of their own system in order to achieve those, and so forth. So, incredible. So the story of intelligence, I think, is a multi-layered one and a multi-perspective one.
We live in an incredible universe. David, thank you so much, first of all, for dreaming of using learning to solve Go and building intelligent systems, and for actually making it happen, and for inspiring millions of people in the process. It's truly an honor. Thank you so much for talking today.
Okay, thank you.
Thanks for listening to this conversation with David Silver, and thank you to our sponsors, MasterClass and Cash App. Please consider supporting the podcast by signing up to MasterClass at masterclass.com/lex and downloading Cash App and using code LexPodcast. If you enjoy this podcast, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at @lexfridman. And now let me leave you with some words from David Silver: "My personal belief is that we've seen something of a turning point, where we're starting to understand that many abilities, like intuition and creativity, that we've previously thought were in the domain only of the human mind are actually accessible to machine intelligence as well. And I think that's a really exciting moment in history." Thank you for listening, and hope to see you next time.
The following is a conversation with David Silver, who leads the reinforcement learning research group at DeepMind and was the lead researcher on AlphaGo and AlphaZero, and co-led the AlphaStar and MuZero efforts, and a
lot of important work in reinforcement learning in general. I believe AlphaZero is one of the most important accomplishments in the history of artificial intelligence, and David is one of the key humans who brought AlphaZero to life, together with a lot of other great researchers at DeepMind. He's humble, kind, and brilliant. We were both jet-lagged but didn't care and made it happen. It was a pleasure and truly an honor to talk with David.
This conversation was recorded before the outbreak of the pandemic. For everyone feeling the medical, psychological, and financial burden of this crisis, I'm sending love your way. Stay strong. We're in this together. We'll beat this thing.
This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at @lexfridman, spelled F-R-I-D-M-A-N. As usual, I'll do a few minutes of ads now and never any ads in the middle that can break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience.
Quick summary of the ads: two sponsors, MasterClass and Cash App. Please consider supporting the podcast by signing up to MasterClass at masterclass.com/lex and downloading Cash App and using code LexPodcast.
This show is presented by Cash App, the number one finance app in the App Store. When you get it, use code LexPodcast. Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with as little as one dollar. Since Cash App allows you to buy Bitcoin, let me mention that cryptocurrency in the context of the history of money is fascinating. I recommend The Ascent of Money as a great book on this history. Debits and credits on ledgers started around 30,000 years ago, the US dollar was created over two hundred years ago, and Bitcoin, the first decentralized cryptocurrency, was released just over ten years ago. So given that history, cryptocurrency is still very much in its early days of development, but
it's still aiming to, and just might, redefine the nature of money. So again, if you get Cash App from the App Store or Google Play and use the code LexPodcast, you get ten dollars, and Cash App will also donate ten dollars to FIRST, an organization that is helping to advance robotics and STEM education for young people around the world.
This show is sponsored by MasterClass. Sign up at masterclass.com/lex to get a discount and to support this podcast. In fact, for a limited time now, if you sign up for an all-access pass for a year, you get another all-access pass to share with a friend. Buy one, get one free. When I first heard about MasterClass, I thought it was too good to be true. For one hundred eighty dollars a year, you get an all-access pass to watch courses from, to list some of my favorites: Chris Hadfield on space exploration, Neil deGrasse Tyson on scientific thinking and communication, Will Wright, the creator of SimCity and The Sims, on game design, Jane Goodall on conservation, Carlos Santana on guitar (his song Europa could be the most beautiful guitar song ever written), Garry Kasparov on chess, Daniel Negreanu on poker, and many, many more. Chris Hadfield explaining how rockets work and the experience of being launched into space alone is worth the money. For me, the key is to not be overwhelmed by the abundance of choice. Pick three courses you want to complete and watch each of them all the way through. It's not that long, but it's an experience that will stick with you for a long time, I promise. It's easily worth the money. You can watch it on basically any device. Once again, sign up at masterclass.com/lex to get a discount and to support this podcast. And now, here's my conversation with David Silver.
What was the first program you've ever written, and what programming language? Do you remember?
I remember very clearly. My parents brought home this BBC Model B microcomputer. It was just this fascinating thing to me. I was about seven years old and couldn't resist just playing around
with it. So I think the first program ever was writing my name out in different colors and getting it to loop and repeat that. And there was something magical about that, which just led to more and more.
How did you think about computers back then? Like the magical aspect of it, that you can write a program and there's this thing that you just gave birth to that's able to create visual elements and live on its own. Or did you not think of it in those romantic notions? Was it more like, oh, that's cool, I can solve some puzzles?
It was always more than solving puzzles. It was something where there were these limitless possibilities. Once you have a computer in front of you, you can do anything with it. I used to play with Lego with the same feeling: you can make anything you want out of Lego. But even more so with a computer, you're not constrained by the amount of kit you've got. And so I was fascinated by it and started pulling out the user guide and the advanced user guide and then learning. So I started in BASIC and then later 6502. My father also became interested in this machine and gave up his career to go back to school and study for a master's degree in artificial intelligence, funnily enough, at Essex University when I was seven. So I was exposed to those things at an early age. He showed me how to program in Prolog and do things like querying your family tree, and those are some of my earliest memories of trying to figure things out on a computer.
Those are the early steps in computer science and programming, but when did you first fall in love with artificial intelligence, or with the ideas, the dreams of AI?
I think it was really when I went to study at university. So I was an undergrad at Cambridge, studying computer science, and I really started to question, you know, what really are the goals? What's the goal? Where do we want to go with computer science? And
it seemed to me that the only step of major significance to take was to try and recreate something akin to human intelligence. If we could do that, that would be a major leap forward. And that idea, I certainly wasn't the first to have it, but it nestled within me somewhere and became like a bug. I really wanted to crack that problem.
So you had a notion that this is something that human beings can do, that it is possible to create an intelligent machine?
Well, I mean, unless you believe in something metaphysical, then what are our brains doing? Well, at some level they're information processing systems, which are able to take whatever information is in there, transform it through some form of program, and produce some kind of output which enables that human being to do all the amazing things that they can do in this incredible world.
So then do you remember the first time you've written a program that, because you also had an interest in games, do you remember the first time you wrote a program that beat you in a game, or beat you at anything, sort of achieved super-David-Silver-level performance?
So I used to work in the games industry. For five years I programmed games for my first job. It was an amazing opportunity to get involved in a startup company, and so I was involved in building AI at that time. And so for sure there was a sense of building handcrafted, what people used to call AI in the games industry, which I think is not really what we might think of as AI in its fullest sense, but something which is able to take actions in a way which makes things interesting and challenging for the human player. And at that time I was able to build these handcrafted agents, which in certain limited cases could do things better than me, but mostly in these kind of twitch-like scenarios where they were able to do things faster, or because they had some
pattern which they were able to exploit repeatedly. I think if we're talking about real AI, the first experience for me came after that, when I realized that this path I was on wasn't taking me towards it. It wasn't dealing with that bug which I still had inside me, to really understand intelligence and try and solve it. Everything people were doing in games was short-term fixes rather than long-term vision. And so I went back to study for my PhD, which was, funnily enough, trying to apply reinforcement learning to the game of Go. And I built my first Go program using reinforcement learning, a system which would, by trial and error, play against itself and was able to learn which patterns were actually helpful to predict whether it was going to win or lose the game, and then choose the moves that led to the combination of patterns that would mean that you're more likely to win. And that system beat me.
How did that make you feel?
It made me feel good.
Was it sort of a mix of excitement, and was there a tinge of almost like a fearful awe? You know, it's like in 2001: A Space Odyssey, kind of realizing that you've created something that's achieved human-level intelligence in this one particular little task. And in that case, I suppose neural networks weren't involved?
There were no neural networks in those days. This was pre-deep-learning revolution. But it was a principled self-learning system based on a lot of the principles which people are still using in deep reinforcement learning. How did I feel? I think I found it immensely satisfying that a system which was able to learn from first principles for itself was able to reach the point that it was understanding this domain better than I could and able to outwit me. I don't think it was a sense of awe. It was a sense of satisfaction that something I felt should work had worked.
So to me, AlphaGo,
and I don't know how else to put it, but to me AlphaGo and AlphaGo Zero mastering the game of Go is, again, to me, the most profound and inspiring moment in the history of artificial intelligence. So you're one of the key people behind this achievement, and I'm Russian, so I really felt the first sort of seminal achievement when Deep Blue beat Garry Kasparov in 1997. So as far as I know, the AI community at that point largely saw the game of Go as unbeatable in AI using the state-of-the-art brute-force search methods. Even if you consider, at least the way I saw it, arbitrary exponential scaling of compute, Go would still not be solvable, hence why it was thought to be impossible. So given that the game of Go was thought impossible to master, what was the dream for you? You just mentioned your PhD thesis of building the system that plays Go. What was the dream for you, that you could actually build a computer program that achieves world-class, not necessarily beats the world champion, but achieves that kind of level of playing Go?
First of all, thank you, those are very kind words. And funnily enough, I just came from a panel where I was actually in a conversation with Garry Kasparov and Murray Campbell, who was the author of Deep Blue, and it was their first meeting together since the match. That was just yesterday, so I'm literally fresh from that experience. So these are amazing moments when they happen, but where did it all start? Well, for me it started when I became fascinated in the game of Go. I've grown up playing games. I've always had a fascination with board games. I played chess as a kid, I played Scrabble as a kid. When I was at university I discovered the game of Go, and to me it just blew all of those other games out of the water. It was just so deep and profound in its complexity, with endless levels to it. What I discovered was that I could devote endless hours to this game, and I knew in my heart of hearts that no
matter how many hours I would devote to it, I would never become a grandmaster. Or there was another path, and the other path was to try and understand how you could get some other intelligence to play this game better than I would be able to. And so even in those days I had this idea: what if it was possible to build a program that could crack this? And as I started to explore the domain, I discovered that this was really the domain where people felt deeply that if progress could be made in Go, it would really mean a giant leap forward for AI. It was the challenge where all other approaches had failed. This is coming out of the era you mentioned, which was in some sense the golden era for the classical methods of AI, like heuristic search. In the 90s they all fell one after another, not just chess with Deep Blue, but checkers, backgammon, Othello. There were numerous cases where systems built on top of heuristic search methods, with these high-performance systems, had been able to defeat the human world champion in each of those domains. And yet in that same time period there was a million-dollar prize available for the game of Go, for the first system to beat a human professional player. And at the end of that time period, in the year 2000 when the prize expired, the strongest Go program in the world was defeated by a nine-year-old child, when that nine-year-old child was giving nine free moves to the computer at the start of the game to try and even things up.
Yeah.
And a computer Go expert beat that same strongest program with 29 handicap stones, 29 free moves. So that's what the state of affairs was when I became interested in this problem, in around 2000. And in 2003, when I started working on computer Go, there was very, very little in the way of progress towards meaningful performance, anything approaching human level. And so, people, it
wasn't through lack of effort. People had tried many, many things. And so there was a strong sense that something different would be required for Go than had been needed for all of these other domains where AI had been successful. And maybe the single clearest example is that Go, unlike those other domains, had this kind of intuitive property that a Go player would look at a position and say, hey, here's this mess of black and white stones, but from this mess I can predict that this part of the board has become my territory, this part of the board's become your territory, and I've got this overall sense that I'm going to win and that this is about the right move to play. And that intuitive sense of judgment, of being able to evaluate what's going on in a position, was pivotal to humans being able to play this game, and something that people had no idea how to put into computers. So this question of how to evaluate a position, how to come up with these intuitive judgments, was the key reason why Go was so hard, in addition to its enormous search space, and the reason why methods which had succeeded so well elsewhere failed in Go. And so people really felt deep down that in order to crack Go we would need to get something akin to human intuition, and if we got something akin to human intuition, we'd be able to solve many, many more problems in AI. So to me that was the moment where it's like, okay, this is not just about playing the game of Go, this is about something profound. And it was back to that bug which had been itching me all those years. This is the opportunity to do something meaningful and transformative, and I guess a dream was born.
That's a really interesting way to put it, almost this realization that you need to formulate Go as kind of a prediction problem versus a search problem. That was the intuition. I mean, maybe that's the wrong, crude term, but to give it the ability to kind of intuit
things about the positional structure of the board. Okay, but what about the learning part of it? Did you have a sense that learning has to be part of the system? Again, something that, as far as I think, except with TD-Gammon in the 90s, which was RL a little bit, hasn't been part of those state-of-the-art game-playing systems.
So I strongly felt that learning would be necessary, and that's why my PhD topic back then was trying to apply reinforcement learning to the game of Go. And not just learning of any type, but I felt that the only way to really have a system progress beyond human levels of performance wouldn't just be to mimic how humans do it, but to understand for itself. And how else can a machine hope to understand what's going on except through learning? If you're not learning, what else are you doing? Well, you're putting all the knowledge into the system, and that just feels like something which decades of AI have told us is maybe not a dead end, but certainly has a ceiling to the capabilities. It's known as the knowledge acquisition bottleneck: the more you try to put into something, the more brittle the system becomes. And so you just have to have learning. That's the only way you're going to be able to get a system which has sufficient knowledge in it, millions and millions of pieces of knowledge, billions, trillions, of a form that it can actually apply for itself, and understand how those billions and trillions of pieces of knowledge can be leveraged in a way which will actually lead it towards its goal without conflict or other issues.
Yeah, I mean, if I put myself back in that time, I just wouldn't think like that. Without a good demonstration of RL, I would think more in the symbolic AI way, not learning, but sort of a simulation of a knowledge base, like a growing knowledge base. But it would still be sort of pattern-based, like
basically have little rules that you kind of assemble together into a large knowledge base.
Well, in a sense that was the state of the art back then. So if you look at the Go programs which had been competing for this prize I mentioned, they were an assembly of different specialized systems, some of which used huge amounts of human knowledge to describe how you should play the opening, all the different patterns that were required to play well in the game of Go, endgame theory, combinatorial game theory, and combined with more principled search-based methods which were trying to solve for particular sub-parts of the game, like life and death, connecting groups together, all these amazing subproblems that just emerge in the game of Go. There were different pieces all put together into this collage, which together would try and play against a human. And although not all of the pieces were handcrafted, the overall effect was nevertheless still brittle, and it was hard to make all these pieces work well together. And so really what I was pressing for, and the main innovation of the approach I took, was to go back to first principles and say, well, let's back off that and try and find a principled approach where the system can learn for itself, just from the outcome. Learn for itself: if you try something, did that help or did it not help? And only through that procedure can you arrive at knowledge which is verified. The system has to verify it for itself, not relying on any other third party to say this is right or this is wrong. So that principle was already very important in those days, but unfortunately we were missing some important pieces back then.
So before we dive into maybe discussing the beauty of reinforcement learning, let's take a step back. We kind of skipped it a bit, but the rules of the game of Go: what are the elements of it, perhaps contrasting to chess, that you really enjoyed
as a human being, and also that make it really difficult as an AI machine learning problem?
So the game of Go has remarkably simple rules, in fact so simple that people have speculated that if we were to meet alien life at some point, we wouldn't be able to communicate with them, but we would be able to play Go with them, as they would probably have discovered the same rule set. So the game is played on a 19 by 19 grid, and you play on the intersections of the grid, and the players take turns. And the aim of the game is very simple: it's to surround as much territory as you can, as many of these intersections, with your stones, and to surround more than your opponent does. And the only nuance to the game is that if you fully surround your opponent's piece, then you get to capture it and remove it from the board, and it counts as your own territory. Now, from those very simple rules, immense complexity arises. There are kind of profound strategies in how to surround territory, how to trade off between making solid territory yourself now compared to building up influence that will help you acquire territory later in the game, how to connect groups together, how to keep your own groups alive, which patterns of stones are most useful compared to others. There's just immense knowledge. And human Go players have played this game, it was discovered thousands of years ago, and human Go players have built up this immense knowledge base over the years. It's studied very deeply and played by something like 50 million players across the world, mostly in China, Japan, and Korea, where it's an important part of the culture, so much so that it's considered one of the four ancient arts that was required of Chinese scholars. So there's a deep history there.
But there are interesting qualities. So if I sort of compare it to chess: in the same way as Go is in Chinese culture, chess in Russia is also considered one of the sacred arts. So if we contrast sort of Go with chess, there are
interesting qualities about Go. Maybe you can correct me if I'm wrong, but the evaluation of a particular static board is not as reliable. In chess you can kind of assign points to the different units, and it's kind of a pretty good measure of who's winning and who's losing. In Go it's not so clear.
Yeah, so in the game of Go you find yourself in a situation where both players have played the same number of stones. Actually, captures at a strong level of play happen very rarely, which means that at any moment in the game you've got the same number of white stones and black stones. And the only thing which differentiates how well you're doing is this intuitive sense of where the territories are ultimately going to form on this board. And if you look at the complexity of a real Go position, it's mind-boggling, that kind of question of what will happen in 300 moves from now when you see just a scattering of twenty white and black stones intermingled. And so that challenge is the reason why position evaluation is so hard in Go compared to other games. In addition to that, it has an enormous search space. So there's around 10 to the 170 positions in the game of Go. That's an astronomical number, and that search space is so great that traditional heuristic search methods that were so successful in things like Deep Blue and chess programs just kind of fall over in Go.
So at which point did reinforcement learning enter your life, your research life, your way of thinking? We just talked about learning, but reinforcement learning is a very particular kind of learning, one that's both philosophically sort of profound, but also one that's pretty difficult to get to work, if we look back at least at the early days. So when did that enter your life, and how did that work progress?
So I had just finished working in the games industry at this startup company, and I took a year out to discover for myself exactly
which path I wanted to take. I knew I wanted to study intelligence, but I wasn't sure what that meant at that stage. I really didn't feel I had the tools to decide on exactly which path I wanted to follow. So during that year I read a lot, and one of the things I read was Sutton and Barto, the sort of seminal textbook, an introduction to reinforcement learning. And when I read that textbook, I just had this resonating feeling that this is what I understood intelligence to be, and this was the path that I felt would be necessary to go down to make progress in AI. So I got in touch with Rich Sutton and asked him if he would be interested in supervising me on a PhD thesis in computer Go. And he basically said that if he's still alive, he'd be happy to. Unfortunately, he'd been struggling with very serious cancer for some years, and he really wasn't confident at that stage that he'd even be around to see the end of it. But fortunately that part of the story worked out very happily, and I found myself out there in Alberta. They've got a great games group out there with a history of fantastic work in board games, as well as Rich Sutton, the father of RL. So it was the natural place for me to go, in some sense, to study this question. And the more I looked into it, the more strongly I felt that this wasn't just the path to progress in computer Go, but really this was the thing I'd been looking for. This was really an opportunity to frame what intelligence means, like what are the goals of AI, in a single clear problem definition, such that if we're able to solve that single clear problem definition, in some sense we've cracked the problem of AI.
So to you, reinforcement learning ideas, at least sort of echoes of them, would be at the core of intelligence? It is at the core of intelligence, and if we ever create a human-level intelligence system, it would be at the core of that kind of system?
Let me say it this way, that I
think it's helpful to separate out the problem from the solution. So I see the problem of intelligence, I would say it can be formalized as the reinforcement learning problem, and that formalization is enough to capture most, if not all, of the things that we mean by intelligence, that they can all be brought within this framework, and it gives us a way to access them in a meaningful way that allows us as scientists to understand intelligence and us as computer scientists to build them. And so in that sense I feel that it gives us a path, maybe not the only path, but a path towards AI. And so do I think that any system in the future that's solved AI would have to have RL within it? Well, I think if you ask that, you're asking about the solution methods. I would say that if we have such a thing, it would be a solution to the RL problem. Now, what particular methods have been used to get there? Well, we should keep an open mind about the best approaches to actually solve any problem. And the things we have right now for reinforcement learning, I believe they've got a lot of legs, but maybe we're missing some things. Maybe there's going to be better ideas. I think we should remain modest. We're at the early days of this field, and there are many amazing discoveries ahead of us.
For sure, the specifics, especially of the different kinds of RL approaches currently. There could be other things that fall under the very large umbrella of RL. But if it's okay, can we take a step back and kind of ask the basic question of what is, to you, reinforcement learning?
So reinforcement learning is the study and the science and the problem of intelligence in the form of an agent that interacts with an environment. So the problem it's trying to solve is represented by some environment, like the world in which that agent is situated. And the goal of RL is clear: the agent gets to take actions, those
actions have some effect on the environment, and the environment gives back an observation to the agent saying, this is what you see or sense. And one special thing which it gives back is called the reward signal: how well it's doing in the environment. And the reinforcement learning problem is to simply take actions over time so as to maximize that reward signal.
So a couple of basic questions. What types of RL approaches are there? I don't know if there's a nice, brief, in-words way to paint the picture of sort of value-based, model-based, policy-based reinforcement learning.
Yeah, so now if we think about, okay, so there's this ambitious problem definition of RL. It's truly ambitious. It's trying to capture and encircle all of the things in which an agent interacts with an environment and say, well, how can we formalize and understand what it means to crack that? Now let's think about the solution method. Well, how do you solve a really hard problem like that? Well, one approach you can take is to decompose that very hard problem into pieces that work together to solve that hard problem. And so you can kind of look at the decomposition that's inside the agent's head, if you like, and ask, well, what form does that decomposition take? And some of the most common pieces that people use when they're putting this solution method together are these. First, whether or not that solution has a value function: that means, is it explicitly trying to predict how much reward it will get in the future? Second, does it have a representation of a policy: that means something which is deciding how to pick actions. Is that decision-making process explicitly represented? And third, is there a model in the system: is there something which is explicitly trying to predict what will happen in the environment? And so those three pieces are, to me, some of the most common building blocks, and I understand the
different choices in RL as choices of whether or not to use those building blocks when you're trying to decompose the solution. Should I have a value function represented? Should I have a policy represented? Should I have a model represented? And there are combinations of those pieces, and of course other things that you could add into the picture as well, but those three fundamental choices give rise to some of the branches of RL with which we're very familiar.
And so, as you mentioned, there is the choice of what's specified or modeled explicitly, and the idea is that all of these are somehow implicitly learned within the system. So it's almost a choice of how you approach a problem. Do you see those as fundamental differences, or are these almost like small specifics, like the details of how you solve the problem, but they're not fundamentally different from each other?
I think the fundamental idea is maybe at the higher level. The fundamental idea is the first step of the decomposition, which is really to say, well, how are we really going to solve any kind of problem where you're trying to figure out how to take actions just from a stream of observations? You've got some agent situated in its sensorimotor stream, getting all these observations in and getting to take these actions, and what should it do? How can you even broach that problem? Maybe the complexity of the world is so great that you can't even imagine how to build a system that would understand how to deal with that. And so the first step of this decomposition is to say, well, the system has to learn for itself. And so note that the reinforcement learning problem doesn't actually stipulate that you have to learn. You could maximize your rewards without learning. It just wouldn't do a very good job of it. So learning is required because it's the only way to achieve good performance in any sufficiently large and complex environment.
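The agent-environment loop and two of the building blocks described above (a value function, and a policy derived from it) can be illustrated with a minimal Python sketch. This is not code from the conversation or from any DeepMind system. The `Corridor` environment, the `train` and `greedy_policy` functions, and all parameter values are hypothetical, chosen only for illustration, and tabular Q-learning is swapped in as one simple, well-known value-based method. No model of the environment is learned here.

```python
import random


class Corridor:
    """Hypothetical toy environment: states 0..4, start at 0, reward 1.0 on reaching state 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        # The environment hands back an observation, a reward signal, and a done flag.
        return self.state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: learn a value function purely from trial and error."""
    rng = random.Random(seed)
    env = Corridor()
    # Value function: q[state][action] estimates the discounted future reward.
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Policy: epsilon-greedy on the value function, random when values tie.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env.step(action)
            # Learning: move the estimate towards reward plus discounted next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Read off the action with the highest estimated value in each state."""
    return [0 if left > right else 1 for left, right in q]
```

Under these assumptions, calling `greedy_policy(train())` should pick the move-right action in every non-terminal state, since that is the only way to reach the reward. Here the policy is implicit in the value function. A policy-based method would instead parameterize and update the action choice directly, and a model-based method would also learn to predict what `step` returns.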
that's the first step, and that step gives commonality to all of the other pieces, because now you might ask, well, what should you be learning? What does learning even mean in this sense? Learning might mean you're trying to update the parameters of some system, which is then the thing that actually picks the actions. Those parameters could be representing anything: they could be parameterizing a value function, or a model, or a policy. In that sense there's a lot of commonality, in that whatever is being represented there is the thing which is being learned, and it's being learned with the ultimate goal of maximizing rewards. But the way in which you decompose the problem is really what gives the semantics to the whole system. Are you trying to learn something to predict well, like a value function or a model? Are you learning something to perform well, like a policy? The form of that objective is giving the semantics to the system. And so it really is, at the next level down, a fundamental choice, and we have to make those fundamental choices as system designers, or enable our algorithms to learn how to make those choices for themselves.
So then the next step. You mentioned the very first thing you have to deal with is, can you even take in this huge stream of observations and do anything with it? So the natural next basic question is, what is deep reinforcement learning, and what is this idea of using neural networks to deal with this huge incoming stream?
So amongst all the approaches for reinforcement learning, deep reinforcement learning is one family of solution methods that tries to utilize the powerful representations offered by neural networks to represent any of these different components of the solution, of the agent, whether it's the value function, or the model, or the policy. The idea of deep learning is to say, well, here's a powerful toolkit that's so powerful that it's
universal in the sense that it can represent any function and it can learn any function. And so if we can leverage that universality, that means that whatever we need to represent, for our policy or for our value function or for a model, deep learning can do it. So deep learning is one approach that offers us a toolkit that has no ceiling to its performance: as we start to put more resources into the system, more memory and more computation and more data, more experience, more interactions with the environment, these are systems that can just get better and better and better at doing whatever job we've asked them to do. Whatever we've asked that function to represent, it can learn a function that does a better and better job of representing that knowledge, whether that knowledge be estimating how well you're going to do in the world (the value function), choosing what to do in the world (the policy), or understanding the world itself, what's going to happen next (the model).
Nevertheless, the fact that neural networks are able to learn incredibly complex representations that allow you to do the policy, the model, or the value function is, at least to my mind, exceptionally beautiful and surprising. Was it surprising to you? Can you still believe it works as well as it does? Do you have good intuition about why it works at all, and works as well as it does?
Let me take two parts to that question. I think it's not surprising to me that the idea of reinforcement learning works, because in some sense I feel it's the only thing which can, ultimately. And so I feel we have to address it, and success must be possible, because we have examples of intelligence. It must at some level be possible to acquire experience and use that experience to do better in a way which is meaningful in environments of the complexity that humans can deal
with. It must be. Am I surprised that our current systems can do as well as they can do? I think one of the big surprises, for me and a lot of the community, is really the fact that deep learning can continue to perform so well despite the fact that these neural networks have these incredibly nonlinear, kind of bumpy surfaces, which to our low-dimensional intuitions make it feel like surely you're just going to get stuck, and learning will get stuck because you won't be able to make any further progress. And yet the big surprise is that learning continues, and these what appear to be local optima turn out not to be, because in high dimensions, when we make really big neural nets, there's always a way out, and there's a way to go even lower. And then you're still not in another local optimum, because there's some other pathway that will take you out and take you lower still. And so no matter where you are, learning can proceed and do better and better and better without bound. That is a surprising and beautiful property of neural nets, which I find elegant and somewhat shocking that it turns out to be the case.
As you said, which I really like: to our low-dimensional intuitions, that's surprising.
Yeah, we're very tuned to working within a three-dimensional environment, and so to start to visualize what a billion-dimensional neural network surface that you're trying to optimize over even looks like is very hard for us. And so I think that if you try to account for, essentially, the AI winter, where people gave up on neural networks, I think it's really down to that lack of ability to generalize from low dimensions to high dimensions. Because back then we were in the low-dimensional case: people could only build neural nets with, you know, 50 nodes in them or something. And to imagine that it might be possible to build a billion-dimensional neural net, and that it might have a
completely different, qualitatively different property, was very hard to anticipate. And I think even now we're starting to build the theory to support that, and it's incomplete at the moment, but all of the theory seems to be pointing in the direction that indeed this is an approach which truly is universal, both in its representational capacity, which was known, but also in its learning ability, which is surprising.
And it makes one wonder what else we're missing due to our low-dimensional intuitions that will seem obvious once it's discovered.
I often wonder, when we one day do have AIs which are superhuman in their abilities to understand the world, what will they think of the algorithms that we developed back now? Will we look back and feel that these algorithms were naive first steps, or will they still be the fundamental ideas which are used even in a hundred, a thousand, ten thousand years?
Yeah, they'll watch back to this conversation with a smile, maybe a little bit of a laugh. My sense is, just like when we used to think that the Sun revolved around the Earth, they'll see our systems of today in reinforcement learning as too complicated, that the answer was simple all along. There's something, just like you said in the game of Go. I mean, I love those systems like cellular automata, where there are simple rules from which incredible complexity emerges. So it feels like there might be some very simple approaches, just like Rich Sutton says, right? These simple methods with compute over time seem to prove to be the most effective.
I 100% agree. I think that if we try to anticipate what will generalize well into the future, it's likely to be the case that it's the simple, clear ideas which will have the longest legs and which will carry us farthest into the future. Nevertheless, we're in a situation where we need to
make things work today, and sometimes that requires putting together more complex systems where we don't have the full answers yet as to what those minimal ingredients might be.
So speaking of which, if we could take a step back to Go: what was MoGo, and what was the key idea behind the system?
So back during my PhD on computer Go, around about that time, there was a major new development which actually happened in the context of computer Go, and it was really a revolution in the way that heuristic search was done. The idea was essentially that a position could be evaluated, or a state in general could be evaluated, not by humans saying whether that position is good or not, or even humans providing rules as to how you might evaluate it, but instead by allowing the system to randomly play out the game until the end multiple times and taking the average of those outcomes as the prediction of what will happen. So for example, in the game of Go, the intuition is that you take a position and you get the system to play random moves against itself all the way to the end of the game, and you see who wins. If black ends up winning more of those random games than white, you say, hey, this is a position that favors black, and if white ends up winning more of those random games than black, then it favors white. So that idea was known as Monte Carlo search, and a particular form of Monte Carlo search that became very effective, and was developed in computer Go first by Rémi Coulom in 2006 and then taken further by others, was something called Monte Carlo tree search, which basically takes that same idea and uses that insight to evaluate every node of a search tree: each node is evaluated by the average of the random playouts from that node onwards. This idea was very powerful and suddenly led to huge leaps forward in the strength of computer Go playing programs. And among those, the strongest of the Go playing programs in those days
was a program called MoGo, which was the first program to actually reach human master level on small boards, nine-by-nine boards. This was a program by someone called Sylvain Gelly, who's a good colleague of mine, and I worked with him a little bit in those days as part of my PhD thesis. MoGo was a first step towards the later successes we saw in computer Go, but it was still missing a key ingredient. MoGo was evaluating purely by random rollouts against itself, and in a way it's truly remarkable that random play gives you anything at all. Why, in this perfectly deterministic game that's very precise and involves these very exact sequences, why is it that randomization is helpful? The intuition is that randomization captures something about the nature of the search tree: from a position, you're understanding the nature of the search tree from that node onwards by using randomization. And this was a very powerful idea.
And I've seen this in other spaces, I talked to Richard Karp and so on. Randomized algorithms somehow magically are able to do exceptionally well, and simplifying the problem somehow. It makes you wonder about the fundamental nature of randomness in our universe; it seems to be a useful thing. But so from that moment, can you maybe tell the origin story and the journey of AlphaGo?
Yeah. So programs based on Monte Carlo tree search were a first revolution, in the sense that they led to suddenly programs that could play the game to any reasonable level, but they plateaued. It seemed that no matter how much effort people put into these techniques, they couldn't exceed the level of amateur dan-level Go players. So strong players, but not anywhere near the level of professionals, never mind the world champion. And so that brings us to the birth of AlphaGo, which happened in the context of a startup company known as DeepMind, where a project was born. And the project was really a
scientific investigation where myself, Aja Huang, and an intern, Chris Maddison, were exploring a scientific question. And that scientific question was really: is there another, fundamentally different approach to this key question of Go, the key challenge of how can you build that intuition? How can you have a system that could look at a position and understand what move to play, or how well you're doing in that position, who's going to win? The deep learning revolution had just begun. Competitions like ImageNet had suddenly been won by deep learning techniques back in 2012, and following that it was natural to ask, well, if deep learning is able to scale up so effectively with images, to understand them enough to classify them, why not Go? Why not take the black and white stones of the Go board and build a system which can understand for itself what that means in terms of what move to pick, or who's going to win the game, black or white? And so that was our scientific question, which we were probing and trying to understand. As we started to look at it, we discovered that we could build such a system. In fact, our very first paper on AlphaGo was actually a pure deep learning system which was trying to answer this question, and we showed that a pure deep learning system with no search at all was able to reach human dan level, master level, at the full game of Go, 19-by-19 boards. And so without any search at all, suddenly we had systems which were playing at the level of the best Monte Carlo tree search systems, the ones with randomized rollouts.
So first, I'm sorry to interrupt, but that's kind of a groundbreaking notion. That's basically a definitive step away from a couple of decades of essentially search dominating AI. So how did that make you feel? Was it surprising from a scientific perspective?
I found this to be profoundly surprising.
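For readers who want the randomized-rollout evaluation mentioned here in concrete form, the following is a minimal sketch under stated assumptions. It is not MoGo's or AlphaGo's code: the game is a hypothetical toy Nim variant (take 1 to 3 stones, taking the last stone wins), and the names `random_rollout` and `rollout_value` are invented for illustration. A position is scored by playing uniformly random moves to the end many times and averaging the outcomes.

```python
import random

MAX_TAKE = 3  # assumed toy rule: a move removes 1-3 stones; taking the last stone wins


def random_rollout(pile, rng):
    """Play uniformly random moves from `pile` (>= 1) until the game ends.

    Returns 1.0 if the player to move at the start wins, else 0.0.
    """
    player = 0  # 0 = the player to move in the position being evaluated
    while True:
        take = rng.randint(1, min(MAX_TAKE, pile))
        pile -= take
        if pile == 0:
            # Whoever just moved took the last stone and wins.
            return 1.0 if player == 0 else 0.0
        player = 1 - player


def rollout_value(pile, n_rollouts=2000, seed=0):
    """Estimate a position's value as the average outcome of random playouts."""
    rng = random.Random(seed)
    total = sum(random_rollout(pile, rng) for _ in range(n_rollouts))
    return total / n_rollouts


if __name__ == "__main__":
    for pile in (1, 2, 5, 10):
        print(pile, rollout_value(pile))
```

Monte Carlo tree search applies the same averaging at every node of a search tree rather than only at one position; that extension is omitted here.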
In fact, it was so surprising that we had a bet back then, and like many good projects, you know, bets are quite motivating. The bet was whether it was possible for a system based purely on deep learning, with no search at all, to beat a dan-level human player. And so we had someone who joined our team who was a dan-level player. He came in and we had this first match against him.
And which side of the bet were you on, by the way? The losing or the winning side?
I tend to be an optimist with the power of deep learning and reinforcement learning. So the system won, and we were able to beat this human dan-level player. And for me that was the moment where it was like, okay, something special is afoot here. We have a system which, without search, is able to just look at this position and understand things as well as a strong human player. From that point onwards I really felt that reaching the top levels of human play, professional level, world champion level, was actually an inevitability. And if it was an inevitable outcome, I was rather keen that it would be us that achieved it. So we scaled up. I had lots of conversations back then with Demis Hassabis, the head of DeepMind, who was extremely excited, and we made the decision to scale up the project and brought more people on board. And so AlphaGo became something where we had a clear goal, which was to try and crack this outstanding challenge of AI, to see if we could beat the world's best players. This led, within the space of not so many months, to playing against the European champion Fan Hui in a match which became memorable in history as the first time a Go program had ever beaten a professional player. And at that time we had to make a judgment as to when and whether we should go and challenge the world champion. This was a difficult decision to make. Again, we were
basing our predictions on our own progress, and had to estimate, based on the rapidity of our own progress, when we thought we would exceed the level of the human world champion. We tried to make an estimate and set up a match, and that became the AlphaGo versus Lee Sedol match in 2016.
And we should say, spoiler alert, that AlphaGo was able to defeat Lee Sedol.
That's right, yeah.
So maybe we could take even a broader view. AlphaGo involves both learning from expert games and, as far as I remember, a self-play component too, where it learns by playing against itself. But in your sense, what was the role of learning from experts there? And in terms of your self-evaluation of whether you could take on the world champion, what was the thing that you were trying to do more of? Sort of train more on expert games, or was there another... I'm asking so many poorly phrased questions, but did you have a hope, a dream, that self-play would be the key component at that moment?
Yes. So in the early days of AlphaGo we used human data to explore the science of what deep learning can achieve. When we had our first paper that showed that it was possible to predict the winner of the game, that it was possible to suggest moves, that was done using human data.
Solely human data?
Yes. And the reason that we did it that way was that at that time we were exploring the deep learning aspect separately from the reinforcement learning aspect. That was the part which was new and unknown to me at that time: how far could that be stretched? Once we had that, it then became natural to try and use that same representation and see if we could learn for ourselves using that same representation. And so right from the beginning, actually, our goal had been to build a system using self-play, and to us the human data right from the beginning was an expedient step to help us, for pragmatic reasons, to go faster towards the goals of the project than we might be able to starting solely from self-play.
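To make "learning solely from self-play" concrete, here is a toy sketch, not AlphaGo's method. It assumes a hypothetical Nim variant (take 1 to 3 stones, taking the last stone wins) and a tabular value update in place of a neural network; the function names and hyperparameters are illustrative assumptions. The same table plays both sides, and no human games are used.

```python
import random
from collections import defaultdict

MAX_TAKE = 3  # assumed toy rule: a move removes 1-3 stones; taking the last stone wins


def legal_moves(pile):
    return range(1, min(MAX_TAKE, pile) + 1)


def train_self_play(start_pile=10, episodes=20000, alpha=0.5, epsilon=0.3, seed=0):
    """Learn move values purely from games the learner plays against itself."""
    rng = random.Random(seed)
    q = defaultdict(float)  # q[(pile, take)]: value for the player about to move
    for _ in range(episodes):
        pile = start_pile
        while pile > 0:
            moves = list(legal_moves(pile))
            if rng.random() < epsilon:
                take = rng.choice(moves)  # explore
            else:
                take = max(moves, key=lambda m: q[(pile, m)])  # exploit
            nxt = pile - take
            if nxt == 0:
                target = 1.0  # the mover took the last stone and wins
            else:
                # The opponent moves next; their best value is our loss (zero-sum).
                target = -max(q[(nxt, m)] for m in legal_moves(nxt))
            q[(pile, take)] += alpha * (target - q[(pile, take)])
            pile = nxt  # the same table now plays the other side
    return q


def greedy_move(q, pile):
    """Pick the move with the highest learned value."""
    return max(legal_moves(pile), key=lambda m: q[(pile, m)])


if __name__ == "__main__":
    q = train_self_play()
    print({pile: greedy_move(q, pile) for pile in (3, 5, 6, 7, 10)})
```

In this toy game the learned greedy policy leaves the opponent a multiple of four stones whenever it can, which is the known winning strategy for this Nim variant.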
And so in those days we were very aware that we were choosing to use human data, and that might not be the long-term holy grail of AI, but that it was something which was extremely useful to us. It helped us to understand the system; it helped us to build deep learning representations which were clear and simple and easy to use. And so really I would say it served a purpose not just as part of the algorithm but as something which I continue to use in our research today, which is trying to break down a very hard challenge into pieces which are easier to understand for us as researchers and developers. So if you use a component based on human data, it can help you to understand the system such that you can then build the more principled version later that does it for itself.
So as I said, the AlphaGo victory, and I don't think I'm romanticizing this notion, I think is one of the greatest moments in the history of AI. So were you cognizant of the magnitude of the accomplishment at the time? Are you cognizant of it even now? Because to me, I feel like it's something that, as we mentioned, the AGI systems of the future will look back on. I think they'll look back at the AlphaGo victory as like, holy crap, they figured it out. This is where it started.
Well, thank you again. I mean, it's funny, because I guess I've been working on computer Go for a long time. At the time of the AlphaGo match I'd been working on computer Go for more than a decade, and throughout that decade I'd had this dream of what it would be like to actually be able to build a system that could play against the world champion. And I imagined that would be an interesting moment, that maybe some people might care about that, and that this might be a nice achievement. But I think when I arrived in Seoul and discovered the legions of journalists that were following us around, and 100 million people that were
watching the match online live, I realized that I had been off in my estimation of how significant this moment was by several orders of magnitude. And so there was definitely an adjustment process to realize that this was something which the world really cared about and which was a watershed moment. And I think there was that moment of realization. It was also a little bit scary, because, you know, if you go into something thinking it's going to be maybe of interest and then discover that 100 million people are watching, it suddenly makes you worry about whether some of the decisions you've made were really the best ones, or the wisest, or were going to lead to the best outcome. And we knew for sure that there were still imperfections in AlphaGo which were going to be exposed to the whole world watching. And so, yeah, it was I think a great experience, and I feel privileged to have been part of it, privileged to have led that amazing team. I feel privileged to have been in a moment of history, like you say, but also lucky that in a sense I was insulated from the knowledge of it. I think it would have been harder to focus on the research if the full reality of what was going to come to pass had been known to me and the team. We were in our bubble, and we were working on research, and we were trying to answer the scientific questions, and then, bam, the public sees it. And I think it was better that way, in retrospect.
Were you confident? I guess, what were the chances that you could get the win? So just like you said, I'm a little bit more familiar with another accomplishment that we may not even get a chance to talk about. I talked to Oriol Vinyals about AlphaStar, which is another incredible accomplishment. But with AlphaStar and beating StarCraft, there was already a track record with AlphaGo. This is the really first time you get to see reinforcement learning face
the best human in the world. So what was your confidence like? What were the odds?
Well, we actually... was there a bet? Funnily enough, there was. So just before the match, we weren't betting on anything concrete, but we all held out a hand. Everyone in the team held out a hand at the beginning of the match, and the number of fingers that they had out on that hand was supposed to represent how many games they thought we would win against Lee Sedol. And there was an amazing spread in the team's predictions. But I have to say I predicted 4-1, and the reason was based purely on data. I'm a scientist first and foremost, and one of the things which we had established was that AlphaGo, in around one in five games, would develop something which we called a delusion, which was a kind of hole in its knowledge where it wasn't able to fully understand everything about the position, and that hole in its knowledge would persist for tens of moves throughout the game. And we knew two things. We knew that if there were no delusions, AlphaGo seemed to be playing at a level that was far beyond any human capabilities, but we also knew that if there were delusions, the opposite was true. And in fact, that's what came to pass. We saw all of those outcomes, and Lee Sedol in one of the games played a really beautiful sequence that AlphaGo just hadn't predicted, and after that it led it into this situation where it was unable to really understand the position fully and found itself in one of these delusions. So indeed, yeah, 4-1 was the outcome.
So yeah, and can you maybe speak to it a little bit more? What were the five games like? What happened? Are there interesting things that come to memory in terms of the play of the human or the machine?
So I remember all of these games vividly, of course. Moments like these don't come too often in the lifetime of a scientist. And the first game was
magical, because it was the first time that a computer program had defeated a world champion in this grand challenge of Go. And there was a moment where AlphaGo invaded Lee Sedol's territory towards the end of the game, and that's quite an audacious thing to do. It's like saying, hey, you thought this was going to be your territory in the game, but I'm going to stick a stone right in the middle of it and prove to you that I can break it up. And Lee Sedol's face just dropped. He wasn't expecting a computer to do something that audacious. The second game became famous for a move known as move 37. This was a move played by AlphaGo that broke all of the conventions of Go. The Go players were so shocked by this that they thought maybe the operator had made a mistake; they thought that there was something crazy going on. It just broke every rule that Go players are taught from a very young age. They're just taught, you know, this kind of move, called a shoulder hit, you can only play it on the third line or the fourth line, and AlphaGo played it on the fifth line. And it turned out to be a brilliant move, and made this beautiful pattern in the middle of the board that ended up winning the game. And so this really was a clear instance where we could say computers exhibited creativity, that this was really a move that was something humans hadn't known about, hadn't anticipated, and computers discovered this idea. They were the ones to say, actually, here's a new idea, something new, not in the domains of human knowledge of the game. And now the humans think this is a reasonable thing to do, and it's part of Go knowledge now. The third game: something special happens when you play against a human world champion, which again I hadn't anticipated before going there, which is, you know, these players are amazing. Lee Sedol was a true champion, eighteen-time world champion, and had this amazing ability to
probe AlphaGo for weaknesses of any kind. In the third game he was losing, and we felt we were sailing comfortably to victory, but he managed to, from nothing, stir up this fight and build what's called a double ko, these kind of repetitive positions. He knew that historically no computer Go program had ever been able to deal correctly with double ko positions, and he managed to summon one out of nothing. And so for us this was a real challenge: would AlphaGo be able to deal with this, or would it just crumble in the face of this situation? And fortunately it dealt with it perfectly. The fourth game was amazing in that Lee Sedol appeared to be losing this game, AlphaGo thought it was winning, and then Lee Sedol did something which I think only a true world champion can do, which is he found a brilliant sequence in the middle of the game, a brilliant sequence that led him to really just transform the position. It was just a piece of genius, really. And after that, AlphaGo's evaluation just tumbled. It thought it was winning this game, and all of a sudden it tumbled and said, oh, now I've got no chance, and it started to behave rather oddly at that point. In the final game, for some reason, we as a team were convinced, having seen AlphaGo in the previous game suffer from delusions, that it was suffering from another delusion. We were convinced that it was misevaluating the position and that something was going terribly wrong. And it was only in the last few moves of the game that we realized that actually, although it had been predicting it was going to win all the way through, it really was. And so somehow it just taught us yet again that you have to have faith in your systems. When they exceed your own level of ability and your own judgment, you have to trust in them to know better than you, the designer. Once you've bestowed in them the ability to
judge better than you can, then trust the system to do so.
So just like in the case of Deep Blue beating Garry Kasparov, for Garry it was, I think, the first time he'd ever lost, actually, to anybody. And I mean, there's a similar situation with Lee Sedol. It's a tragic loss for humans, but a beautiful one. I think from the tragedy, over time, emerges the kind of inspiring story. But Lee Sedol recently announced his retirement. I don't know if we can look too deeply into it, but he did say that "even if I become number one, there's an entity that cannot be defeated." So what do you think about these words? What do you think about his retirement from the game of Go?
Well, let me take you back first of all to the first part of your comment, about Garry Kasparov, because actually at the panel yesterday he specifically said that when he first lost to Deep Blue, he viewed it as a failure. He viewed that this had been a failure of his. But later on in his career he said he'd come to realize that actually it was a success. It was a success for everyone, because this marked a transformational moment for AI. And so even for Garry Kasparov, he came to realize that that moment was pivotal and actually meant something much more than his personal loss in that moment. Lee Sedol, I think, was much more cognizant of that even at the time. In his closing remarks to the match, he really felt very strongly that what had happened in the AlphaGo match was not only meaningful for AI but for humans as well. And he felt as a Go player that it had opened his horizons and meant that he could start exploring new things. It brought his joy back for the game of Go, because it had broken all of the conventions and barriers and meant that suddenly anything was possible again. And so I was sad to hear that he'd retired, but he's been a great world champion over many, many years, and I think that he'll be
remembered for that evermore. He'll be remembered as the last person to beat AlphaGo. I mean, after that we increased the power of the system, and the next version of AlphaGo beat the other strong human players 60 games to nil. So what a great moment for him, and something to be remembered for.
It's interesting that you spent time at AAAI on a panel with Garry Kasparov. I'm just curious to learn about the conversations you've had with Garry, because he's also now written a book about artificial intelligence. He's thinking about AI, he has kind of a view of it, and he talks about AlphaGo a lot. What's your sense? Arguably, and I'm not just saying this being Russian, I think Garry is the greatest chess player of all time, probably one of the greatest game players of all time, and you're sort of at the center of creating a system that beats one of the greatest players of all time. So what's that conversation like? Is there anything, any interesting digs, any bets, any funny things, any profound things?
So Garry Kasparov has an incredible respect for what we did with AlphaGo, and it's an amazing tribute coming from him, of all people, that he really appreciates and respects what we've done. And I think he feels that the progress which has happened in computer chess, where later, after AlphaGo, we built the AlphaZero system, which defeated the world's strongest chess programs... to Garry Kasparov, that moment in computer chess was more profound than Deep Blue. And the reason he believes it mattered more was because it was done with learning, and a system which was able to discover for itself new principles, new ideas, which were able to play the game in a way which he hadn't always known about, or anyone. And in fact, one of the things I discovered at this panel was that the current world champion, Magnus Carlsen, apparently recently commented on his improvement in
performance, and he attributes it to AlphaZero: he's been studying the games of AlphaZero, and he's changed his style to play more like AlphaZero, and it's led to him actually increasing his rating to a new peak.
Yeah, I guess to me, just like to Garry, the inspiring thing is that, just like you said, with reinforcement learning and deep learning, machine learning feels like what intelligence is. And you could attribute it to sort of a bitter viewpoint, from Garry's perspective, from us humans' perspective, saying that the pure search that IBM Deep Blue was doing is not really intelligence, but somehow it didn't feel like it. And so that's the magical thing. I'm not sure what it is about learning that feels like intelligence, but it does.
So I think we should not demean the achievements of what was done in previous eras of AI. I think that Deep Blue was an amazing achievement in itself, and that heuristic search of the kind that was used by Deep Blue had some powerful ideas in it, but it also missed some things. The fact that the evaluation function, the way that the chess position was understood, was created by humans and not by the machine is a limitation, which means that there's a ceiling on how well it can do. But maybe more importantly, it means that the same idea cannot be applied in other domains where we don't have access to the kind of human grandmasters and that ability to encode exactly their knowledge into an evaluation function. And the reality is that the story of AI is that most domains turn out to be of the second type, where knowledge is messy, it's hard to extract from experts, or it isn't even available. And so we need to solve problems in a different way. And I think AlphaGo is a step towards solving things in a way which puts learning as a first-class citizen and says systems need to understand for themselves how to understand the world, how to judge the value of any
action that they might take within that world, in any state they might find themselves in. And in order to do that, we make progress towards AI.
Yeah, so one of the nice things about taking a learning approach to the game of Go, to game playing, is that the things you learn, the things you figure out, are actually going to be applicable to other problems that are real-world problems. So ultimately, I mean, there are two really interesting things about AlphaGo. One is the science of it, just the science of learning, the science of intelligence. And then the other is that you're actually figuring out how to build systems that would be potentially applicable in other applications: medical, autonomous vehicles, robotics. I mean, it just opens the door to all kinds of applications. So the next incredible step, really the profound step, is probably AlphaGo Zero. I mean, it's arguable, I kind of see them all as the same place, and perhaps you were already thinking that AlphaGo Zero was the natural next step, that it was always going to be the next step, but it's removing the reliance on human expert games for pre-training, as you mentioned. So how big of an intellectual leap was this, that self-play could achieve superhuman-level performance on its own? And maybe could you also say what self-play is? We kind of mentioned it a few times.
So let me start with self-play. The idea of self-play is something which is really about systems learning for themselves, but in the situation where there's more than one agent. And so if you're in a game, and a game is played between two players, then self-play is really about understanding that game just by playing games against yourself rather than against any actual real opponent. And so it's a way to discover strategies without having to actually go out and play against any particular human player, for example. The main idea of AlphaZero was really to, you know, try and step back from any of the knowledge that
we'd put into the system and ask the question: is it possible to come up with a single elegant principle by which a system can learn for itself all of the knowledge which it requires to play a game such as Go? Importantly, by taking knowledge out, you not only make the system less brittle, in the sense that perhaps the knowledge you were putting in was just getting in the way and maybe stopping the system learning for itself, but you also make it more general. The more knowledge you put in, the harder it is for a system to actually be taken out of the system in which it's been designed and placed in some other system that maybe would need a completely different knowledge base to understand and perform well. And so the real goal here is to strip out all of the knowledge that we put in, to the point that we can just plug it into something totally different. And that, to me, is really the promise of AI: that we can have systems such as that, where no matter what goal we set to the system, we have an algorithm which can be placed into that world, into that environment, and can succeed in achieving that goal. And that, to me, is almost the essence of intelligence, if we can achieve that. And so AlphaZero is a step towards that, and it's a step that was taken in the context of two-player perfect-information games like Go and chess. We also applied it to Japanese chess.
So just to clarify, the first step was AlphaGo Zero.
The first step was to try and take all of the knowledge out of AlphaGo in such a way that it could play in a fully self-discovered way, purely from self-play. And to me, the motivation for that was always that we could then plug it into other domains, but we saved that until later. Well, in fact, I mean, just for fun, I could tell you exactly the moment where the idea for AlphaZero occurred to me, because I think there's maybe a lesson there for
researchers who are kind of too deeply embedded in their research and, you know, working 24/7 to try and come up with the next idea, which is that it actually occurred to me on honeymoon. I was at my most fully relaxed state, really enjoying myself, and the algorithm for AlphaZero just appeared, in its full form. And this was actually before we played against Lee Sedol, but I think we were so busy trying to make sure we could beat the world champion that it was only later that we had the opportunity to step back and start examining that sort of deeper scientific question of whether this could really work.
So nevertheless, self-play is probably one of the most profound ideas that represents, to me at least, artificial intelligence. But the fact that you could use that kind of mechanism to beat world-class players, that's very surprising. To me, it feels like you have to train on a large number of expert games. So was it surprising to you? What was the intuition? Can you sort of think, not necessarily at that time, even now, what's your intuition for why this thing works so well, why it's able to learn from scratch?
Well, let me first say why we tried it. We tried it both because I feel that it was the deeper scientific question to be asking to make progress towards AI, and also because, in general, in my research I don't like to do research on questions for which we already know the likely outcome. I don't see much value in running an experiment where you're 95% confident that you will succeed. And so we could have tried, you know, to take AlphaGo and do something which we knew for sure it would succeed on, but much more interesting to me was to try it on the things which we weren't sure about. And one of the big questions on our minds back then was, you know, could you really do this with self-play alone? How far could that go? Would it be as strong? And honestly, we weren't sure. Yeah, it was
50/50, I think. You know, if you'd asked me, I wasn't confident that it could reach the same level as these systems, but it felt like the right question to ask. And even if it had not achieved the same level, I felt that that was an important direction to be studying. And so then, lo and behold, it actually ended up outperforming the previous version of AlphaGo, and indeed was able to beat it by 100 games to zero.
So what's the intuition as to why? I think that the intuition to me is clear. Whenever you have errors in a system, as we did in AlphaGo (AlphaGo suffered from these delusions; occasionally it would misunderstand what was going on in a position and misevaluate it), how can you remove all of these errors? Errors arise from many sources. For us, they were arising both from the human data that it started from, but also from the nature of the search and the nature of the algorithm itself. But the only way to address them in any complex system is to give the system the ability to correct its own errors. It must be able to correct them. It must be able to learn for itself when it's doing something wrong and correct for it. And so it seemed to me that the way to correct delusions was indeed to have more iterations of reinforcement learning; that, you know, no matter where you start, you should be able to correct those errors, until it gets to play that out and understand: oh, well, I thought that I was going to win in this situation, but then I ended up losing. That suggests that I was misevaluating something; there's a hole in my knowledge. And now the system can correct for itself and understand how to do better. Now, if you take that same idea and trace it back all the way to the beginning, it should be able to take you from no knowledge, from a completely random starting point, all the way to the highest levels of knowledge that you can achieve in a domain. And the principle is the same: that if you bestow a
system with the ability to correct its own errors, then it can take you from random to something slightly better than random, because it sees the stupid things that the random is doing and it can correct them. And then it can take you from that slightly better system and understand what that's doing wrong, and it takes you on to the next level, and the next level, and this progress can go on indefinitely. And indeed, you know, what would have happened if we'd carried on training AlphaGo Zero for longer? We saw no sign of it slowing down its improvements, or at least it was certainly carrying on to improve. And presumably, if you had the computational resources, this could lead to better and better systems that discover more and more.
So your intuition is fundamentally that there's not a ceiling to this process. One of the surprising things, just like you said, is the process of patching errors. It intuitively makes sense that reinforcement learning should be part of that process. But what is surprising is that in the process of patching your own lack of knowledge, you don't open up other patches. You keep going; it's like there's a monotonic decrease of your weaknesses.
Well, let me back this up. You know, I think science always should make falsifiable hypotheses. So let me back up this claim with a falsifiable hypothesis, which is that if someone was to, in the future, take AlphaZero as an algorithm and run it with greater computational resources than we have available today, then I predict that they would be able to beat the previous system 100 games to zero. And if they were then to do the same thing a couple of years later, that would beat that previous system 100 games to zero, and that process would continue indefinitely throughout at least my human lifetime.
Presumably the game of Go would set the ceiling.
I mean, the game of Go would set the ceiling, but the game of Go has ten to the hundred and seventy states in it,
so the ceiling is unreachable by any computational device that can be built out of the, you know, 10 to the 80 atoms in the universe. You asked a really good question, which is, you know, do you not open up other errors when you correct your previous ones? And the answer is yes, you do. And so it's a remarkable fact about this class of two-player game, and also true of single-agent games, that essentially, if you have sufficient representational resource (imagine you could represent every state of the game in a big table), then we know for sure that a process of self-improvement will lead all the way, in the single-agent case, to the optimal possible behavior, and in the two-player case to the minimax optimal behavior, that is, the best way that I can play knowing that you're playing perfectly against me. And so for those cases, we know that even if you do open up some new error, in some sense you've made progress; you're progressing towards the best that can be done.
So AlphaGo was initially trained on expert games with some self-play. AlphaGo Zero removed the need to be trained on expert games. And then another incredible step, for me, because I just love chess, is to generalize that further in AlphaZero, to be able to play the game of Go, beating AlphaGo Zero and AlphaGo, and then also being able to play the game of chess and others. So what was that step like? What are the interesting aspects there that were required to make that happen?
I think the remarkable observation which we saw with AlphaZero was that, actually, without modifying the algorithm at all, it was able to play and crack some of AI's greatest previous challenges. In particular, we dropped it into the game of chess, and unlike the previous systems like Deep Blue, which had been worked on for, you know, years and years, we were able to beat the world's strongest computer chess program convincingly, using a system that was fully
discovered on its own, from scratch, with its own principles. And in fact, one of the nice things that we found was that we also achieved the same result in Japanese chess, a variant of chess where you get to capture pieces and then place them back down on your own side as an extra piece, so a much more complicated variant of chess. And we also beat the world's strongest programs and reached superhuman performance in that game too. And the very first time that we'd ever run the system on that particular game was the version that we published in the paper on AlphaZero. It just worked out of the box, literally no touching it. We didn't have to do anything, and there it was: superhuman performance, no tweaking, no twiddling. And so I think there's something beautiful about that principle, that you can take an algorithm and, without twiddling anything, it just works. Now, to go beyond AlphaZero, what's required? AlphaZero is just a step, and there's a long way to go beyond that to really crack the deep problems of AI. But one of the important steps is to acknowledge that the world is a really messy place. You know, it's this rich, complex, beautiful, but messy environment that we live in, and no one gives us the rules. Like, no one knows the rules of the world. At least, maybe we understand that it operates according to Newtonian or quantum mechanics at the micro level, or according to relativity at the macro level, but that's not a model that's useful for us as people to operate in it. Somehow the agent needs to understand the world for itself, in a way where no one tells it the rules of the game, and yet it can still figure out what to do in that world: deal with this stream of observations coming in, rich sensory input coming in, actions going out, in a way that allows it to reason in the way that AlphaGo or AlphaZero can reason, in the way that these Go and chess-playing programs can reason, but in a way that allows it to take
actions in that messy world to achieve its goals. And so this led us to the most recent step in the story of AlphaGo, which was a system called MuZero. MuZero is a system which learns for itself even when the rules are not given to it. It actually can be dropped into a system with messy perceptual inputs. We actually tried it in some Atari games, the canonical domains of Atari that have been used for reinforcement learning, and this system learned to build a model of these Atari games that was sufficiently rich and useful for it to be able to plan successfully. And in fact, that system not only went on to beat the state of the art in Atari, but the same system, without modification, was able to reach the same level of superhuman performance in Go, chess, and shogi that we'd seen in AlphaZero, showing that even without the rules the system can learn for itself, just by trial and error, just by playing this game of Go. No one tells you what the rules are, but you just get to the end and someone says, you know, win or loss. Or you play a game of Breakout in Atari and someone just tells you your score at the end. And the system for itself figures out essentially the rules of the system, the dynamics of the world, how the world works, and not in any explicit way, but just implicitly, enough understanding for it to be able to plan in that system in order to achieve its goals.
And that's the fundamental process that you have to go through when you're facing any uncertain kind of environment, as you would in the real world: figuring out the sort of basic rules of the game.
That's right.
So, I mean, that allows it to be applicable to basically any domain that could be digitized in the way that it needs to be in order to be consumable, sort of in order for the reinforcement learning framework to be able to sense the environment, to be able to act,
and so on.
The full reinforcement learning problem needs to deal with worlds that are unknown and complex, and the agent needs to learn for itself how to deal with that. And so MuZero is, I felt, a further step in that direction.
One of the things that inspired the general public, just in conversations I have, like with my parents or something, my mom, who just loves what was done, is kind of at least the notion that there was some display of creativity, some new strategies, new behaviors that were created. That again has echoes of intelligence. So is there something that stands out? Do you see it the same way, that there's creativity, and are there some behaviors, patterns that you saw that AlphaZero was able to display that are truly creative?
So let me start by saying that I think we should ask what creativity really means. To me, creativity means discovering something which wasn't known before, something unexpected, something outside of our norms. And so in that sense, the process of reinforcement learning, or the self-play approach that was used by AlphaZero, is the essence of creativity. It's really saying that at every stage you're playing according to your current norms, and you try something, and if it works out, you say, hey, here's something great, I'm going to start using that. And then that process is like a micro-discovery that happens millions and millions of times over the course of the algorithm's life, where it just discovers some new idea: oh, this pattern is working really well for me, I'm going to start using that. Oh, now here's this other thing I can do: I can start to connect these stones together in this way, or I can start to sacrifice stones or give up on pieces, or play shoulder hits on the fifth line, or whatever it is. The system is discovering things like this for itself continually, repeatedly, all the time. And so it should come as no surprise to us, then, that if you leave these systems going, they discover things
that are not known to humans, that by human norms are considered creative. And we've seen this several times. In fact, in AlphaGo Zero we saw this beautiful timeline of discovery, where what we saw was that there are these opening patterns that humans play called joseki. These are like the patterns that humans learn to play in the corners, and they've been developed and refined over literally thousands of years in the game of Go. And what we saw was that in the course of training AlphaGo Zero, over the course of the 40 days that we trained this system, it started to discover exactly these patterns that human players play. And over time, we found that all of the joseki that humans played were discovered by the system through this process of self-play and this sort of essential notion of creativity. But what was really interesting was that over time it then started to discard some of these in favor of its own joseki that humans didn't know about. And it starts to say, oh well, you thought that the knight's move pincer joseki was a great idea, but here's something different you can do there, which makes some new variation that the humans didn't know about. And actually now the human Go players study the joseki that AlphaGo played, and they become the new norms that are used in today's top-level Go competitions.
That never gets old. Even just the first part, to me, maybe just makes me feel good as a human being, that a self-play mechanism that knows nothing about us humans discovers patterns that we humans do. It's just like an affirmation that we're doing okay as humans in this domain, and in other domains we figured it out. It's like the Churchill quote about democracy: you know, it sucks, but it's the best one we've tried. So, in general, taking a step outside of Go, and you have like a million accomplishments that I have no time to talk about, with AlphaStar and so on, and the current work, but in general, this self-play mechanism that you've inspired the world with by
beating the world champion Go player: do you see that as being applied in other domains? Do you have sort of dreams and hopes that it's applied in both the simulated environments and the constrained environments of games? Constrained, I mean, AlphaStar really demonstrates that you can remove a lot of the constraints, but nevertheless it's in a digital simulated environment. Do you have a hope, a dream, that it starts being applied in the robotics environment, and maybe even in domains that are a little safety-critical and so on, and have a real impact in the real world, like autonomous vehicles, for example? It seems like a very far-out dream at this point.
So I absolutely do hope and imagine that we will get to the point where ideas just like these are used in all kinds of different domains. In fact, one of the most satisfying things as a researcher is when you start to see other people use your algorithms in unexpected ways. So in the last couple of years there have been, you know, a couple of Nature papers where different teams, unbeknownst to us, took AlphaZero and applied exactly those same algorithms and ideas to real-world problems of huge meaning to society. So one of them was the problem of chemical synthesis, and they were able to beat the state of the art in finding pathways of how to actually synthesize chemicals, retro chemical synthesis. And the second paper, which actually just came out a couple of weeks ago in Nature, showed that in quantum computation, you know, one of the big questions is how to understand the nature of the function in quantum computation, and a system based on AlphaZero beat the state of the art by quite some distance there again. So these are just examples. And I think, you know, the lesson which we've seen elsewhere in machine learning time and time again is that if you make something general, it will be used in all kinds of ways. You know, you provide really powerful tools to society, and those
tools can be used in amazing ways. And so I think we're just at the beginning, and for sure I hope that we see all kinds of outcomes.
So the other side of the question of a reinforcement learning framework is, you know, you usually want to specify a reward function and an objective function. What do you think about sort of ideas of intrinsic rewards? If we take human beings as an existence proof, we don't seem to be operating according to a single reward. Do you think that there are interesting ideas for when you don't know how to truly specify the reward, you know, that there's some flexibility for discovering it intrinsically or so on, in the context of reinforcement learning?
So I think, you know, when we think about intelligence, it's really important to be clear about the problem of intelligence, and I think it's clearest to understand that problem in terms of some ultimate goal that we want the system to try and solve for. And after all, if we don't understand the ultimate purpose of the system, do we really even have a clearly defined problem that we are solving at all? Now, within that, as with your example for humans, the system may choose to create its own motivations and subgoals that help the system to achieve its ultimate goal, and that may indeed be a hugely important mechanism to achieve those ultimate goals. But there is still some ultimate goal that I think the system needs to be measurable and evaluated against. And even for humans, I mean, humans are incredibly flexible. We feel that, you know, any goal that we're given, we feel we can master to some degree. But if we think of those goals, really, you know, like the goal of being able to pick up an object, or the goal of being able to communicate or influence people to do things in a particular way, or whatever those goals are, really they're subgoals that we set ourselves. You know, we choose to pick up the
object, we choose to communicate, we choose to influence someone else, and we choose those because we think it will lead us to something later on. We think that that's helpful to us to achieve some ultimate goal. Now, I don't want to speculate whether or not humans as a system necessarily have a singular overall goal of survival or whatever it is, but I think the principle for understanding and implementing intelligence has to be that if we're trying to understand intelligence or implement our own, there has to be a well-defined problem. Otherwise, if it's not, I think it's like an admission of defeat. For there to be hope for understanding or implementing intelligence, we have to know what we're doing. We have to know what we're asking the system to do. Otherwise, if you don't have a clearly defined purpose, you're not going to get a clearly defined answer.
The ridiculous big question that has to naturally follow, because I have to pin you down on this thing, is that nevertheless one of the big silly, or big real, questions before humans is the meaning of life: it's us trying to figure out our own reward function. And you just kind of mentioned that if you want to build intelligent systems and you know what you're doing, you should be at least cognizant to some degree of what the reward function is. So the natural question is, what do you think is the reward function of human life, the meaning of life for us humans, the meaning of our existence?
I think, you know, I'd be speculating beyond my own expertise, but just for fun, let me do that and say I think that there are many levels at which you can understand a system, and you can understand something as optimizing for a goal at many levels. And so let's start with the universe. Does the universe have a purpose? Well, it feels like at one level it's just following certain mechanical laws of physics, and that's led to the
development of the universe. But at another level you can view it as, actually, there's the second law of thermodynamics that says that this is increasing in entropy over time forever. And now there's a view that's been developed by certain people at MIT that you can think of this as almost like a goal of the universe, that the purpose of the universe is to maximize entropy. So there are multiple levels at which you can understand a system. The next level down, you might say, well, if the goal is to maximize entropy, well, how can that be done by a particular system? And maybe evolution is something that the universe discovered in order to kind of dissipate energy as efficiently as possible. And by the way, I'm borrowing from Max Tegmark for some of these metaphors, the physicist. But if you can think of evolution as a mechanism for dispersing energy, then evolution, you might say, then becomes a goal, which is: if evolution disperses energy by reproducing as efficiently as possible, what's evolution then? Well, it's now got its own goal within that, which is to actually reproduce as effectively as possible. And now, how is reproduction made as effective as possible? Well, you need entities within that that can survive and reproduce as effectively as possible. And so it's natural, in order to achieve that high-level goal, that those individual organisms discover brains, intelligences, which enable them to support the goals of evolution. And those brains, what do they do? Well, perhaps the early brains, maybe they were controlling things at some direct level. You know, maybe they were the equivalent of pre-programmed systems which were directly controlling what was going on and setting certain, you know, things in order to achieve these particular goals. But that led to another level of discovery, which was learning systems, you know, parts of the brain which were able to learn for themselves and learn how to program themselves to
achieve any goal. And presumably there are parts of the brain where goals are set to parts of that system, and that provides this very flexible notion of intelligence that we as humans presumably have, which is the reason we feel that we can achieve any goal. So it's a very long-winded answer to say that, you know, I think there are many perspectives and many levels at which intelligence can be understood, and at each of those levels you can take multiple perspectives. You know, you can view the system as something which is optimizing for a goal, which is understanding it at a level by which we can maybe implement it and understand it as AI researchers or computer scientists, or you can understand it at the level of the mechanistic thing which is going on, that there are these, you know, atoms bouncing around in the brain and they lead to the outcome of that system. That is not in contradiction with the fact that it's also a decision-making system that's optimizing for some goal and purpose.
I've never heard the description of the meaning of life structured so beautifully in layers. But you did miss one layer, which is the next step, which you're responsible for, which is creating the artificial intelligence layer on top of that. And I can't wait to see, well, I may not be around, but I can't wait to see what the next layer beyond that will be.
Well, let's just take that argument, you know, and pursue it to its natural conclusion. So the next level indeed is: how can our learning brain achieve its goals most effectively? Well, maybe it does so by us, as learning beings, building a system which is able to solve for those goals more effectively than we can. And so when we build a system to play the game of Go, you know, when I said that I wanted to build a system that can play Go better than I can, I've enabled myself to achieve that goal of playing Go better than I could by directly
playing it and learning it myself. And so now a new layer has been created, which is systems which are able to achieve goals for themselves. And ultimately there may be layers beyond that, where they set subgoals to parts of their own system in order to achieve those, and so forth. So incredible. So the story of intelligence, I think, is a multi-layered one and a multi-perspective one.
We live in an incredible universe. David, thank you so much, first of all for dreaming of using learning to solve Go and building intelligent systems, and for actually making it happen, and for inspiring millions of people in the process. It's truly an honor. Thank you so much for talking today.
Okay, thank you.
Thanks for listening to this conversation with David Silver, and thank you to our sponsors, MasterClass and Cash App. Please consider supporting the podcast by signing up to MasterClass at masterclass.com/lex and downloading Cash App and using code LexPodcast. If you enjoy this podcast, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at lexfridman. And now let me leave you with some words from David Silver: "My personal belief is that we've seen something of a turning point, where we're starting to understand that many abilities, like intuition and creativity, that we've previously thought were in the domain only of the human mind are actually accessible to machine intelligence as well. And I think that's a really exciting moment in history." Thank you for listening, and hope to see you next time.