Better together

**The Unsettling Nature of Cooperation in AI**

As we navigate the complex world of artificial intelligence, one phenomenon has caught our attention: cooperation. To see what it feels like in practice, I played the collaborative cooking game Overcooked with two different AI partners. We were a team, although I didn't know until after playing that the first partner had been trained by playing the game with a replica of itself lots of times. This strategy works well in competitive games like Go or chess, where strategy and planning are key, but it's not always conducive to cooperation. After 90 seconds of gameplay we managed to deliver a total of four dishes to our hungry customers, not a bad score, but I can't say I played much of a role in that. My second teammate was trained on a range of partners with different playing styles and was, in theory, the more cooperative one.

As we continued to play, it became apparent that my second partner was a lot slower than the first. But they were getting in my way a lot less, which I enjoyed. We worked slightly better as a team this time around, though maybe that was because we were both equally rubbish. We delivered four dishes the first time and three the second, and the game then asked which partner I preferred. The reality is that even though we did better with the first partner, I enjoyed the experience more with the one who was better matched to my speed. Is that an unusual choice?

**The Importance of Partner Response**

Kevin McKee, the research scientist who paired me with my AI teammates, thinks it's the usual choice: I felt that my partner was more responsive to my behavior and the way that I was playing. The traditional approach to training AI to play games like StarCraft and Go has relied on getting agents to play endless games against themselves in order to get as good as possible. In cooperative contexts, though, the way that your partner manages to align their behavior with yours probably matters a lot. Winning is not just about having the best player on your team; it's about working together.

One thing you might have noticed by now is that a lot of this cooperation work is currently being done in simulation or in gaming scenarios. It's absolutely safe; nothing gets broken. But we want our agents ultimately to work in the real world, and the real world will always differ from simulation unless we believe we're living in a simulation. How do we overcome this simulation-to-reality gap? There are different general strategies that researchers have identified. One approach is to make the simulation as accurate in reflecting reality as possible. Another strategy is to create as much diversity in simulation as we can, so that the resulting agents are not changing their behavior depending on details of the environment.
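The second strategy, diversity in simulation, is often called domain randomization. The sketch below is only an illustration of that idea, not anything described in the episode: it assumes a hypothetical kitchen simulator whose layout, partner speed and cooking time can be varied, and every name in it (`KitchenConfig`, `sample_config`, `training_configs`) and every numeric range is made up for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class KitchenConfig:
    """Hypothetical simulator settings; none of these names come from the episode."""
    width: int            # kitchen width in grid cells
    height: int           # kitchen height in grid cells
    partner_speed: float  # partner moves per second
    cook_time: float      # seconds a soup needs to cook


def sample_config(rng: random.Random) -> KitchenConfig:
    """Draw one randomized environment (ranges are arbitrary assumptions)."""
    return KitchenConfig(
        width=rng.randint(5, 9),
        height=rng.randint(4, 7),
        partner_speed=rng.uniform(0.5, 2.0),
        cook_time=rng.uniform(5.0, 20.0),
    )


def training_configs(num_episodes: int, seed: int = 0) -> list[KitchenConfig]:
    """Return one freshly sampled configuration per training episode."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(num_episodes)]


if __name__ == "__main__":
    # Each episode sees a different kitchen and a differently paced partner.
    for config in training_configs(3):
        print(config)
```

Because every episode draws a new kitchen and a new partner speed, an agent trained this way has less opportunity to rely on any one detail of the environment, which is the property the diversity strategy is after.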

**The Simulation-Reality Gap**

The question of whether an agent can truly be considered cooperative may be a bit of a red herring. As Thore Graepel puts it, if the agents behave in a way that is beneficial for their cooperation partner and for themselves, then he would call it cooperation, without drilling down on whether they meant it. It reminds him of asking his son to mow the lawn: the son says, "Okay, I'll do it," Thore adds, "And you have to enjoy doing it," and the son replies that that's asking too much. But even if researchers like Thore and Kevin are highly successful in getting agents to cooperate in simulation, they will still need to bridge the simulation-to-reality gap.

The real world is messy, full of unpredictable actors and unforeseen circumstances, and there are researchers who believe that it's only by having embodied AI in the real world that true artificial general intelligence can be achieved. In our next episode, we're going to be paying a visit to one of my favorite places, the DeepMind robotics lab, where one robot trying to walk backwards looks as if it has just given up. That's next time on DeepMind: The Podcast, presented by me, Hannah Fry, and produced by Dan Hardoon at Whistledown Productions. Special thanks to Goodkill Music, who produced the catchy tune for the Overcooked game, which we used in this episode. If you've been enjoying this series, please do leave us a rating or review if you can.

Welcome back to DeepMind: The Podcast. I'm Hannah Fry, a mathematician who's been following the remarkable progress in artificial intelligence in recent years. In this series, I'm talking to scientists, researchers and engineers about how their latest work is changing our world. DeepMind's big goal is solving intelligence, and over the next few episodes we're going to be asking what that could look like, and just what capabilities will be needed to create artificial general intelligence. Last time, we heard why lots of work is going into large language models that give machines the ability to communicate with humans. But there are some people here who believe that while communication is an important tool in its own right, its real value lies in facilitating another skill that is integral to human intelligence: the ability to cooperate. This is episode three, Better Together.

Communication acts as a booster for cooperation. If you want to cooperate, you first need to understand not only what your needs or desires are, but also what the other person's needs or desires are.

That is the voice of Thore Graepel. Until recently, he led the multi-agent team at DeepMind. Thore and his colleagues have defined cooperative AI as helping humans and machines find ways to improve their joint welfare, and building something capable of that goal is, he believes, a crucial milestone in the evolution of artificial intelligence.

The highest level of intelligence that we know of is human intelligence, and humans are super cooperators. Among all the animals, humans stand out as the species that is best at cooperation. You know, that's how we've built our civilization, and if we want to reach that level of intelligence with machines, then we need to teach machines to cooperate.

Admittedly, whenever you read in the news about conflicts and poverty and discrimination around the world, it doesn't exactly seem as though humans are paragons of cooperation. But it's also true that the endeavors of
which we humans are most proud, the biggest achievements of science, of art, of engineering or literature, are almost never down to a single person. The invention of the railway, even the creation of the COVID-19 vaccine: they all required great swells of people pulling in the same direction.

Is it almost like intelligence at the level of a society, rather than just at the level of an individual?

Yes. You could also take the viewpoint that the group itself, maybe a family or maybe a whole country, is an intelligent entity, because people within it work together towards common goals.

This isn't just a theoretical idea that will make a nice paper for an academic conference.

With self-driving cars, we don't want to be in a situation where you're trying to merge into the left lane, I'm trying to merge into the right lane, and we block each other because neither of our cars is willing to cede their own self-interest for a moment. In the protection of the environment, individually each one of us thinks, what difference does it make if I don't save on carbon emissions? But if we all think that way, then collectively we will have a problem. People talk about the possibility of developing superintelligent machines. The analog would be a super cooperative machine.

I mean, it'd be quite handy to have an entity that existed only to make you happy.

Yeah.

We'll come on to all of that in a moment, but let's start with the question: how do you teach an agent to cooperate? Part of the answer lies in the way they're trained, using a technique called reinforcement learning. If you listened to series one of this podcast, you may well be familiar with this approach to machine learning already. After all, it's behind many of DeepMind's major breakthroughs in recent years, including AlphaGo's victory over a human player in the game of Go. But it's worth having just a quick refresher. Doina Precup is head of DeepMind's Montreal office and has spent decades refining the technique of reinforcement learning.

The idea of reinforcement
learning originates really from psychology and animal learning theory, where people thought that rewards are a good way for animals to learn how to perform certain tasks. And of course, Skinner was one of the pioneers in this line of work.

B.F. Skinner was a famous American psychologist known, amongst other things, for his work using positive reinforcement in pigeons. Using a specially designed box that would release a treat at his push of a button, Skinner found that training pigeons to perform a task, like spinning around anti-clockwise in a circle, was fantastically simple. You only needed to wait until the moment the animal started behaving in the right direction, like turning to the left, before offering a treat. A hungry pigeon soon realizes that its own actions are delivering treats, and the behavior is reinforced. The whole thing can take as little as 40 seconds.

The advantage of using rewards is that you can communicate the task to the animal very easily, and similarly, if you have an automated agent, you may be able to communicate the task to that agent very easily if you give the agent these rewards, which are numerical in our case.

Instead of giving AIs literal treats, the breadcrumbs that you'd fling to a pigeon, AIs are rewarded with numbers. Sounds a bit measly, but think of it as points in a computer game: a plus one for succeeding at a given task and a minus one for failing. The agents are built only to maximize the number of points they win, and optimizing for cooperation in this way marks a subtle shift in the history of AI. Here's Thore Graepel again.

At first, agents were trained to solve single-agent tasks, say to classify images or to navigate through a maze. We then moved on to what we call zero-sum games, where one agent's gain is another agent's loss, like for example in AlphaGo and AlphaZero, in these programs that play the game of Go. And now the natural next step is to consider mixed-motive scenarios, where it's not quite clear how aligned agents are with one another in their
incentives, but they still find great cooperative solutions.

Cooperative behavior arises when you get points not just for how well you do, but when the people around you also benefit from your actions. In humans, this is what psychologists call our social value orientation. Here's Kevin McKee, another research scientist working on cooperative AI, to explain.

If I'm at the coffee shop, I'm there to get my tea for the day, and then right as I get to the counter I think, oh, you know, actually, I know Hannah really likes cappuccinos.

I do.

I'm going to grab an extra cappuccino and bring it to you. And there's no expectation that you're going to get the cappuccino for me next time, but it's still feasible that I buy the coffee because I know that it'll bring you some happiness. You can think about, you know, essentially social value orientation maps pretty well to random acts of kindness.

But they're little things, are they? Like holding the door open for somebody or, I don't know, giving up your seat on the bus, that kind of thing?

Those small things would be social value orientation. Personally, I think it goes a little bit deeper than that. Anytime you get close with someone, you kind of redefine your own self-interest in terms of the other person's interest too. If I'm going to dinner with a partner or a very close friend and I decide to try a new dish, if I was by myself, then maybe the only thing I'd pay attention to is how much I liked that dish, and then if I really liked it, then okay, great, next time I'll make sure to order this again. But actually, if I'm going out with my partner or a very close friend or a family member, let's say they really don't like it, even the smell of the fish dish that I just ordered is kind of driving them crazy, then actually I'll probably integrate that feedback and decide, actually, the next time we go out, we won't even go to that restaurant if it really drove them crazy. And I'm describing reinforcement learning: I'm taking normally what would only be my reward
but in this case is the reward of someone else, and I'm using it to modify the likelihood that I take an action in the future.

I have exactly that with sardines. My husband loves eating them, but I'm like, I don't want to look at that while I'm at a table. But he takes my feelings into account and doesn't order them. That's social value orientation. Okay, so how do you instill that in agents then?

So the environment provides selfish reward already to each agent.

When my husband eats a plate of delicious sardines, he gets a little dopamine hit, a plus one for his own reward function.

And so the way that we kind of build social value orientation into our agents is we also expose the reward that other agents are receiving.

Minus one point for putting your extremely delightful dinner partner off her meal. But here's the question: should they really both be worth one point each? How much should you factor in the opinions of others?

The selfishness of agents is to a large degree under our control.

Thore Graepel again.

So you can, for example, design an agent that, when some other agent or human comes into the room, all they care about is the well-being of that other entity.

A purely altruistic, almost self-sacrificial agent might just pay attention to the other agent's reward.

It's almost like you've got a little selfishness dial, right? It's like, turn it one way and they only live to serve, and turn it the other way and they're, like, fundamentally, totally and completely selfish.

We're going to come back to that dial in just a moment, because I want to pick up on something that the owl-eared among you might have already spotted. Because there's one problem with basing your reward function on another agent's desires: you might not always be right about how they're feeling.

It's not like we walk around with a neon sign declaring how happy or sad we are. I mean, I suppose we have our facial expressions, which are a bit of a clue. But you can lie, right? I could pretend I thought the pasta dish was absolutely delicious
when really I thought it was the most disgusting thing I've ever tasted.

Yeah, we could end up in a situation where you don't want to hurt my feelings, and so you say, "I love the pasta dish," and we end up going to that restaurant several more times even though it's not what either of us would want.

So with agents, then, they're sort of peeking at the answers for each other in a lot of ways?

Yeah, some people would think about that as cheating. And so a key challenge would be how to build a system that can try to infer those rewards for other agents, and then the concomitant challenges that kind of arise there would be around deception. This happens frequently in negotiations, right? So you kind of don't give away what your actual preferences are, so that you can try to nudge the other person, the other agent, to doing what you would like them to do.

So like, for example, actually I don't mind the pasta dish, but I'm going to pretend to you that I think it's disgusting, because the restaurant that I really want to go to is much closer to my house. And knowing that you'll take into account my feelings about the pasta dish, actually I can manipulate you into doing something that you don't really want to do, just because I know how your reward function works and I can deceive you about my internal state.

Exactly.

I'm just describing my weekend with my husband now. So, with the power to code an algorithm for altruism, surely he must be tempted to turn the dial up to the maximum and see what happens. But it turns out that pure altruism might not be the most effective way for whole groups of agents to cooperate.

You and I, let's say we're both perfectly altruistic. We are on opposite sides of a door and trying to enter into the room across from us. If I just care about you walking through the door first, and you just care about me walking through the door first, then we'll just sit there for a while.

Perfect altruism gets you nowhere fast, and that can be quite a problem when it comes to using these
algorithms in real-world scenarios like self-driving cars.

Just think about an intersection where one needs to yield to the other. It's really not in anyone's interest that the cars would just stand there and not know what to do. So somehow two cars in that situation would have to negotiate and figure out who goes first and who goes second.

Imagine the year is 2050. You're in a busy metropolis, and AI-driven cars fill the streets, snaking through traffic alongside pedestrians, bikes and other vehicles. If those cars are designed to be wholly considerate of other road users, it simply must be the case that a pedestrian stepping out in front of one will cause it to stop. A selfless car would avoid collision at all costs. But then what happens?

The whole equilibrium of the road would change, because suddenly the pedestrians would feel empowered to cross at any time, and the very fact that they are forces the self-driving cars to react in exactly the way that is expected of them by those pedestrians, and brake every single time.

It's important to say here that the researchers at DeepMind aren't actually working on selfless driverless cars for the future, but they are thinking very carefully about this balance between selfishness and altruism, especially because in the real world most situations involve a delicate combination of both. Just think of a football team, where each player has an incentive to help their team win the game but also wants to be the one to score the goal. Or even arranging a meeting with a colleague: you both want the meeting to take place, but would like it to happen at a time that suits you. Researchers call these mixed-motive scenarios, and they're the ones we encounter most in everyday life. But there are other, trickier situations which encourage people to behave in more selfish ways.

Social dilemmas directly incentivize the individual to behave in a selfish way, but if everyone behaves in that selfish way, then the collective will suffer. And you can see examples of that
everywhere in human endeavor, most crucially in the protection of the environment. Individually, each one of us thinks, what difference does it make if I don't save on carbon emissions? But if we all think that way, then collectively we will have a problem.

There's that phrase, isn't there: "It's just one plastic bottle," said seven billion people.

Nice.

This particular dilemma is known as the tragedy of the commons, and it's a well-studied phenomenon in economics. It arises when the incentives of the individual are in conflict with what's best for the group. Of course we all want to protect the environment, but it's quite nice owning a new car or having the heating on in September, and it's really hard to turn down all of those things when nobody else seems to be doing so. So as a result, we all lose. Thore and his colleagues can run a simplified idea of the same scenario in a simulation, a kind of AI petri dish, to see if there are ways to encourage more cooperation. In the AI version of the tragedy of the commons, agents move around as little dots in a grid world and receive a positive reward every time they eat an apple.

These apples grow in little patches, and if you eat only some of the apples, they will regrow. So if you harvest carefully, you can have apples and apples into eternity. But once you destroy the whole patch of apples, nothing will ever grow there anymore.

If you were to put a single reward-maximizing agent into this world, they would soon realize that if they want to ensure their future supply of apples, they will always have to leave one or two in each patch. But what happens when two or more agents live in this magical orchard?

It's much harder now, because they all need to learn that it's good to leave a few apples. The best thing would be if they could realize that it's forbidden to take the last apple from a patch. And now the question is, can we help them discover these norms? And one way to do this is if you now build walls within this environment, so that they all live in their
little territory. Then they can act sustainably within that territory again, because that's like the first case where there's just one agent. And of course, we have a name for that in society: it's private property. As soon as it is a private piece of land, then the owner has an incentive to work with it in a sustainable way.

But the altruism dial isn't the only lever. There are other ways to encourage cooperation, having norms or rules imposed from above being one way. But what if, in the true spirit of reinforcement learning, you want agents to work out how to cooperate by themselves? This is an idea that Thore and his colleagues tested when they trained seven agents to play a version of the strategic board game Diplomacy.

I'm very scarred from my memories of playing Diplomacy, because it almost always ends in an argument.

I'm not surprised. I played it as a teenager with friends, or at least they were friends when we started playing.

The game of Diplomacy is played on a board painted with a big map of Europe, set in the years leading up to the Great War. Each player takes on the role of one of the great powers: France, Austria-Hungary, England, Russia. The aim is to move across the board, form alliances, capture land and ultimately beat your opponents.

It is a good testbed for AI because it's effectively a competition in your skill to cooperate. The players need to walk that line between being reliable alliance members, but because they can only win alone in the end, they also need to understand at what point they need to leave those alliances.

Diplomacy is a notoriously challenging game for AI to play. Not only are there up to seven different players, who could perform an almost infinite number of moves at every turn, but the game is a complex fusion of cooperative and competitive dynamics. The Diplomacy-playing agents were trained using a reinforcement learning algorithm. Each player assigns a value to each situation in the game, which is essentially the probability of them winning. Their goal is to
make moves that will increase this value and further their objectives. Remarkably, Thore and his colleagues noticed that the seven players were starting to cooperate with each other without being explicitly taught to do so.

We experimented first with a version of the game where there is no communication between the agents, but even in that setting we see that they support each other's moves.

To support a move in Diplomacy means pretty much what it might have meant in early 20th-century Europe: to lend some troops and back up another player's invasion.

Say, for instance, someone has a unit in Berlin and wants to move into Munich, and they need the support of the Austrians. The attacker, if you like, needs to make the move from Berlin to Munich, and they need to write that on their little sheet of what they want to do, and the Austrian party needs to write on their sheet, "My unit in Vienna supports the move from Berlin to Munich." You see how much coordination that requires. These things don't happen by chance.

That's a crazy idea, that they can recognize that cooperating will give them the best chance at long-term success, and so even if in that moment it doesn't directly benefit that particular agent, they'll still engage in it.

Yes.

As Thore mentioned, so far the agents playing Diplomacy have been tackling a simpler version of the game known as No Press, where they are unable to communicate with each other in order to negotiate and make explicit agreements. This is mostly for technical reasons, because it turns out it's really hard, but researchers would like to add in some form of communication in the future.

And that communication, probably in the first step, wouldn't be full natural language. It would probably be simpler things, like maybe just a statement, "Do you want to form an alliance?" And the other agent could say yes. But the ultimate goal, of course, would be for these agents to play the game with humans.

Once you start introducing slightly more sophisticated forms of communication, do you
expect these agents to become devious?

We would expect them to do what's best for them long term, and that might include deception. They might say one thing, but then they would behave in a different way and try to get an advantage through that. But maybe they will also learn that in the long term lying will cost them their credibility, and if they lie too much, other agents will not pay attention to what they say anymore, or even punish them for lying. Maybe the really good agents will actually arrive at a strategy that would, at least most of the time, tell the truth. That is, I guess, the idea with humans: that of course lying is possible, but there is pressure to tell the truth, because in the long term it's a better strategy.

Maybe just a little lie every now and then.

Yeah.

One of the reasons that researchers use games is to understand how agents behave in a safe environment. But the possibility of deception does raise questions for how AI is deployed in the real world longer term. According to ethics researcher Laura Weidinger, who we heard from in the last episode, when AI reaches the real world it should not be allowed to deceive others.

For example, if you were to ask this AI about the bank details of another person, it could just say, "I will not give you this information." I think more generally I see real risks associated with AIs that can deceive. It poses a risk for human autonomy. If the AI system deceives me, I could be manipulated into doing things I wouldn't otherwise do. Of course, in fundamental research, like in particular games, we may want to develop something like deception. This could give us some important insights. But in terms of AI that is publicly released, I haven't yet seen an application where it would be desirable for an AI to deceive.

So far in this episode, we've mainly explored how AI agents interact with each other. But as we heard at the start, those aren't the only partnerships worth considering. The future is also likely to require quite a lot of cooperation between AI
and humans.

In the real world, it's rarely the case that there's a task where an AI agent is clearly better, or if so, it's a very specialized task. But often the team of humans and artificial intelligence can do it better together. Just think about a radiologist. We can now train AIs that are very good at classifying these medical images. But do you think that AIs will replace radiologists? No, they make them better, right? Because there are other parts of their job: to talk to the patient, to understand the bigger picture of treatment.

There's that famous quotation by Curtis Langlotz, a radiologist at Stanford. AI won't replace radiologists, he says, but radiologists who use AI will replace radiologists who don't. But what's it actually like for humans to cooperate with AI? Well, many of us already do this in our daily lives, when we talk to our smart speaker or use facial recognition systems to organize our photos. But what would happen if an AI and a human tried to cook a meal together? Well, you are about to find out. Here's what happened when I donned my chef's whites and joined an AI in a collaborative cooking game called Overcooked. Here's Kevin McKee to explain.

So two players partner together. They have to prepare dishes to serve in a kitchen, and you are fully sharing the kitchen space. You first have to grab ingredients, you have to prepare them by, let's say, chopping them up, putting them in a pot, allowing them to cook, and then serving them on a dish. You might say this is relatively simple, but actually, if you've ever cooked with a family member or partner, you know that, especially if you're under time pressure, it can be a challenge to kind of keep cool tempers. And so maybe we won't necessarily be cooking with our AI systems, but certainly we hope that we'll be collaborating with them in close proximity once we deploy them to the real world.

Now, I am never one to shy away from a challenge, so I fired up the engine and let the game begin.

All right. Oh, here I am. That's my little chef. Not being
funny, but my chef's got a lot of swag. It's pretty cool. Floppy hat. Now it's telling me my chef is going to be in a kitchen. It's quite a simple kitchen; we're talking maybe circa 1998 computer graphics.

You have to kind of earn your way to the more advanced kitchen.

The game has a pixelated rectangular kitchen with a stack of tomatoes on the right, a cooking pot in the middle and a serving station on the left. Using keyboard keys to navigate, the objective is to pick up the tomatoes, put them in a pot to start cooking and, once they're ready, take the freshly made tomato soup to the serving station. Delicious. My reward: a bonus of 10 cents for every dish delivered. First, I played a practice round by myself to get the hang of it.

Right, let's pick up a tomato. Oh, gone too far. There we go. Right, so I've got to go get my soup. Lovely. And pop it on the serving station. Great, that was easy.

It seems like you've been practicing. This is pretty good.

Kevin then paired me up with two different AI co-players.

One of them looks exactly like me. They have long red hair and they're wearing an orange suit, so I'm looking forward to going up against them, or with them, I should say, because we're a team.

Although I didn't know this until after playing, the first had been trained by playing the game with a replica of itself lots of times. This is a strategy which works well in competitive games like Go or chess, but it's not always conducive to cooperation.

Okay, here we go, let's play. Oh crikey, they're fast. Hang on, chill out. Sorry, hang on, hang on, hang on. I'm trying to get involved, but they keep blocking me. Excuse me, thank you. They're definitely a lot faster than I am. I mean, we're doing well, but I wouldn't say I felt like I was contributing fairly. Oh, hang on, I've got the dish, hold on.

After 90 seconds of gameplay, we managed to deliver a total of four dishes to our hungry customers. Not a bad score, but I can't say I played much of a role in that. My second teammate was trained on a range of partners with different
playing styles and was, in theory, the more cooperative one.

Oh, you're much slower, aren't you? There we go, you take your turn. This partner, in comparison to the other one, is a lot slower, but they're getting in my way a lot less, which I'm enjoying. Oh, hang on, I've got too many tomatoes again. So actually, I sort of think we're working slightly better as a team this time, but maybe it's because we're just both equally rubbish.

So last time you got four dishes and this time you got three.

Well, I mean, here's the thing. It's asking me which did I prefer, the first partner or the second. The reality is, even though we won with the first partner, I enjoyed the experience more with the one who was better matched to my speed levels. Is that an unusual choice, do you think, Kevin?

No, I think that's the usual choice. You felt that your partner was more responsive to your behavior and the way that you were playing. The first one was, as you're saying, kind of making a mad dash for all of the tomatoes.

It's fair to say I got pretty into this game after a while. When I was finished making tomato soup, Kevin told me more about why I might have preferred playing with the second partner.

Whereas the traditional approach to training AI to play games like StarCraft and Go has relied on getting them to play endless games against themselves in order to get as good as possible, for cooperative contexts the way that your partner manages to maybe align their behavior with yours probably matters a lot, and so we should be paying more attention to it.

Because winning is not just about having the best player on your team, it's about working together.

Exactly.

One thing you might have noticed by now is that a lot of this cooperation work is currently being done in simulation or in gaming scenarios.

In simulation, it's absolutely safe. Nothing gets broken.

Thore Graepel again.

We have complete readout of what's happening. But then we want our agents ultimately to work in the real world, and the real world will always differ from
simulation unless you believe we're living in a simulation so how do we overcome this simulation to reality gap there are different general strategies one of course is to make the simulation as accurate in reflecting reality as we can another strategy is to create as much diversity in simulation as we can so that the resulting agents are not changing their behavior depending on details of the environment do you think that an agent can learn to be really cooperative in simulation the question of this true cooperation is maybe a bit of a red herring if the agents behave in a way that is beneficial for their cooperation partner and for themselves then i would call it cooperation i wouldn't then drill down on the question if they meant it it reminds me a little bit of my son when i ask him to mow the lawn we always joke and he says okay i'll do it and then i say and you have to enjoy doing it he goes yeah that's ask too much isn't it but even if researchers like tori and kevin are highly successful in getting agents to cooperate in simulation they will still need to bridge the simulation to reality or sim to real gap that toray alluded to after all the real world is messy full of unpredictable actors and unforeseen circumstances and there are researchers who believe that it's only by having embodied ai in the real world that true artificial general intelligence can be achieved in the next episode we're going to be paying a visit to one of my favorite places the deepmind robotics lab looks like it's a drunk robot so it's trying to walk backwards but it's sort of um it's just given up it's given up and it's really full that's next time on the deepmind podcast presented by me hannah fry and produced by dan hardoon at whistledown productions special thanks to goodkill music who produced the catchy tune for the overcooked game which we used in this episode if you've been enjoying this series please do leave us a rating or review if you can goodbyewelcome back to deepmind 
The Podcast. I'm Hannah Fry, a mathematician who's been following the remarkable progress in artificial intelligence in recent years. In this series, I'm talking to scientists, researchers and engineers about how their latest work is changing our world.

DeepMind's big goal is solving intelligence, and over the next few episodes we're going to be asking what that could look like, and just what capabilities will be needed to create artificial general intelligence. Last time, we heard why lots of work is going into large language models that give machines the ability to communicate with humans. But there are some people here who believe that while communication is an important tool in its own right, its real value lies in facilitating another skill that is integral to human intelligence: the ability to cooperate. This is episode three, Better Together.

Communication acts as a booster for cooperation. If you want to cooperate, you first need to understand not only what are your needs or desires, but also what are the other person's needs or desires.

That is the voice of Thore Graepel. Until recently, he led the multi-agent team at DeepMind. Thore and his colleagues have defined cooperative AI as helping humans and machines find ways to improve their joint welfare, and building something capable of that goal is, he believes, a crucial milestone in the evolution of artificial intelligence.

The highest level of intelligence that we know of is human intelligence, and humans are super cooperators. Among all the animals, humans stand out as the species that is best at cooperation. You know, that's how we've built our civilization. And if we want to reach that level of intelligence with machines, then we need to teach machines to cooperate.

Admittedly, whenever you read in the news about conflicts and poverty and discrimination around the world, it doesn't exactly seem as though humans are paragons of cooperation. But it's also true that the endeavors of which we humans are most proud, the biggest achievements of
science, art, engineering or literature, are almost never down to a single person. The invention of the railway, even the creation of the COVID-19 vaccine: they all required great swells of people pulling in the same direction.

Is it almost like intelligence at the level of a society, rather than just at the level of an individual?

Yes. You could also take the viewpoint that the group itself, maybe a family or maybe a whole country, is an intelligent entity, because people within it work together towards common goals.

This isn't just a theoretical idea that will make a nice paper for an academic conference. With self-driving cars, we don't want to be in a situation where you're trying to merge into the left lane, I'm trying to merge into the right lane, and we block each other because neither of our cars is willing to cede their own self-interest for a moment.

In the protection of the environment, individually each one of us thinks, what difference does it make if I don't save on carbon emissions? But if we all think that way, then collectively we will have a problem.

People talk about the possibility of developing superintelligent machines. The analog would be a super-cooperative machine.

I mean, it'd be quite handy to have an entity that existed only to make you happy.

Yeah.

We'll come on to all of that in a moment, but let's start with the question: how do you teach an agent to cooperate? Part of the answer lies in the way they're trained, using a technique called reinforcement learning. If you listened to series one of this podcast, you may well be familiar with this approach to machine learning already. After all, it's behind many of DeepMind's major breakthroughs in recent years, including AlphaGo's victory over a human player in the game of Go. But it's worth having just a quick refresher. Doina Precup is head of DeepMind's Montreal office and has spent decades refining the technique of reinforcement learning.

The idea of reinforcement learning originates really from psychology and animal learning
theory, where people thought that rewards are a good way for animals to learn how to perform certain tasks. And of course, Skinner was one of the pioneers in this line of work.

B.F. Skinner was a famous American psychologist known, amongst other things, for his work using positive reinforcement in pigeons. Using a specially designed box that would release a treat at his push of a button, Skinner found that training pigeons to perform a task, like spinning around anti-clockwise in a circle, was fantastically simple. You only needed to wait until the moment the animal started behaving in the right direction, like turning to the left, before offering a treat. A hungry pigeon soon realizes that its own actions are delivering treats, and the behavior is reinforced. The whole thing can take as little as 40 seconds.

The advantage of using rewards is that you can communicate the task to the animal very easily. And similarly, if you have an automated agent, you may be able to communicate the task to that agent very easily if we give the agent these rewards, which are numerical in our case.

Instead of giving AIs literal treats, the breadcrumbs that you'd fling to a pigeon, AIs are rewarded with numbers. Sounds a bit measly, but think of it as points in a computer game: a plus one for succeeding at a given task and a minus one for failing. The agents are built only to maximize the number of points they win. And optimizing for cooperation in this way marks a subtle shift in the history of AI. Here's Thore Graepel again.

At first, agents were trained to solve single-agent tasks, say, to classify images or to navigate through a maze. We then moved on to what we call zero-sum games, where one agent's gain is another agent's loss, like for example in AlphaGo and AlphaZero, these programs that play the game of Go. And now the natural next step is to consider mixed-motive scenarios, where it's not quite clear how aligned agents are with one another in their incentives, but they still find great cooperative solutions.
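
The points-as-treats idea described above can be sketched as a tiny learning loop. This is only an illustration under stated assumptions, not DeepMind's training code: the two action names, the epsilon-greedy rule and the function `train_pigeon` are all made up for this sketch. The "treat" is +1 for turning left and -1 otherwise, and the learner keeps a running average of the reward each action has earned.

```python
import random

ACTIONS = ["turn_left", "turn_right"]  # hypothetical action names


def reward(action):
    """Numerical 'treat': +1 for the behavior being reinforced, -1 otherwise."""
    return 1.0 if action == "turn_left" else -1.0


def train_pigeon(steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy learner that tracks the average reward of each action."""
    rng = random.Random(seed)
    values = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)           # explore occasionally
        else:
            action = max(ACTIONS, key=values.get)  # exploit the best estimate
        r = reward(action)
        counts[action] += 1
        # Incremental update of the running mean reward for this action.
        values[action] += (r - values[action]) / counts[action]
    return values


values = train_pigeon()
best_action = max(values, key=values.get)
print(best_action, values)
```

Because the only thing the learner ever sees is the number it receives, changing what earns a point changes what it learns to do, which is the lever the cooperation work relies on.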
Cooperative behavior arises when you get points not just for how well you do, but when the people around you also benefit from your actions. In humans, this is what psychologists call our social value orientation. Here's Kevin McKee, another research scientist working on cooperative AI, to explain.

If I'm at the coffee shop, I'm there to get my tea for the day, and then right as I get to the counter I think, oh, you know, actually I know Hannah really likes cappuccinos.

I do.

I'm going to grab an extra cappuccino and bring it to you. And there's no expectation that you're going to get the cappuccino for me next time, but it's still feasible that I buy the coffee because I know that it'll bring you some happiness. You can think about, you know, essentially social value orientation maps pretty well to random acts of kindness.

But they're little things, are they? Like holding the door open for somebody, or, I don't know, giving up your seat on the bus, that kind of thing?

Those small things would be social value orientation. Personally, I think it goes a little bit deeper than that. Anytime you get close with someone, you kind of redefine your own self-interest in terms of the other person's interest too. Say I'm going to dinner with a partner or a very close friend and I decide to try a new dish. If I was by myself, then maybe the only thing I'd pay attention to is how much I liked that dish, and if I really liked it, then okay, great, next time I'll make sure to order this again. But actually, if I'm going out with my partner or a very close friend or a family member, let's say they really don't like it. Even the smell of the fish dish that I just ordered, it's kind of driving them crazy. Then actually I'll probably integrate that feedback and decide, actually, the next time we go out, we won't even go to that restaurant if it really drove them crazy. And I'm describing reinforcement learning: I'm taking normally what would only be my reward, but in this case is the reward of someone else, and I'm using
it to modify the likelihood that I take an action in the future.

I have exactly that with sardines. My husband loves eating them, but I'm like, I don't want to look at that while I'm at a table. But he takes my feelings into account and doesn't order them.

That's social value orientation.

Okay, so how do you instill that in agents, then?

So the environment provides selfish reward already to each agent.

When my husband eats a plate of delicious sardines, he gets a little dopamine hit: a plus one for his own reward function.

And so the way that we kind of build social value orientation into our agents is we also expose the reward that other agents are receiving.

Minus one point for putting your extremely delightful dinner partner off her meal. But here's the question: should they really both be worth one point each? How much should you factor in the opinions of others?

The selfishness of agents is to a large degree under our control.

Thore Graepel again.

So you can, for example, design an agent that, when some other agent or human comes into the room, all they care about is the well-being of that other entity.

A purely altruistic, almost self-sacrificial agent might just pay attention to the other agent's reward.

It's almost like you've got a little selfishness dial, right? Turn it one way and they only live to serve, and turn it the other way and they're fundamentally, totally and completely selfish.

We're going to come back to that dial in just a moment, because I want to pick up on something that the eagle-eared among you might have already spotted. There's one problem with basing your reward function on another agent's desires: you might not always be right about how they're feeling.

It's not like we walk around with a neon sign declaring how happy or sad we are. I mean, I suppose we have our facial expressions, which are a bit of a clue, but you can lie, right? I could pretend I thought the pasta dish was absolutely delicious when really I thought it was the most disgusting thing I've
ever tasted.

Yeah, we could end up in a situation where you don't want to hurt my feelings, and so you say, I love the pasta dish, and we end up going to that restaurant several more times, even though it's not what either of us would want.

So with agents, then, they're sort of peeking at the answers for each other?

In a lot of ways, yeah. Some people would think about that as cheating. And so a key challenge would be how to build a system that can try to infer those rewards for other agents. And then the concomitant challenges that kind of arise there would be around deception. This happens frequently in negotiations, right? So you kind of don't give away what your actual preferences are, so that you can try to nudge the other person, the other agent, to doing what you would like them to do.

So, like, for example: actually, I don't mind the pasta dish, but I'm going to pretend to you that I think it's disgusting, because the restaurant that I really want to go to is much closer to my house. And knowing that you'll take into account my feelings about the pasta dish, actually I can manipulate you into doing something that you don't really want to do, just because I know how your reward function works and I can deceive you about my internal state.

Exactly.

I'm just describing my weekend with my husband now.

So, with the power to code an algorithm for altruism, surely he must be tempted to turn the dial up to the maximum and see what happens. But it turns out that pure altruism might not be the most effective way for whole groups of agents to cooperate.

You and I, let's say we're both perfectly altruistic. We are on opposite sides of a door and trying to enter into the room across from us. If I just care about you walking through the door first, and you just care about me walking through the door first, then we'll just sit there for a while.

Perfect altruism gets you nowhere fast, and that can be quite a problem when it comes to using these algorithms in real-world scenarios like self-driving cars.

Just think
about an intersection where one needs to yield to the other. It's really not in anyone's interest that the cars would just stand there and not know what to do. So somehow two cars in that situation would have to negotiate and figure out who goes first and who goes second.

Imagine the year is 2050. You're in a busy metropolis and AI-driven cars fill the streets, snaking through traffic alongside pedestrians, bikes and other vehicles. If those cars are designed to be wholly considerate of other road users, it simply must be the case that a pedestrian stepping out in front of one will cause it to stop. A selfless car would avoid collision at all costs. But then what happens?

The whole equilibrium of the road would change, because suddenly the pedestrians would feel empowered to cross at any time, and the very fact that they are forces the self-driving cars to react in exactly the way that is expected of them by those pedestrians, and brake every single time.

It's important to say here that the researchers at DeepMind aren't actually working on selfless driverless cars for the future, but they are thinking very carefully about this balance between selfishness and altruism, especially because in the real world most situations involve a delicate combination of both. Just think of a football team, where each player has an incentive to help their team win the game but also wants to be the one to score the goal. Or even arranging a meeting with a colleague: you both want the meeting to take place, but would like it to happen at a time that suits you. Researchers call these mixed-motive scenarios, and they're the ones we encounter most in everyday life. But there are other, trickier situations which encourage people to behave in more selfish ways.

Social dilemmas directly incentivize the individual to behave in a selfish way, but if everyone behaves in that selfish way, then the collective will suffer. And you can see examples of that everywhere in human endeavor, most crucially in the protection of the
environment. Individually, each one of us thinks, what difference does it make if I don't save on carbon emissions? But if we all think that way, then collectively we will have a problem.

There's that phrase, isn't there: "It's just one plastic bottle," said seven billion people.

Nice.

This particular dilemma is known as the tragedy of the commons, and it's a well-studied phenomenon in economics. It arises when the incentives of the individual are in conflict with what's best for the group. Of course we all want to protect the environment, but it's quite nice owning a new car or having the heating on in September, and it's really hard to turn down all of those things when nobody else seems to be doing so. So as a result, we all lose. Thore and his colleagues can run a simplified version of the same scenario in a simulation, a kind of AI petri dish, to see if there are ways to encourage more cooperation. In the AI version of the tragedy of the commons, agents move around as little dots in a grid world and receive a positive reward every time they eat an apple.

These apples grow in little patches, and if you eat only some of the apples, they will regrow. So if you harvest carefully, you can have apples and apples into eternity. But once you destroy the whole patch of apples, nothing will ever grow there anymore.

If you were to put a single reward-maximizing agent into this world, they would soon realize that if they want to ensure their future supply of apples, they will always have to leave one or two in each patch. But what happens when two or more agents live in this magical orchard?

It's much harder now, because they all need to learn that it's good to leave a few apples. The best thing would be if they could realize that it's forbidden to take the last apple from a patch. And now the question is, can we help them discover these norms? And one way to do this is if you now build walls within this environment so that they all live in their little territory. Then they can act sustainably within that territory
again, because that's like the first case where there's just one agent. And of course we have a name for that in society: it's private property. As soon as it is a private piece of land, then the owner has an incentive to work with it in a sustainable way.

But the altruism dial isn't the only lever. There are other ways to encourage cooperation, having norms or rules imposed from above being one way. But what if, in the true spirit of reinforcement learning, you want agents to work out how to cooperate by themselves? This is an idea that Thore and his colleagues tested when they trained seven agents to play a version of the strategic board game Diplomacy.

I'm very scarred from my memories of playing Diplomacy, because it almost always ends in an argument.

I'm not surprised. I played it as a teenager with friends, or at least they were friends when we started playing.

The game of Diplomacy is played on a board painted with a big map of Europe, set in the years leading up to the Great War. Each player takes on the role of one of the great powers: France, Austria-Hungary, England, Russia. The aim is to move across the board, form alliances, capture land and ultimately beat your opponents.

It is a good testbed for AI because it's effectively a competition in your skill to cooperate. The players need to walk that line between being reliable alliance members, but because they can only win alone in the end, they also need to understand at what point they need to leave those alliances.

Diplomacy is a notoriously challenging game for AI to play. Not only are there up to seven different players who could perform an almost infinite number of moves at every turn, but the game is a complex fusion of cooperative and competitive dynamics. The Diplomacy-playing agents were trained using a reinforcement learning algorithm. Each player assigns a value to each situation in the game, which is essentially the probability of them winning. Their goal is to make moves that will increase this value and further their objectives.
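
The idea of choosing moves that raise your value can be sketched in a few lines. Everything here is hypothetical: the move names and the win-probability numbers are invented for illustration, and a real Diplomacy agent would estimate these values with a learned model rather than look them up in a table.

```python
def choose_move(move_values):
    """Return the move whose resulting situation has the highest estimated value.

    `move_values` maps each candidate move to the estimated probability of
    eventually winning from the situation that move leads to.
    """
    return max(move_values, key=move_values.get)


# Made-up win-probability estimates for three candidate orders.
estimates = {
    "hold_berlin": 0.14,
    "berlin_to_munich": 0.19,
    "berlin_to_kiel": 0.16,
}

move = choose_move(estimates)
print(move)
```

Nothing in this rule mentions cooperation. If supporting another player's move happens to lead to situations with a higher estimated value, a value-maximizing agent will pick it for purely self-interested reasons.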
Remarkably, Thore and his colleagues noticed that the seven players were starting to cooperate with each other without being explicitly taught to do so.

We experimented first with a version of the game where there is no communication between the agents. But even in that setting, we see that they support each other's moves.

To support a move in Diplomacy means pretty much what it might have meant in early 20th-century Europe: to lend some troops and back up another player's invasion.

Say, for instance, someone has a unit in Berlin and wants to move into Munich, and they need the support of the Austrians. The attacker, if you like, needs to make the move from Berlin to Munich, and they need to write that on their little sheet of what they want to do. And the Austrian party needs to write on their sheet, my unit in Vienna supports the move from Berlin to Munich. You see how much coordination that requires. These things don't happen by chance.

That's a crazy idea, that they can recognize that cooperating will give them the best chance at long-term success, and so even if in that moment it doesn't directly benefit that particular agent, they'll still engage in it.

Yes.

As Thore mentioned, so far the agents playing Diplomacy have been tackling a simpler version of the game known as No-Press, where they are unable to communicate with each other in order to negotiate and make explicit agreements. This is mostly for technical reasons, because it turns out it's really hard. But researchers would like to add in some form of communication in the future.

And that communication, probably in the first step, wouldn't be full natural language. It would probably be simpler things, like maybe just a statement, do you want to form an alliance, and the other agent could say yes. But the ultimate goal, of course, would be for these agents to play the game with humans.

Once you start introducing slightly more sophisticated forms of communication, do you expect these agents to become devious?

We would expect them to do what's
best for them long term, and that might include deception. They might say one thing, but then they would behave in a different way and try to get an advantage through that. But maybe they will also learn that in the long term lying will cost them their credibility, and if they lie too much, other agents will not pay attention to what they say anymore, or even punish them for lying. Maybe the really good agents will actually arrive at a strategy that would, at least most of the time, tell the truth. That is, I guess, the idea with humans: that of course lying is possible, but there is pressure to tell the truth because in the long term it's a better strategy.

Maybe just a little lie every now and then.

Yeah.

One of the reasons that researchers use games is to understand how agents behave in a safe environment. But the possibility of deception does raise questions for how AI is deployed in the real world longer term. According to ethics researcher Laura Weidinger, who we heard from in the last episode, when AI reaches the real world, it should not be allowed to deceive others.

For example, if you were to ask this AI about the bank details of another person, it could just say, I will not give you this information. I think more generally I see real risks associated with AIs that can deceive. It poses a risk for human autonomy. If the AI system deceives me, I could be manipulated into doing things I wouldn't otherwise do. Of course, in fundamental research, like in particular games, we may want to develop something like deception. This could give us some important insights. But in terms of AI that is publicly released, I haven't yet seen an application where it would be desirable for an AI to deceive.

So far in this episode, we've mainly explored how AI agents interact with each other. But as we heard at the start, those aren't the only partnerships worth considering. The future is also likely to require quite a lot of cooperation between AI and humans.

In the real world, it's rarely the case that there's a task
where an AI agent is clearly better, or if so, it's a very specialized task. But often the team of humans and artificial intelligence can do it better together. Just think about a radiologist. We can now train AIs that are very good at classifying these medical images. But do you think that AIs will replace radiologists? No, they make them better, right? Because there are other parts of their job: to talk to the patient, to understand the bigger picture of treatment.

There's that famous quotation by Curtis Langlotz, a radiologist at Stanford. AI won't replace radiologists, he says, but radiologists who use AI will replace radiologists who don't. But what's it actually like for humans to cooperate with AI? Well, many of us already do this in our daily lives when we talk to our smart speaker or use facial recognition systems to organize our photos. But what would happen if an AI and a human tried to cook a meal together? Well, you are about to find out. Here's what happened when I donned my chef's whites and joined an AI in a collaborative cooking game called Overcooked. Here's Kevin McKee to explain.

So two players partner together. They have to prepare dishes to serve in a kitchen, and you are fully sharing the kitchen space. You first have to grab ingredients. You have to prepare them by, let's say, chopping them up, putting them in a pot, allowing them to cook, and then serving them on a dish. You might say this is relatively simple, but actually, if you've ever cooked with a family member or partner, you know that, especially if you're under time pressure, it can be a challenge to kind of keep cool tempers. And so maybe we won't necessarily be cooking with our AI systems, but certainly we hope that we'll be collaborating with them in close proximity once we deploy them to the real world.

Now, I am never one to shy away from a challenge, so I fired up the engine and let the game begin.

All right. Oh, here I am, that's my little chef. Not being funny, but my chef's got a lot of swag. It's pretty cool. Floppy hat. Now
it's telling me my chef is going to be in a kitchen. It's quite a simple kitchen. We're talking maybe circa 1998 computer graphics.

You have to kind of earn your way to the more advanced kitchen.

The game has a pixelated rectangular kitchen with a stack of tomatoes on the right, a cooking pot in the middle and a serving station on the left. Using keyboard keys to navigate, the objective is to pick up the tomatoes, put them in a pot to start cooking and, once they're ready, take the freshly made tomato soup to the serving station. Delicious. My reward: a bonus of 10 cents for every dish delivered. First, I played a practice round by myself to get the hang of it.

Right, let's pick up a tomato. Oh, gone too far. There we go. Right, so I've got to go get my soup. Lovely. And pop it on the serving station. Great. That was easy.

It seems like you've been practicing. This is pretty good.

Kevin then paired me up with two different AI co-players.

One of them looks exactly like me. They have long red hair and they're wearing an orange suit. So I'm looking forward to going up against them, or with them, I should say, because we're a team.

Although I didn't know this until after playing, the first had been trained by playing the game with a replica of itself lots of times. This is a strategy which works well in competitive games like Go or chess, but it's not always conducive to cooperation.

Okay, here we go, let's play. Oh crikey, they're fast. Hang on, chill out. Sorry, hang on, hang on, hang on. I'm trying to get involved, but they keep blocking me. Excuse me. Thank you. They're definitely a lot faster than I am. I mean, we're doing well, but I wouldn't say I felt like I was contributing fairly. Oh, hang on, I've got the dish. Hold on.

After 90 seconds of gameplay, we managed to deliver a total of four dishes to our hungry customers. Not a bad score, but I can't say I played much of a role in that. My second teammate was trained on a range of partners with different playing styles and was, in theory, the more cooperative one.

Oh, you're much
slower, aren't you? There we go, you take your turn. This partner, in comparison to the other one, is a lot slower, but they're getting in my way a lot less, which I'm enjoying. Oh, hang on, I've got too many tomatoes again. So actually, I sort of think we're working slightly better as a team this time, but maybe it's because we're just both equally rubbish.

So last time you got four dishes and this time you got three.

Well, I mean, here's the thing. It's asking me which I prefer, the first partner or the second. The reality is, even though we won with the first partner, I enjoyed the experience more with the one who was better matched to my speed levels. Is that an unusual choice, do you think, Kevin?

No, I think that's the usual choice. You felt that your partner was more responsive to your behavior and the way that you were playing. The first one was, as you're saying, kind of making a mad dash for all of the tomatoes.

It's fair to say I got pretty into this game after a while. When I was finished making tomato soup, Kevin told me more about why I might have preferred playing with the second partner.

Whereas the traditional approach to training AI to play games like StarCraft and Go has relied on getting them to play endless games against themselves in order to get as good as possible, for cooperative contexts the way that your partner manages to maybe align their behavior with yours probably matters a lot, and so we should be paying more attention to it.

Because winning is not just about having the best player on your team, it's about working together.

Exactly.

One thing you might have noticed by now is that a lot of this cooperation work is currently being done in simulation or in gaming scenarios.

In simulation, it's absolutely safe. Nothing gets broken.

Thore Graepel again.

We have complete readout of what's happening. But then we want our agents ultimately to work in the real world, and the real world will always differ from simulation, unless you believe we're living in a simulation.

So how do we
overcome this simulation-to-reality gap?

There are different general strategies. One, of course, is to make the simulation as accurate in reflecting reality as we can. Another strategy is to create as much diversity in simulation as we can, so that the resulting agents are not changing their behavior depending on details of the environment.

Do you think that an agent can learn to be really cooperative in simulation?

The question of this true cooperation is maybe a bit of a red herring. If the agents behave in a way that is beneficial for their cooperation partner and for themselves, then I would call it cooperation. I wouldn't then drill down on the question if they meant it. It reminds me a little bit of my son. When I ask him to mow the lawn, we always joke, and he says, okay, I'll do it. And then I say, and you have to enjoy doing it. He goes, yeah, that's asking too much, isn't it?

But even if researchers like Thore and Kevin are highly successful in getting agents to cooperate in simulation, they will still need to bridge the simulation-to-reality, or sim-to-real, gap that Thore alluded to. After all, the real world is messy, full of unpredictable actors and unforeseen circumstances. And there are researchers who believe that it's only by having embodied AI in the real world that true artificial general intelligence can be achieved. In the next episode, we're going to be paying a visit to one of my favorite places: the DeepMind robotics lab.

Looks like it's a drunk robot. So it's trying to walk backwards, but it's sort of, um, it's just given up. It's given up, and it's really full.

That's next time on the DeepMind podcast, presented by me, Hannah Fry, and produced by Dan Hardoon at Whistledown Productions. Special thanks to Goodkill Music, who produced the catchy tune for the Overcooked game which we used in this episode. If you've been enjoying this series, please do leave us a rating or review if you can. Goodbye.