**The Power of Monte Carlo Tree Search**
In recent years, Monte Carlo Tree Search (MCTS) has emerged as a powerful algorithm for decision making and planning in complex environments. It is particularly well suited to problems with very large state and action spaces, where the goal is to select a good action in the current state without exhaustively enumerating the alternatives. In this article, we will explore how MCTS works, its strengths and weaknesses, and some of its applications.
**How MCTS Works**
At its core, MCTS is an anytime algorithm that combines sampling and search to estimate action values in the current state, given a model and a simulation policy. The process begins with an empty tree whose root is the current state. Since the root is initially a leaf, a default policy is used to roll out from it until the episode terminates; the return of this rollout gives a first estimate of the root's value, which from then on is maintained as the average return of all simulations that have passed through it.
From there, the algorithm expands one node per simulation. To decide which node to expand, it descends the tree using what it has learned in previous simulations; if no such knowledge is available yet, it simply picks an action at random. From the newly expanded leaf, another rollout with the default policy produces an evaluation of the new state, and this evaluation is backed up to update not just the newly expanded node but all of its ancestors, up to and including the root.
**Expanding Actions**
As we navigate through the tree, each simulation expands one new node and then rolls out from it until the end of the episode. For example, if we pick the left action and reach a new leaf, we perform a rollout to evaluate that state, and then back up the result so that the newly expanded node and all of its ancestors carry up-to-date values.
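To make these steps concrete, here is a minimal sketch of the loop in Python. It is not a production implementation: it assumes a hypothetical generative model exposing `model.actions(state)` and `model.step(state, action) -> (next_state, reward, done)`, caches one successor per tree edge (so it implicitly assumes deterministic transitions), ignores discounting, and uses a uniformly random default policy for rollouts.

```python
import random

class Node:
    """One node of the search tree, holding running statistics for its state."""
    def __init__(self, state, reward=0.0, terminal=False):
        self.state = state
        self.reward = reward      # immediate reward observed when this node was first reached
        self.terminal = terminal
        self.children = {}        # action -> Node
        self.visits = 0
        self.value = 0.0          # mean return of the simulations that passed through this node

def rollout(model, state, max_steps=200):
    """Default policy: act uniformly at random until termination, return the total reward."""
    total, done, steps = 0.0, False, 0
    while not done and steps < max_steps:
        actions = model.actions(state)
        if not actions:
            break
        state, reward, done = model.step(state, random.choice(actions))
        total += reward
        steps += 1
    return total

def simulate(model, root):
    """One MCTS iteration: select, expand one node, evaluate with a rollout, back up."""
    node, path = root, [root]

    # Selection: descend greedily on current value estimates while nodes are fully expanded.
    while not node.terminal and node.children and \
            len(node.children) == len(model.actions(node.state)):
        node = max(node.children.values(), key=lambda c: c.reward + c.value)
        path.append(node)

    # Expansion + evaluation: add one child for an untried action and roll out from it.
    if node.terminal:
        ret = 0.0  # nothing more to collect from a terminal state
    else:
        action = random.choice([a for a in model.actions(node.state)
                                if a not in node.children])
        next_state, reward, done = model.step(node.state, action)
        child = Node(next_state, reward=reward, terminal=done)
        node.children[action] = child
        path.append(child)
        ret = 0.0 if done else rollout(model, next_state)

    # Backup: update the new node and all its ancestors with the observed return.
    for n in reversed(path):
        n.visits += 1
        n.value += (ret - n.value) / n.visits
        ret += n.reward   # the parent's return also includes the reward for reaching n

def plan(model, root_state, num_simulations=1000):
    """Run MCTS from root_state and return the action with the highest estimated return."""
    root = Node(root_state)
    for _ in range(num_simulations):
        simulate(model, root)
    return max(root.children.items(), key=lambda kv: kv[1].reward + kv[1].value)[0]
```

Each call to `simulate` performs exactly one select–expand–rollout–backup cycle, so the tree grows by one node per simulation, mirroring the walk-through above.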
**Sampling and Best-First Search**
A key feature of MCTS is how effectively it samples states. A tree policy, informed by the value estimates accumulated over previous simulations, is used to descend from the root to a leaf, while a cheap, fixed rollout policy (even a random one can be surprisingly effective) evaluates states beyond the tree. The result is a highly selective best-first search in which states are evaluated dynamically based on their estimated value.
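In the sketch above the tree policy is purely greedy with respect to the current value estimates. In practice, implementations usually add an exploration bonus so that rarely tried children are still revisited; the UCB1-style score used by the common UCT variant is one such choice. This is a widely used refinement rather than part of the description above, and `uct_score` is a hypothetical helper:

```python
import math

def uct_score(child, parent_visits, c=1.4):
    """Mean value plus an exploration bonus that shrinks as the child is visited more."""
    if child.visits == 0:
        return float("inf")           # always try unvisited children first
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.reward + child.value + bonus

# In simulate(), the greedy selection line could then become, for example:
#   node = max(node.children.values(), key=lambda ch: uct_score(ch, node.visits))
```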
This sampling process is essential for breaking the curse of dimensionality: the number of states that would have to be considered grows explosively with the depth of the search and the size of the action space. Because MCTS only evaluates and stores values for the states it actually reaches during its simulations, it avoids having to hold values for all possible states, making it far more efficient than an exhaustive table-lookup approach.
**Limitations and Applications**
While MCTS is a powerful algorithm, it has limitations. For very large search spaces, the tree itself can grow too large to fit in memory, making the method difficult to use effectively. In addition, the quality of its estimates depends on the accuracy of the (typically learned) model it simulates with and on how informative the rollout policy is, both of which can be hard to guarantee in practice.
Despite these limitations, MCTS has been applied successfully in a variety of domains, including game playing and robotics. The most famous example is AlphaGo, the Go-playing system that reached world-champion level: it combines MCTS with learned value functions and policies to guide the search, and in later iterations even replaces the Monte Carlo rollouts with a learned value function evaluated at the leaves. This lets it explore the vast space of possible moves efficiently and select strong actions.
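As a rough illustration of that combination, the rollout evaluation in the earlier sketch can be swapped for a single call to a learned value function at the newly expanded leaf. Here `value_net` is a hypothetical learned callable standing in for such a function; this is a sketch of the idea, not AlphaGo's actual architecture:

```python
def evaluate_leaf(model, state, value_net=None):
    """Evaluate a newly expanded leaf with a random rollout or, if a learned value
    function is available, with a single network call (value_net is hypothetical)."""
    if value_net is not None:
        return value_net(state)       # learned estimate of the return from this state
    return rollout(model, state)      # fall back to the Monte Carlo rollout

# In simulate(), the line
#   ret = 0.0 if done else rollout(model, next_state)
# would become
#   ret = 0.0 if done else evaluate_leaf(model, next_state, value_net)
```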
**Monte Carlo Tree Search vs. Table Lookup**
It's worth contrasting MCTS with table lookup approaches. For learning a global value function with model-free reinforcement learning, a table is naive: we would need to store a value for every state, and we get no generalization to states we have never visited. For simulation-based search, however, a table is much less naive, because we only need to store values for the easily reachable states — those that are likely under the tree policy and our current Q-value estimates.
In this sense, MCTS still instantiates a table, but only a partial one covering the states encountered during search, which is far more practical. For very large search spaces even this partial table can grow substantially, which is why ideas from MCTS are often combined with value function approximation, as AlphaGo does.
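To make the "partial table" point concrete, the tree built by the earlier sketch only allocates a node for each state actually reached during search, so its size is bounded by the number of simulations we chose to run, whereas a full table needs an entry for every state–action pair of the MDP up front. A small illustrative helper (the problem sizes in the comment are made up):

```python
def tree_size(node):
    """Number of nodes actually stored in the search tree (the partial table)."""
    return 1 + sum(tree_size(child) for child in node.children.values())

# A full tabular representation would need |S| * |A| entries up front — e.g. for a
# hypothetical problem with 10**9 states and 10 actions, that is 10**10 entries —
# whereas tree_size(root) is at most num_simulations + 1 in the sketch above,
# because each simulation expands exactly one new node.
```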
"WEBVTTKind: captionsLanguage: enhello i'm matthew hassell and i'll be teaching the reinforcement learning module together with hado and diana and today i'll be talking about models and planning in the past few lectures we already covered quite a bit of material so we started with the bandits this minimal instantiation of the rl problem where we are allowed to disregard the sequential nature of decision making and focus exclusively on trading off exploiting existing knowledge versus exploring alternative courses of actions that might seem so optimal now but that might turn out to be even better as we gain experience in the environment and the world around us we then discuss dynamic programming and model-free algorithms as to families of ideas and approaches that allow us to solve both prediction and control problems with and without access to a complete specification of the environment finally we introduce function approximation as the leading approach to scaling up these families of ideas to settings where we might have very large state spaces or even continuous state spaces if we look back at dynamic programming and model 3 algorithms we can roughly sketch the underlying principles and differences between the two in the following way so in dynamic programming we assume we're given access a privileged access to a complete exact specification of the environment dynamics that includes both the state transition and the reward dynamics we then just need to solve for this perfect model and we don't need to interact with the real environment at all in model 3 algorithms instead we do not rely on any given form of model instead we learn values and policies directly from interaction with the environment in between these two extremes there is a third the prototypical family of algorithms which is typically referred to as model-based reinforcement learning here we are not given a perfect model but we do attempt to learn at least an approximate model but while interacting with the environment and then we use this learned model to derive values and policies a visual depiction of the model-free algorithm would be seen as implementing this loop where we act according to our current value estimates and our policies to generate experience but then we use this experience to directly update our values and our policies and instead the model based algorithm would add a step of indirection in this loop whereby we act according to our current value estimates and policies just like in model 3rl but we use the resulting experience to learn a model of the state transition and the reward dynamics then we can use the learn model instead of the experience directly to update the values and policies thereby closing still the loop and of course boundaries between these two families of algorithms don't need to be extremely rigid and we can perfectly imagine to combine the two ideas we use experience to both update values and to learn the model we can still then use the model to perform additional updates and to refine our value estimates and today we will actually discuss the popular instantiation of this idea which is the dyna algorithm but whenever we delve into a new topic i think it's always good to stop a second and reflect on the motivations that underlie a family of ideas for instance there is obvious disadvantages to this line of attack if we first learn a model then we use the model to plan a value function there are two sources of errors because whatever model we learn from data is likely not to be perfect it 
will incorporate certain approximations or certain errors and these will compound with additional errors that arise when planning with the model to estimate values in contrast if you consider model 3 algorithms they use real data that is always correct to update the values and therefore there is only one source of approximation which is the value estimation process itself despite this though there are reasons to believe that model-based reinforcement learning could could be quite effective the first of these reasons is that models can be learned with supervised learning methods which are very well understood and very powerful and effective additionally explicitly representing models can give us access to additional capabilities we could use them to represent uncertainty or to drive exploration and finally but not less importantly in many real world settings generating data by interacting with the environment is either expensive or slow or both of these and instead computation is getting cheaper and cheaper by the year therefore using a model to perform additional updates could allow us to trade off compute for a reduction in the number of interactions that we need with the environment and this will make our algorithms more data efficient and therefore extend the range of problems we cannot successfully apply them to we will start now with a bit of due diligence in a sense so we'll start by discussing different types of learned models and how to train them and only then we will delve into concrete algorithms for planning but before we go there if you need a moment before jumping in this is a good spot for a break if we're going to learn a model and use user model for planning the first thing to discuss is what constitutes a model in the first place and in this course a model will be an approximate representation of an mdp and for now we will assume that the dynamics of the environment of this mdp will be approximated by a parametric function with some parameters data and that states and actions in the model are the same as in the real mdp and also we will assume that given states and actions sampling from the model will estimate the immediate successor states and the immediate rewards that we would get by executing those same actions in those same states in the real environment and it's good to realize that this is not the most general characterization of a model there are other types of models we could consider they are quite related to this but different and we will actually discuss a few of these in the in following lectures so for instance we will consider and discuss non-parametric models where we use the data directly to sample transitions in the environment and also backward models that models for instance the dynamics of an inverse mdb rather than the mdp you are directly interested in and also jumping models where the actions do not directly map on the set of primitive actions in mdb we will discuss some of these later but for now let's focus on the simplest kind of model because this is already sufficient to discuss quite a big range of ideas and the support of a good range of planning algorithms so with this premises the problem of learning model from experience basically can be formulated as a supervised learning problem where our data set contains a set of state action spheres and the labels of these are the true states that we add up the rewards that we will observe when executing those actions in those states and the parameters of the model will then have to be chosen as in 
standard supervised learning so as to minimize some prediction error on this data defining a procedure for learning a model will then amount to picking a suitable parametric function for the transition and rewards functions picking a suitable loss to minimize and choosing a procedure for optimizing the parameters it does so as to minimize this loss so for instance we could choose to parametrize transitions and rewards using some linear mapping from a suitable encoding of states and actions onto for instance the rewards and the some encoding of the successor state and then minimizing the parameters could be achieved by using least squared this would give us an expectation model for instance and specifically will give us a linear expectation model note that this is not the only kind of expectation all we could consider so for instance using the same kind of loss this same mean squared error we could for instance parameterize the model itself both the reward and transitions using a deep neural network and maybe use gradient descent instead of least squares to optimize the parameters regardless of the specific choice of the parametrization though it's good to understand what are the pros of cons of expectation models in general so some disadvantages for instance are quite obvious suppose you have some macro action that randomly puts you either to the left or to the right of some wall well an expectation model will would for instance put you inside of the wall which might not be a physically meaningful thing to predict in the first place interestingly though the values could still be correct so if the values specifically are a linear function of the state or absorb the encoding of the state then actually the linear transformation will commute with expectation and therefore the value of the expected value of the next state will actually be the same as the value of the expected next state even though the expectedness state might have some peculiar features like placing you in the middle of the wall and this actually applies regardless as to the parametrization of the model itself so it only requires the values to be a function a linear function of their state representation but additionally if both the value and the model are leaner then this even applies to enrolling the model multiple steps into the future which is quite powerful and interesting property it might still be that we're not willing to accept this trade-off so we might not be it might not be possible to make either the value or the model linear and in this case maybe expected state might not be the right choice because the values that we can expect from expected states might not have the right semantics we might not be able to derive expected values from expected states and in this case we could consider a different type of model so we could consider what's called a stochastic or generative model so generative model generates not expectations like an expectation model but samples so it takes as input a state an action and a noise variable and it it will return you a sample of a possible next state and a possible reward and if we can effectively train a generative model then we can plan in our imagination so we could for instance generate long sequences of hypothetical transitions in the environment and compare them to choose how to best proceed and this could be done even if the both the model and the values are not linear the flip side of course is that generative models add noise via the sampling process there's also a third 
option which is to actually attempt to model the complete transition dynamics so model the full transition distribution and full reward distribution and if we could do so actually both expectation and generative models and even values themselves could be all derived as a sub-product of our model of our full transition and reward model the problem is that it might be hard to model the complete distribution and or it might be computationally intractable to actually apply it in practice due to the ranching factor so for instance if we were to try to estimate the value of the next state with a one step look ahead this would still require two sum overall actions and for each action sum over all possible successor states and things get exponentially worse if we even want to look multiple steps ahead and as a result even if we had this full distribution model some form of sampling would probably be needed for most problem of interest and in that case the advantage of the complete distributions compared to for instance just a generative model might might be smaller in the in addition to the type of model we might want to train there is the second additional important dimension which is how we parametrize the model itself so a very common choice is to first of all decompose the dynamics of the rewards and the transitions into two separate parametric functions and then for each of these we can choose among many different parameter parameterizations in this course for instance we'll consider table lookup models or linear expectation models and finally also deep neural network models in the table lookup case the model is an explicit representation of ndp where for instance we may estimate the probability of observing a given successor state for a given state action pair as the empirical probability based on past experience and similarly we could parametrize the reward function as the empirical mean of the rewards that we did actually observe in practice when executing a given action in a given state in this case we would actually get a mixed model because our our table account model would use a full distribution for the transition dynamics and an expectation model for the rewards it is of course possible also to model the complete distributions for the rewards themselves so let's consider for instance our minimal reinforcement learning example where we have the usual two-state mdp with no discounting and we observe eight episodes of experience so once in one in one of the episodes we start in a we transition to b and we observe no reward in all the other episodes we start in b and we terminate immediately and in all but one we observe a reward battle of one and in one of them we observe a reward of zero a table lookup model would model the complete distribution of the environment dynamics as in the picture on the slide so in state a with probability one we transition to b and we absorb the reward of zero and in state b with probability one we terminated but with probability 0.75 we observe the reward of one while with probability 0.25 we absorb no reward at all if instead we consider a linear expectation model it's good to remember that first of all we need some feature representation to be given to us because the linear model doesn't have the capability to build and its own state representation but given this some feature representation we could encode any state s as some feature set for instance a vector and then parameterize separately reward and transitions as two linear functions of the features 
of the state so for instance expected next state could be parametrized by a square matrix ta one for each action a in such a way that the matrix times the features of the current state give us the expected next state and similarly rewards could be parameterized by a vector wa for each action a so that the dot product between the vector and the current state representation gives us again the expected reward if we do so then on each transition on each transition where we execute some action a some state s then observe some reward r and next state as prime there we can apply a gradient step for instance to update both wa and ta so as to minimize the mean squared error on both the reward and the expected state we are now ready to delve into the true heart of this topic which is how to actually use models to plan and specifically in this section is how to do so for the purpose of improving our value function estimates and our policies so the core let's say value proposition of planning in general of model-based rl is the capability of improving such value estimates and these policies by absorbing compute but without requiring actual interactions with the environment as you might remember dynamic programming is actually a good example that we have already seen of a process that allows us to do just this the problem is of course that in dynamic programming we're given a privileged access to a perfect specification of the environment but in reinforcement learning in general we are actually interested in planning algorithms that do not make this assumption and instead perform planning for instance with alert models such as any any of the ones that we discussed in the previous section in this case the simplest approach to get around this limitation of dynamic programming is maybe to just directly mirror dynamic programming algorithms but plugging in a learn model in place of the privileged specification of the environment if we do so then we can just solve for the approximate mdp with what whatever dynamic programming algorithm you like so you could use value iteration or policy iteration and by doing so we we still require some interactions with the environment because we will still need to learn the model but at least hopefully we can reduce quite a bit of reliance on on the unreal interactions and and therefore be more data efficient so this approach actually um can work and it's it's a reasonable things to consider but it's also not not the only option so for instance there's another route we could take so we could combine learn models not with dynamic programming but instead with the ideas and algorithms that we discussed in the context of model free agents so this is also referred to as sample-based planning and very simply it works by using the learn model to sample imagine transitions imagine trajectories and then applying your favorite model three algorithms to the imagined data just as if it was real datas so for instance you could apply monte carlo control sarsa or q learning and just apply the same updates on the imagine data as sampled from the approximate ndp that you have learned as if it was real data sampled from the environment and let's consider a concrete example to make this clear so this is the same example that we discussed in the previous section so on the left we have the real experience we have access to so eight episodes um most of them starting in uh in a state b and then observing either a zero or one reward and one actually starting in state a and the first transition 
into b and in the middle is actually the learn table lookup model that we had learned from such data this gives us a hundred percent likelihood of transitioning from a to b and a hundred percent likelihood of terminating it b but with a 75 probability of getting a reward of one and 25 percent of getting a reward of zero the idea of sample-based planning would be take this model and then sample trajectories episodes from this model and this could be for instance the data that we report on the right so all of this data is sampled from the from the model that we have learned and first we have now two episodes the sardine both which transition to b because this is forced by the model and then in this case both episodes were sampled the reward of one when they were in b plus the number of episodes starting in b and terminating immediately again consistently with the model so what happens if we apply monte carlo learning to this data well we will get for instance that the value of a is one because in both episodes we when we start in a we collected a reward of one at the end and the value of b that is equal to 0.75 because in the sample episodes that we that we got from our model three three-quarters of the time would serve the reward of one and b so in this case it's perfectly in in line with the troop with the learned probabilities from the model so this is a cool idea and and also using dynamic programming directly with the lower model could be a very interesting approach but but what are the potential limitations with both of these ideas as always the the greatest concern is that the learning model might not be perfect actually in general we know it will incorporate some errors and some approximations and in this case the planning process may compute a policy that is optimal for the approximate mdb but that is sub-optimal with respect to the real environment and so our presentation couldn't be complete if we didn't discuss how to deal with these limitations and and there are many ways we could go for this so one approach would be very very simply whenever the model is not good enough it's not reliable enough just use model 3 algorithms another idea would be let's try to reason explicitly about our uncertainty over our estimates of the model so for instance this would lead us more towards the bayesian approach and finally the third approach which is what we will delve more in today is actually to combine model-free and model-based methods integrating learning and planning in a single algorithm how can we do this well in the previous discussions somewhat implicitly we already discussed that there are two potential sources of data to potential sources of experience so one is real experience this is data transitions trajectories that you generate by actually interacting with the environment and another source is simulated experience where the data is sampled from a model for instance a learn model and therefore is sampled from some approximate mdp well the core idea behind dyna which is a powerful approach to integrate learning and planning is let's just treat the two sources of experience as if they were the same thing so while in model 3 we learn values from real experience and in sample based planning we plan values from simulated experience well in dyna let's do both so let's apply the same updates to values and policy regardless of whether data comes from real or simulated experience this very directly implements the picture i had also shown at the very beginning of the chapter so where we 
where i suggested we could maybe act according to our value estimator policies to generate experience and then this experience could be used both to update the values and policies directly but also used to learn a model that could then generate an additional simulated update simulated experience so let's now discuss concrete instantiation of dyna because this this idea is very general but of course we could plug many different algorithms to this and apply very different updates to both the simulated and real experience so what does this actually mean in practice well a popular version of this idea is what is typically be referred to as dyna queue where we use q-learning on both the real and simulated experience and this is what it looks like first we of course need to initialize now both value function like in q learning but also a model because we'll be training it alongside and then on on each step we can select actions using an absolute greedy policy for instance like in standard q learning and after executing this action we of course get from the environment successor state and a reward with this real transition we can directly apply q learning for instance by updating the current q values towards the usual target constructed by summing the reward and the discounted max q value in the next step but we can also update now the model in a supervised fashion for instance in a tabular deterministic environment this would just amount to storing the next state and the reward the next point is where the secret sauce in some sense of dyna comes in so basically for each real step in the environment we now perform n planning updates where we take some past the state action pair that we have encountered during training and we use the current model to sample a possible next state and a possible subsequent reward and we use this imagine transition to perform an additional q-learning update and of course if as we increase the number of imagine steps that we do we can basically reduce our reliance on on the real transitions and become more data efficient as we were hoping interestingly of course there are many different ways of extending or implementing this idea so for instance we could use a different model algorithm instead of q-learning also the algorithm writes in the slide is written for the tabular setting but of course we could use linear functions or deep neural networks to represent both the values or the models and alternatively we could also vary the amount of planning that we do over time so very intuitively the model typically is improved and updated along the way so it's hopefully getting better over time so maybe we could imagine to perform more planning updates as training progresses and we become more confident in our model ideally we would actually even have some form of uncertainty estimates that we can use to choose how much planning to perform how much to trust our planning updates depending on the accuracy of our learned models but regardless of the of the details and potential enhancement the important thing is that even this basic algorithm already instantiates the most fundamental properties that we were looking for so we we can sync in more compute to learn more efficiently which i just want to recall again it's it's really critical if the data is slow or expensive or unsafe to collect and and this i can say this uh enough it's really often the case in many real world applications so let's take a look at what what happens when we actually run these algorithms on not a simple 
problem just to gain more understanding and intuition about planning for credit assignment to do so we will consider a simple maze which is drawn on this slide at the top right where on each episode we use the agent starts on the cell marked with an s and from any cell the agent can take one of four actions which correspond to moving up down left and right respectively the dark cells instead are walls meaning that if we try to move into one of these darker cells the action will have no effect to just bang on the wall and and stay where we are the reward is zero everywhere except for a single positive reward that you collect when you reach the goal cell which is the one marked with a g in the top right corner under this setting the it's it's fairly obvious that the optimal policy for any discount smaller than one is going to be a policy that leads you from s to g following the shortest path because any any any choice of actions that will take you along the longer path will basically result in the final reward reward be discounted more strongly and this means that we can evaluate the quality of the policy learned by an agent for instance by plotting the number of steps that the agent takes in order to reach the goal as a function for instance of the number of episodes that it has experienced and this is what is shown in this plot on this slide this plot is taken from the saturn embargo book so you can also read there about more details about how exactly this was instantiated and run but at the fundamentally this plot just shows three instantiations of the algorithm that i showed in the previous slide that were ran with different amounts of planning specifically it was ran with zero steps of planning meaning that the algorithms just revert to vanilla learning or with five steps of planning or with 50 steps of planning and what is really apparent already from this plot is that vanilla q learning takes many tens of episodes to actually converge to the optimal solution and instead by just adding a tiny bit of planning so just doing five additional updates on for every real update the agent manages to converge to the optimum much quicker in less than 10 episodes and as we increase for instance to 50 planning steps the the agent actually converges in just a couple of episodes to the optimal policy which is which is really impressive might be it might even seem too good to be true in a sense like what is actually happening behind the scenes and in this slide i'm trying to give you basically an opportunity to peek behind the scenes and see what happens to the q values estimate as we run this algorithm and specifically let's consider what happens on the very last step of the first episode so this is when the is basically the first time the agent observes the rewards so for simplicity we will also assume that q values were initialized as zero everywhere under this assumption without planning the the first episode is actually fairly boring so on all intermediate steps the values where zero when the agent enters the cell and actually stays zero because the td error is always zero our estimates are zero their awards are zero the value max q value we bootstrap on is also zero so no update is done throughout all the first episodes only a single q value is updated right on the last transition and this is the q values corresponding to moving upwards from the cell immediately below the goal immediately below that top right corner so and when when the agent is doing this final transition of the first episode 
that q value is correctly updated towards the goal reward and this immediately makes a greedy policy optimal in that one cell because all the other action values stay at zero but this section value of moving up from the state just below the goal becomes actually has a non-zero value so all other states that we have seen throughout the first episode have not been updated q values stay all at zero and for the behavior stay random assuming assuming we are breaking ties at random but this very last state is shown in the bottom left of the in the maze shown on the bottom left well that one gets updated to the correct to the correct policy on the main stream on the bottom right we show instead the state of our estimates of the q values when we run the same algorithm with 50 steps of planning and again we freeze the algorithm at the immediately after we have done run it for just one episode but suddenly you can see that the q values are now have been updated throughout the maze so it's not just the final the final state that has uh realized what the optimal action is but actually the the model the estimate of the q values that the agent has updated throughout this first episode show a meaningful direction for many many different other states and basically the information of how to move from the state to goal has propagated almost all the way to the start which is quite impressive and the reason is that when you do update that first that last cell in the in the in the episode you then start sampling 50 more updates on imagine transitions from the model which means that on each of these other 50 planning updates we will we will be able to propagate information from the new updated q values backwards in state space and this means that well in this case it doesn't quite solve the problem in a single episode but it will just take just a couple more episodes to correctly propagate all information for the state space and figure out what the optimal policy is of course this is this is not all roses right then there are always certain critical issues with model based rl including the fact that in some cases our model will be wrong and in in this experiment i'm trying to give you some intuition of how wrong models can affect learning in a model-based rl system so to to gain some intuition of this let's use again the same algorithm to investigate how an incorrect model may affect training and we will do so by simulating the this this model being wrong by having a sudden change in the dynamics of the problem this means that the model learned from the agent in the first half of training will suddenly be wrong because this the world itself has changed but the effect of this on learning will be quite different depending on the type of error let's see what i mean by this well in the first setting we assume that at the beginning of training the environment looks like on the grid on the left where there is a short path between the state marked as with s as usual and the goal marked as g in the top right corner and once the dyna q agent has learned this policy correctly which is shown in the plot by the cumulative reward becoming linear in the number of steps well at that point we change the environment to the one on the right which is very similar very similar grid but now the optimal policy is not anymore to move to the right to reach the goal but rather the agent needs to take a detour to circumnavigate that wall that is now blocking the path when you do this change in environment suddenly the cumulative reward 
collected by dyna q flattens because that at the beginning the agent will try to continue move toward through the same path that it had learned at the beginning but of course that path is now blocked because the walls have shifted this means that for a few steps the agent will be unable to collect any rewards but if you look at the plot for dyna queue after a while the agent does start collecting more rewards because the agent will update the model to now correctly understand that that it cannot move through that cell because it's now a wall and through planning it will then re-route the the policy of the agent through by updating the q values in order to take the a different route and which which now results in the reward again growing linearly with the number of steps there are several specific agents on this plot and again this plot is taken from the sutton invarta book but that correspond to slightly different instantiations of dyna and the middle one is dyna q which is the canonical one i described while dyna q plus adds some small exploration box which means that actually is able to reroute slightly faster but the core message remains that if the model is inaccurate in this way where the the world actually is harder than the model was initially assuming the behavior of the agent is such that it will actually lead to real transitions that will correct for the model error and therefore the in some sense the learning dynamic is overall quite benign it's good to realize that this is not always the case so let's now consider a very similar setting with a slight difference so now again the world changes but the change in environment is actually easier in some sense so at the beginning you have the situation on the left where the agent needs to take this detour around the wall in order to reach the goal but after changing there is now the detour is still available to the agent but another path has opened up so the agent could take pass to the right of the wall which is a slightly shorter path which would result in higher rewards what happens is that once the q agent has learned to correctly take this detour long detour around the wall in order to reach the goal there is nothing that pushes the agent to explore this other path the the way in which the model is incorrect the fact that there is actually a small is a shorter path is is not reflected in the data that the agent sees because the the error in the model doesn't result in data that can correct for the model error itself so what happens is that once the agent has converged on the solution of taking the long detour it actually might stay just stick with this behavior and will take a very very long time in order to to realize that there was actually an error and this is why in this plot the dyna q algorithms exactly which is exactly the one i described in the pseudocode before will actually continue learning at the same speed the slope is the same because it's still taking always the same long path in order to reach the goal after also after that the world itself has become easier and here you can see actually the dyna q plus that adds this exploration bonus is actually able to figure out this this additional path and i don't want to go into much details about dyna q plus because the details are not important here but the important thing to understand is that thinking deeply for instance about exploration is still important even model-based rl because the way the certain types of errors in your in your model might not necessarily be 
discovered unless you have something that pushes your agent to to execute this behavior that results in useful data the dyna algorithms i described in the previous section is a beautiful instantiation of a system that integrates learning and planning so model-free and model-based updates but it's not the only one and in this section i want to discuss experience replay and its connection to more conventional model based algorithms such as dynam traditionally rl algorithms did not explicitly store experience so it was trivial to place them in one of two groups so model-free methods that do not attempt to explicitly model the dynamics of the environment and directly update values and policy online with the latest data a model-based methods that learn transitions and reward model and then plan value functions using such a model but this kind of sharp distinction is blurred not just by algorithms like dyna that do both but by other developments for instance in the space of model 3 algorithms and indeed many modern rl systems now store transitions in an experience replay buffer and then apply model-free algorithms not just to the latest transitions but also to pass transitions in the buffer and this is actually a fairly common technique for model free agents but it also can be seen as just instantiating dyna but with an experience replay playing the role of a non-parametric model and this is not just a superficial similarity because both systems actually share a fundamental property like which is their ability to sync in compute as i said many times with no additional environment interactions to improve values and policies and in this case the experience replace systems would do so by making many updates on past experience for every new real steps in the environment in in this plot here i i can show that how this this property of the scalability property of planning algorithms um is instantiated by both dyna and a replay q learning agent in the case of a simple grid world so this is a simple grid world where as usual you start in the state s and you need to reach the goal g as fast as possible because there is some some discount strictly smaller than one and what i show on the right side is basically the total number of steps required to reach the goal on the on the top right 25 times if you every time you start from the cell s on the left as a function of the amount of planning so of the amount of updates that you do for each real step in the environment and you can see that regardless of whether the entire transition is replayed like an experience replace system or whether a model is used to infer the subsequent rewards and state given a state action pair the profile actually looks very similar the more additional updates you do for every real step in the environment the more the performance improves and so the faster you reach the goal thing which is basically the lower the total number of steps that are required to reach the goal 25 times and you might might be even tempted to take this to the extreme so our experience replace system and algorithms based on learn parametric models completely equivalent and well in the tabular settings we can actually derive exact equivalences between certain model-based and model-free algorithms and even more in general somewhat trivial if we had a perfect model that outputs exactly the same true reward and successor state as the non-parametric replay system well the two would be exactly equivalent by construction in general though of course any model that 
we use will not be perfect so one question that you may ask is will it ever be better could the inaccuracy that almost certainly come from using a parametric model provide benefits on top of a non-parametric replay system this seems unlikely and it's i think it's fair to say that if we only use the model to reconstruct their words and successor states corresponding to an observed state action pair it seems hard for a parametric model to improve over just replaying those transitions but the reason that algorithms like dyna and model base rl that use parametric learn models are so important is actually a parametric model allows flexibility that goes way beyond that so for instance we could use the model to plan for action selection and this is something we'll discuss in great detail in the next section because if you can query a parametric model for actions that you could want to take in the future then you can use this to construct a plan to decide how to act and this is something that you cannot do if you're just have a non-parametric replay system that can only tell you which action you took in the past in a given state additionally with a parametric model you could do things like counterfactual planning so queria models for actions that you could have taken in the past but did not and related to contrafractual planning you could do what is called backward planning so if instead of modeling the actual dynamics of the problem you model the inverse dynamics so given a state and an action you you the model tries to predict what was the previous reward and previous state well then you can use this to assign credit to states that could have led to a certain outcome in addition to assigning credit as in standard rl to the states that actually preceded that outcome and finally if you have a parametric model you could train this model to add different time skills like an algorithm like dyna doesn't need not to be restricted to train a model exactly on one step transitions you and on the true native time scale of the environment you could train a model to do jumpy predictions and therefore call support what is called jumpy planning and in conclusion it's actually worth noting also that there are computational differences between the parametric model and experience replay so and this might also uh actually play an important role in choosing between the two so for instance querying a replay buffer is very cheap given a state action pair it gives you immediately what was the reward and what was the successor state that you observed in the past well given a state action pair generating a subsequent reward in state using a learned model could be very expensive if the learn model for instance is parameterized by a big neural network and if you look at memory of course things change and for instance the memory requirements of a replay buffer can be quite harsh because the memory will scale linearly with its capacity well a parametric model could achieve good accuracy with a fixed and and comparably smaller memory footprint and overall i think the the the key takeaway that i would like you to get is that both are important and powerful ideas that implement this core principle of planning this capability of syncing in compute in order to improve our learning and our algorithms and regardless of the labels that we want to attach to them it's really more important to just think deeply about the problem that you want to solve and whether a parametric or a parametric model can be the best fit for for that 
specific problem so far we discussed how planning can be used to improve our estimates of a global value function or a global policy that is applicable in all the states of our mdp but now i want to tell you about a different form of planning where we sync and compute as usual without requiring additional interactions with the environment but for the purpose of just selecting a single action in a single state this is something is also called planning in the nail and it may seem a special case of the previous problem and of course in a sense it is because if we could get perfect values and perfect policies everywhere then we could just use these perfect values in any one state but the motivation for investigating planning for action selection is that sometimes it's actually easier to make a very accurate local value function than it is to actually fit the global by function and the reason is that in in planning if a value function for the current state we only need to take into account the distribution of states that are are reachable from the current state and that that might be a very small portion of the overall mdp and additionally planning for action selection has a few other appealing properties like the fact that suppose you have some inaccuracy in your model in in some states well that will only affect the local throwaway values that you're using to select an action right now but it won't pollute like a shared global value function that is going to be reused everywhere so it might result in sure selecting a sub-optimal action in certain poorly understood states but maybe that just leads to reasonable exploration and for instance behaviors that can help you correct your the model itself the simplest form of planning for the purpose of action selection is what is called forward search so this approach allows you to select the best action in any given state by basically building the entire search tree that has the current state as root and then follows all possible trajectories from from the current state onwards until episode terminations basically this approach is amounts to representing as a tree the entire sub-mdp of reachable states from the current one and in some cases this sub mdp might actually be fairly tiny and then you can maybe solve that in full every time you need to select an action in general of course this might not be the case so in in general the number of the states in the tree will actually grow exponentially for instance with the depth and so even with a fairly small action space it might be computationally intractable if the horizon that you need to look look to goes beyond a few handful of steps but this is still a reasonable thing to consider at least from a conceptual perspective because then sure we have a problem with branching but we have dealt with branching in the past so similarly to how we did for learning global value functions we can again use sampling also for solving these local mdps just for the purpose of action selection and this actually results in what is called simulation based search here you we we construct a partial tree so basically we start as usual from the current state as the root of some tree but then we construct a subset of the full tree by simulating multiple episodes of experience that all start from the current state and then use the model to roll forward possible sequences of actions if you construct such a partial tree then then you can just for instance apply model 3rl to the simulated episodes to estimate the values in the 
root so for instance you could instantiate simulation-based search by using multi-current prediction to estimate the state value of a root node by doing just by using just two components you have some learned model and some simulation policy and then the state value in the root will just be estimated by sampling using the model and the policy k episode referred and then just averaging this to estimate the state value of the root of course if we're interested in planning for action selections and we are just the state the value in the root might not be that useful so often we will want to actually construct action values but the same principles apply very very naively just also to plan the local action value functions in this case we'll need k times the number of actions episodes to ensure we sample each action k times in the root but apart from that then we can use the same mechanism to generate complete episodes from the model and the simulation policy and then just again use averaging to estimate the the value of each action in the root as the average of the returns that did execute that that action immediately in the root and then followed the simulation policy afterwards if we construct such a local action value function then it's trivial to turn this into a mechanism for action selection because we we could for instance in each state always pick the action with the highest value according to our local search in the simulation based search algorithms that that i just described each simulation is independent of the previous ones and this means that we might not actually be making the best use of the available compute so it has some computational advantages because it's fully parallelizable but at the same time we are not using what we have learned by rolling the model forward until the end of the episode to guide our behavior in the next simulation so in this in in this second part i want to discuss a different approach where we build the tree incrementally so that we can leverage knowledge from previous simulation in order to focus our computational resources where they matter most so the algorithm is called the monte carlo research and was for instance at the heart of the famous alpha zero system and alphago and more recently the mu0 system that that showed that they could literally learn at the level of world champions games like go chess shogi and even video games and the algorithm is fairly simple because it's just based on repeating the following four steps until you have exhausted the computational time or computational resources that you have allocated to the action selection and on each step for each simulated episode what you do is just you start from the root and you use your current estimates of the q values based on the previous simulations to walk all the way from the root to a leaf node in the tree then what you do is you add the node to the tree by expanding the the action with the highest value in the leaf node you then use a rollout until episode termination using a fixed stimulation policy to get a complete evaluation of of that that that path in the tree and then you walk backwards all the way to the root updating all the q values of all the ancestor ancestor nodes in the tree and once you have exhausted the available time and resources you just as in the previous algorithm just select the action in the root that has the highest value the the important feature of this approach is that effectively we have two policies we are not always expanding nodes and building the 
tree using a fixed stimulation policy instead we have a tree policy that we use to navigate from the root to a leaf that is based on all the previous simulations so it's guided by all that we have learned in the previous simulations in the in this state and then we have a rollout policy that is this indeed fixed that we use to get the monte carlo estimate of the return from the leaf onwards and often the the the advantage is that we the rollout policy might be very cheap so for instance we could just have a random policy even that can be surprisingly effective but because we are iterating on the tree policy that we use to navigate all the way to the leaf the the system actually allocates resources and compute quite effectively and can result in very good value estimates of course at this stage it might still feel a bit fuzzy so as usual let's try to walk through a concrete example so we can understand how this algorithm works in practice and let's consider a situation where we have some state that we are interested to select an action in and we have always two actions in every state so at the beginning the tree is empty so what we do is we just the the initial state is is a leaf itself and we just use a default policy to to roll out until an episode termination so this will for instance give us if we roll out the policy in the model until we the model tells us that the episode has terminated maybe we get a reward a return of one then we update our value of the root node to be one because it's the average of the existing simulations then we expand one node so for instance we have no additional knowledge let's say we simply randomly choose to pick the the right action then what we do is again we have reached the leaf we use the default rollout policy to get in the entire episode evaluation this time we observe that we get an absolute return of zero but now we go back and we update not just the node we had just expanded but all the ancestors ancestors which now included the root node so the evaluation of the node we just expanded will be zero but the evaluation of the value of the root will have been updated to one half since the value in the root is higher than the value of the actions that we have selected then let's try to to expand the different action and again after reaching this leaf node we use the rollout policy to get an evaluation this time we serve a return of one so we update the value of this node we just expanded but also the root so the root is now has the value of two thirds and each of the actions in the root have an estimate of one and zero respectively since the the the value in the on the left side is higher let's now again we navigate until the leaf we expand one node we do an evaluation this time we observe a zero we back up all the updates so the the newly upped added node has a value of zero his parent has now a value of one half and the root node now have also a value of one half because two out of four the episodes that started in the route uh had the reward of one and the other two had no reward at all again we navigate until the leaf we pick the action with the with the highest value and we expand that node we roll out until the end we get an evaluation of one now again we have we have to update all the values in the parent nodes to include this the latest information that we have generated with this rollout again you start in the node you follow the q values to reach the the leaf by always selecting the highest value and then expand again and if you iterate this 
Hello, I'm Matteo Hessel, and I'll be teaching the reinforcement learning module together with Hado and Diana. Today I'll be talking about models and planning. In the past few lectures we already covered quite a bit of material. We started with bandits, a minimal instantiation of the RL problem in which we are allowed to disregard the sequential nature of decision making and focus exclusively on trading off exploiting existing knowledge against exploring alternative courses of action that might seem sub-optimal now but could turn out to be better as we gain experience in the environment. We then discussed dynamic programming and model-free algorithms as two families of ideas that allow us to solve both prediction and control problems, with and without access to a complete specification of the environment. Finally, we introduced function approximation as the leading approach to scaling these families of ideas to very large or even continuous state spaces. If we look back at dynamic programming and model-free algorithms, we can roughly sketch their underlying principles and differences as follows. In dynamic programming we assume privileged access to a complete, exact specification of the environment dynamics, including both the state transition and the reward dynamics.
We then just need to solve for this perfect model, and we don't need to interact with the real environment at all. In model-free algorithms, instead, we do not rely on any given model: we learn values and policies directly from interaction with the environment. In between these two extremes there is a third prototypical family of algorithms, typically referred to as model-based reinforcement learning. Here we are not given a perfect model, but we do attempt to learn at least an approximate model while interacting with the environment, and we then use this learned model to derive values and policies. A visual depiction of a model-free algorithm would show a loop where we act according to our current value estimates and policies to generate experience, and then use this experience to directly update our values and policies. A model-based algorithm adds a step of indirection to this loop: we still act according to our current value estimates and policies, just like in model-free RL, but we use the resulting experience to learn a model of the state transition and reward dynamics, and then use the learned model, instead of the experience directly, to update the values and policies, thereby still closing the loop. Of course, the boundaries between these two families don't need to be rigid, and we can perfectly well combine the two ideas: use experience both to update values and to learn a model, and then use the model to perform additional updates and refine our value estimates. Today we will discuss a popular instantiation of this idea, the Dyna algorithm.

Whenever we delve into a new topic, it is good to stop for a second and reflect on the motivations that underlie a family of ideas. There are obvious disadvantages to this line of attack: if we first learn a model and then use the model to plan a value function, there are two sources of error, because whatever model we learn from data is unlikely to be perfect; its approximations and errors will compound with the additional errors that arise when planning with the model to estimate values. In contrast, model-free algorithms use real data, which is always correct, to update the values, so there is only one source of approximation, the value estimation process itself. Despite this, there are reasons to believe that model-based reinforcement learning can be quite effective. First, models can be learned with supervised learning methods, which are well understood, powerful and effective. Second, explicitly representing a model gives us access to additional capabilities: we could use it to represent uncertainty or to drive exploration. Finally, and not less importantly, in many real-world settings generating data by interacting with the environment is expensive or slow or both, while computation is getting cheaper every year; using a model to perform additional updates lets us trade compute for a reduction in the number of interactions we need with the environment, which makes our algorithms more data efficient and therefore extends the range of problems we can successfully apply them to.
We will start with a bit of due diligence: first we will discuss the different types of learned models and how to train them, and only then delve into concrete algorithms for planning. Before we go there, if you need a moment before jumping in, this is a good spot for a break.

If we are going to learn a model and use that model for planning, the first thing to discuss is what constitutes a model in the first place. In this course a model is an approximate representation of an MDP. For now we will assume that the dynamics of this MDP are approximated by a parametric function with some learnable parameters, that states and actions in the model are the same as in the real MDP, and that, given a state and an action, sampling from the model estimates the immediate successor state and the immediate reward we would get by executing that action in that state in the real environment. It is good to realize that this is not the most general characterization of a model. There are other, related types of models we could consider, and we will discuss a few of them in following lectures: non-parametric models, where we use the data directly to sample transitions; backward models, which model the dynamics of an inverse MDP rather than the MDP we are directly interested in; and jumpy models, whose actions do not map directly onto the set of primitive actions of the MDP. For now let's focus on the simplest kind of model, because it is already sufficient to support a good range of planning algorithms. With these premises, learning a model from experience can be formulated as a supervised learning problem: the dataset contains state-action pairs, the labels are the true successor states and rewards observed when executing those actions in those states, and the parameters of the model are chosen, as in standard supervised learning, to minimize some prediction error on this data. Defining a procedure for learning a model then amounts to picking a suitable parametric function for the transition and reward functions, picking a suitable loss to minimize, and choosing a procedure for optimizing the parameters so as to minimize this loss. For instance, we could parametrize transitions and rewards as a linear mapping from a suitable encoding of states and actions onto the rewards and onto an encoding of the successor state, and fit the parameters by least squares; this would give us an expectation model, and specifically a linear expectation model. Note that this is not the only kind of expectation model we could consider: using the same mean squared error loss, we could parameterize both rewards and transitions with a deep neural network and use gradient descent instead of least squares to optimize the parameters.
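To make the supervised learning view concrete, a common choice (the notation here is illustrative rather than taken from the slides) is a squared-error loss on both predictions, with the model parameters $\eta$ fit by least squares in the linear case or by gradient descent in the neural case:

$$
\mathcal{L}(\eta) \;=\; \sum_{t} \Big[\, \big\| \hat{s}_\eta(S_t, A_t) - S_{t+1} \big\|^2 \;+\; \big( \hat{r}_\eta(S_t, A_t) - R_{t+1} \big)^2 \Big],
$$

where $\hat{s}_\eta$ and $\hat{r}_\eta$ denote the model's next-state and reward predictions.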
Regardless of the specific parametrization, it is good to understand the pros and cons of expectation models in general. Some disadvantages are quite obvious: suppose some macro-action randomly puts you either to the left or to the right of a wall; an expectation model would predict a next state somewhere inside the wall, which might not be a physically meaningful thing to predict in the first place. Interestingly, the values could still be correct: if the values are a linear function of the state encoding, then the linear transformation commutes with the expectation, and the value of the expected next state equals the expected value of the next state, even though the expected state itself might have peculiar features such as placing you in the middle of a wall. This holds regardless of the parametrization of the model; it only requires the values to be a linear function of the state representation. Additionally, if both the value and the model are linear, the same argument applies to unrolling the model multiple steps into the future, which is quite a powerful and interesting property.

It might be, however, that we are not willing or able to make either the value or the model linear, in which case predicting expected states might not be the right choice, because values of expected states would not have the right semantics: we cannot in general derive expected values from expected states. In that case we could consider a different type of model, a stochastic or generative model. A generative model produces not expectations but samples: it takes as input a state, an action and a noise variable, and returns a sample of a possible next state and reward. If we can train a generative model effectively, then we can plan in our imagination: we could, for instance, generate long sequences of hypothetical transitions and compare them to choose how best to proceed, even when neither the model nor the values are linear. The flip side is that generative models add noise via the sampling process. There is also a third option, which is to model the complete transition and reward distributions. If we could do so, expectation models, generative models and even values themselves could all be derived as by-products of the full model. The problem is that the complete distribution might be hard to model, and computationally intractable to apply in practice because of the branching factor: even estimating the value of the next state with a one-step lookahead requires summing over all actions and, for each action, over all possible successor states, and things get exponentially worse if we want to look multiple steps ahead. As a result, even with a full distribution model, some form of sampling would probably be needed for most problems of interest, and in that case the advantage over, say, a generative model might be smaller.
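Two of the claims above can be written compactly (notation again illustrative). If values are linear in the state features, $v_{\mathbf{w}}(s) = \mathbf{w}^{\top}\phi(s)$, and the expectation model outputs $\hat{\phi} \approx \mathbb{E}[\phi(S_{t+1})]$, then applying the value weights to the model's output recovers the expected value of the next state; and with a full distribution model $\hat{p}$, even a one-step lookahead already sums over every action and every possible successor state:

$$
\mathbf{w}^{\top}\hat{\phi} \;\approx\; \mathbf{w}^{\top}\,\mathbb{E}\big[\phi(S_{t+1})\big] \;=\; \mathbb{E}\big[\mathbf{w}^{\top}\phi(S_{t+1})\big] \;=\; \mathbb{E}\big[v_{\mathbf{w}}(S_{t+1})\big],
$$

$$
\mathbb{E}\big[v(S_{t+1}) \mid S_t = s\big] \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} \hat{p}(s' \mid s, a)\, v(s').
$$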
In addition to the type of model we want to train, there is a second important dimension: how we parametrize the model itself. A very common choice is to first decompose the dynamics into two separate parametric functions, one for rewards and one for transitions, and then for each of these choose among many different parameterizations. In this course we will consider table lookup models, linear expectation models and, finally, deep neural network models. In the table lookup case the model is an explicit representation of the MDP: we may estimate the probability of observing a given successor state for a given state-action pair as the empirical probability based on past experience, and similarly parametrize the reward function as the empirical mean of the rewards actually observed when executing a given action in a given state. In this case we would get a mixed model, because our table lookup model uses a full distribution for the transition dynamics and an expectation model for the rewards; it is of course possible to model the complete distribution for the rewards as well. Let's consider our minimal reinforcement learning example, the usual two-state MDP with no discounting, where we observe eight episodes of experience: in one episode we start in A, transition to B and observe no reward; in all the other episodes we start in B and terminate immediately, and in all but one of them we observe a reward of one, while in one we observe a reward of zero. A table lookup model would model the complete distribution of the environment dynamics as in the picture on the slide: in state A, with probability one we transition to B and observe a reward of zero; in state B, with probability one we terminate, and with probability 0.75 we observe a reward of one while with probability 0.25 we observe no reward at all.

If instead we consider a linear expectation model, remember first of all that some feature representation needs to be given to us, because a linear model cannot build its own state representation. Given such a representation, we can encode any state s as a feature vector and then parameterize reward and transitions separately as two linear functions of the state features: the expected next state can be parametrized by a square matrix T_a, one for each action a, such that the matrix times the features of the current state gives the expected next-state features, and the rewards by a vector w_a for each action a, such that the dot product between the vector and the current state features gives the expected reward. If we do so, then on each transition, where we execute some action a in some state s and observe a reward r and next state s', we can apply a gradient step to update both w_a and T_a so as to minimize the mean squared error on both the reward and the expected next state.
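A minimal sketch of such a linear expectation model: the per-action matrix T_a and vector w_a are as just described, while the feature map, learning rate and class interface are assumptions of mine.

```python
import numpy as np

class LinearExpectationModel:
    """Per-action linear model: expected next features T[a] @ phi(s), expected reward w[a] @ phi(s)."""

    def __init__(self, num_actions, num_features, lr=0.1):
        self.T = np.zeros((num_actions, num_features, num_features))  # next-state matrices T_a
        self.w = np.zeros((num_actions, num_features))                # reward vectors w_a
        self.lr = lr

    def predict(self, phi_s, a):
        # Returns (expected next-state features, expected reward).
        return self.T[a] @ phi_s, self.w[a] @ phi_s

    def update(self, phi_s, a, r, phi_next):
        # One gradient step on the squared prediction errors for this transition.
        pred_next, pred_r = self.predict(phi_s, a)
        self.T[a] += self.lr * np.outer(phi_next - pred_next, phi_s)
        self.w[a] += self.lr * (r - pred_r) * phi_s
```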
We are now ready to delve into the heart of this topic, which is how to actually use models to plan, and specifically, in this section, how to do so for the purpose of improving our value function estimates and our policies. The core value proposition of planning, and of model-based RL in general, is the capability of improving value estimates and policies by absorbing compute, without requiring actual interactions with the environment. As you might remember, dynamic programming is a good example we have already seen of a process that does exactly this; the problem, of course, is that dynamic programming assumes privileged access to a perfect specification of the environment, while in reinforcement learning we are interested in planning algorithms that do not make this assumption and instead plan with learned models such as the ones discussed in the previous section. The simplest way around this limitation is to directly mirror dynamic programming, plugging a learned model in place of the privileged specification of the environment. We can then solve the approximate MDP with whatever dynamic programming algorithm we like, such as value iteration or policy iteration. Doing so still requires some interaction with the environment, because we still need to learn the model, but hopefully we can reduce our reliance on real interactions quite a bit and therefore be more data efficient. This approach can work and is a reasonable thing to consider, but it is not the only option. Another route is to combine learned models not with dynamic programming but with the ideas and algorithms we discussed in the context of model-free agents. This is referred to as sample-based planning, and it works very simply: use the learned model to sample imagined transitions and trajectories, and then apply your favourite model-free algorithm, for instance Monte Carlo control, SARSA or Q-learning, to the imagined data sampled from the approximate MDP you have learned, just as if it were real data sampled from the environment.

Let's make this concrete with the same example as in the previous section. On the left we have the real experience we have access to: eight episodes, most of them starting in state B and observing either a zero or a one reward, and one starting in state A and first transitioning into B. In the middle is the table lookup model learned from this data: a hundred percent probability of transitioning from A to B, and a hundred percent probability of terminating in B, with a 75 percent probability of a reward of one and 25 percent of a reward of zero. The idea of sample-based planning is to take this model and sample trajectories from it, which could give us the data shown on the right: two episodes starting in A, both transitioning to B, because this is forced by the model, and in this case both happening to sample a reward of one in B, plus a number of episodes starting in B and terminating immediately, again consistently with the model. What happens if we apply Monte Carlo learning to this data? We get a value of one for A, because in both sampled episodes starting in A we collected a reward of one at the end, and a value of 0.75 for B, because in the episodes sampled from our model three quarters of the time we observed a reward of one in B, which is perfectly in line with the probabilities of the learned model.
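A minimal sketch of sample-based planning on this example. The tables below encode the learned table-lookup model; the single action name 'go' is hypothetical, just to give the tables a key. Monte Carlo averaging over many imagined episodes recovers the model's values, while with only two imagined episodes from A the estimate can come out as 1.0, exactly as in the sampled data on the slide.

```python
import random

# Table-lookup model learned from the eight episodes (undiscounted; None marks termination).
transitions = {('A', 'go'): 'B', ('B', 'go'): None}
reward_samples = {('A', 'go'): [0.0], ('B', 'go'): [1.0, 1.0, 1.0, 0.0]}

def imagined_episode(start):
    """Roll the learned model forward until termination and return the (undiscounted) return."""
    state, ret = start, 0.0
    while state is not None:
        ret += random.choice(reward_samples[(state, 'go')])
        state = transitions[(state, 'go')]
    return ret

# Monte Carlo evaluation on imagined episodes: with many samples both estimates approach 0.75.
for start in ('A', 'B'):
    returns = [imagined_episode(start) for _ in range(10000)]
    print(start, sum(returns) / len(returns))
```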
This is a neat idea, and using dynamic programming directly with a learned model can also be an interesting approach, but what are the potential limitations of both? As always, the greatest concern is that the learned model might not be perfect; in general we know it will incorporate some errors and approximations, and in that case the planning process may compute a policy that is optimal for the approximate MDP but sub-optimal with respect to the real environment. Our presentation would not be complete without discussing how to deal with this limitation, and there are several ways to go. One approach is, very simply, whenever the model is not reliable enough, to just use model-free algorithms. Another is to reason explicitly about our uncertainty over the model estimates, which would lead us towards a Bayesian approach. The third approach, which is the one we will delve into today, is to combine model-free and model-based methods, integrating learning and planning in a single algorithm. How can we do this? In the previous discussion we already saw, somewhat implicitly, that there are two potential sources of experience: real experience, the transitions and trajectories you generate by actually interacting with the environment, and simulated experience, where the data is sampled from a model, for instance a learned model, and therefore from some approximate MDP. The core idea behind Dyna, which is a powerful approach to integrating learning and planning, is to treat the two sources of experience as if they were the same thing. While in model-free RL we learn values from real experience, and in sample-based planning we plan values from simulated experience, in Dyna we do both: we apply the same updates to values and policies regardless of whether the data comes from real or simulated experience. This very directly implements the picture I showed at the beginning of the chapter, where we act according to our value estimates and policies to generate experience, and this experience is used both to update the values and policies directly and to learn a model that can generate additional simulated experience.

Let's now discuss a concrete instantiation of Dyna, because the idea is very general and we could plug in many different algorithms and apply different updates to the simulated and real experience. A popular version is what is typically referred to as Dyna-Q, where we use Q-learning on both the real and the simulated experience. It looks like this: first we initialize both a value function, as in Q-learning, and a model, because we will be training it alongside. Then, on each step, we select actions using, for instance, an epsilon-greedy policy, as in standard Q-learning, and after executing the action we get a successor state and a reward from the environment. With this real transition we can directly apply Q-learning, updating the current Q-values towards the usual target constructed by summing the reward and the discounted max Q-value in the next state; but we can also update the model in a supervised fashion, which in a tabular, deterministic environment just amounts to storing the observed next state and reward. The next point is where the secret sauce of Dyna comes in: for each real step in the environment we now perform n planning updates, where we take some past state-action pair encountered during training, use the current model to sample a possible next state and subsequent reward, and use this imagined transition to perform an additional Q-learning update. As we increase the number of imagined updates, we reduce our reliance on real transitions and become more data efficient, as we were hoping.
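A minimal tabular Dyna-Q sketch in the spirit of what was just described; the environment interface (reset/step returning (next_state, reward, done), a num_actions attribute) and the hyper-parameter values are assumptions of mine rather than the slide's pseudocode.

```python
import random
from collections import defaultdict

def dyna_q(env, num_steps, n_planning=5, alpha=0.1, gamma=0.95, epsilon=0.1):
    q = defaultdict(float)                  # q[(state, action)]
    model = {}                              # model[(state, action)] = (reward, next_state, done)
    actions = list(range(env.num_actions))  # assumed discrete action space

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    s = env.reset()
    for _ in range(num_steps):
        a = epsilon_greedy(s)
        s_next, r, done = env.step(a)

        # Direct (model-free) Q-learning update from the real transition.
        target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

        # Model learning: in a deterministic tabular world, just memorize the observed outcome.
        model[(s, a)] = (r, s_next, done)

        # Planning: n extra Q-learning updates on transitions imagined from the model.
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
            ptarget = pr if pdone else pr + gamma * max(q[(ps_next, b)] for b in actions)
            q[(ps, pa)] += alpha * (ptarget - q[(ps, pa)])

        s = env.reset() if done else s_next
    return q
```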
Interestingly, there are many ways of extending or implementing this idea. We could use a different model-free algorithm instead of Q-learning; the algorithm on the slide is written for the tabular setting, but we could use linear functions or deep neural networks to represent the values or the model; and we could vary the amount of planning over time. Intuitively, the model is improved and updated along the way, so it is hopefully getting better over time, and we could imagine performing more planning updates as training progresses and we become more confident in the model. Ideally we would even have some form of uncertainty estimate to choose how much planning to perform and how much to trust planning updates, depending on the accuracy of our learned model. Regardless of the details and potential enhancements, the important thing is that even this basic algorithm already instantiates the fundamental property we were looking for: we can sink in more compute to learn more data-efficiently, which, and I cannot say this enough, is really critical whenever data is slow, expensive or unsafe to collect, as is often the case in real-world applications.

Let's take a look at what happens when we run these algorithms on a simple problem, just to gain more understanding and intuition about planning for credit assignment. We consider the simple maze drawn at the top right of this slide: on each episode the agent starts in the cell marked S, and from any cell it can take one of four actions, moving up, down, left and right respectively. The dark cells are walls, so trying to move into one of them has no effect: we just bang against the wall and stay where we are. The reward is zero everywhere except for a single positive reward collected when reaching the goal cell, marked G, in the top right corner. In this setting it is fairly obvious that, for any discount smaller than one, the optimal policy leads from S to G along the shortest path, because any choice of actions that takes a longer path results in the final reward being discounted more strongly. This means we can evaluate the quality of the policy learned by an agent by plotting the number of steps it takes to reach the goal as a function of the number of episodes it has experienced.
This is what is shown in the plot on this slide, which is taken from the Sutton and Barto book, where you can also read more about how exactly the experiment was instantiated and run. Fundamentally, the plot shows three instantiations of the algorithm from the previous slide, run with different amounts of planning: zero planning steps, which means the algorithm reverts to vanilla Q-learning, five planning steps, and fifty planning steps. What is immediately apparent is that vanilla Q-learning takes many tens of episodes to converge to the optimal solution, while just adding a tiny bit of planning, five additional updates for every real update, lets the agent converge much quicker, in fewer than ten episodes, and with fifty planning steps the agent converges to the optimal policy in just a couple of episodes, which is really impressive, maybe even seemingly too good to be true. So what is actually happening behind the scenes?

This slide gives you an opportunity to peek behind the scenes and see what happens to the Q-value estimates as we run the algorithm. Specifically, let's consider the very last step of the first episode, which is the first time the agent observes the reward, and assume for simplicity that the Q-values were initialized to zero everywhere. Under this assumption, without planning the first episode is fairly boring: on all intermediate steps the values are zero when the agent enters a cell and stay zero, because the TD error is always zero: our estimates are zero, the rewards are zero, and the max Q-value we bootstrap on is also zero. So no update is made throughout the first episode except for a single Q-value, updated on the last transition: the one corresponding to moving up from the cell immediately below the goal in the top right corner. On that final transition of the first episode this Q-value is correctly updated towards the goal reward, which immediately makes a greedy policy optimal in that one cell, since all other action values stay at zero while moving up from the state just below the goal now has a non-zero value. All the other states visited during the first episode have not been updated, their Q-values stay at zero, and the behaviour there stays random, assuming we break ties at random; only that very last state, as shown in the maze on the bottom left, gets updated to the correct policy. The maze on the bottom right instead shows the state of the Q-value estimates when we run the same algorithm with fifty planning steps, again freezing it immediately after just one episode. Suddenly the Q-values have been updated throughout the maze: it is not just the final state that has figured out the optimal action, but the Q-values updated during this first episode indicate a meaningful direction in many other states, and the information about how to move towards the goal has propagated almost all the way back to the start, which is quite impressive.
The reason is that once that last cell has been updated, each real step is followed by fifty more updates on transitions imagined from the model, and on each of these planning updates we can propagate information from the newly updated Q-values backwards through the state space. In this case it doesn't quite solve the problem in a single episode, but it takes just a couple more episodes to propagate all the information through the state space and figure out the optimal policy.

Of course, this is not all roses: there are critical issues with model-based RL, including the fact that in some cases our model will simply be wrong, and with this experiment I want to give you some intuition for how wrong models can affect learning in a model-based RL system. To gain that intuition, let's use the same algorithm and investigate how an incorrect model affects training, simulating the model being wrong through a sudden change in the dynamics of the problem: the model learned by the agent in the first half of training will suddenly be wrong because the world itself has changed. The effect on learning, however, will be quite different depending on the type of error; let's see what I mean by this. In the first setting, at the beginning of training the environment looks like the grid on the left, where there is a short path between the start state, marked S as usual, and the goal, marked G, in the top right corner. Once the Dyna-Q agent has learned this policy correctly, which is shown in the plot by the cumulative reward becoming linear in the number of steps, we change the environment to the one on the right: a very similar grid, but now the optimal policy is no longer to move to the right to reach the goal; instead the agent needs to take a detour to circumnavigate the wall that is now blocking the path. When the environment changes, the cumulative reward collected by Dyna-Q suddenly flattens, because at first the agent keeps trying to move along the path it had learned, which is now blocked by the shifted wall, and for a while it is unable to collect any reward. But if you look at the Dyna-Q curve, after a while the agent does start collecting rewards again, because it updates the model to correctly reflect that it can no longer move through that cell, and through planning it re-routes its policy by updating the Q-values to take a different route, so the reward again grows linearly with the number of steps. There are several agents on this plot, which is again taken from the Sutton and Barto book, corresponding to slightly different instantiations of Dyna: the middle one is Dyna-Q, the canonical version I described, while Dyna-Q+ adds a small exploration bonus, which lets it re-route slightly faster. The core message, though, is that when the model is inaccurate in this way, where the world is actually harder than the model initially assumed, the agent's behaviour naturally generates the real transitions needed to correct the model error, and the learning dynamics are therefore overall quite benign.
It is good to realize that this is not always the case. Let's now consider a very similar setting with a slight difference: again the world changes, but this time the change makes the environment easier in some sense. At the beginning we have the situation on the left, where the agent needs to take a detour around the wall to reach the goal; after the change, the detour is still available, but another path has opened up to the right of the wall, a slightly shorter path that would result in higher rewards. What happens is that once the Dyna-Q agent has learned to take the long detour around the wall, nothing pushes it to explore the other path. The way in which the model is now incorrect, the fact that a shorter path exists, is not reflected in the data the agent sees, because this model error does not produce data that can correct the model itself. So once the agent has converged on the long detour, it may simply stick with this behaviour and take a very long time to realize there was an error; this is why in this plot Dyna-Q, exactly the algorithm I described in the pseudocode before, keeps learning at the same speed, with the same slope, because it keeps taking the same long path to the goal even after the world has become easier. You can see that Dyna-Q+, which adds an exploration bonus, is able to discover the additional path. I don't want to go into much detail about Dyna-Q+, because the details are not important here; the important thing to understand is that thinking deeply about exploration is still important even in model-based RL, because certain types of model error will not be discovered unless something pushes the agent to execute the behaviour that produces the data needed to correct them.
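For completeness, and taking the formulation from Sutton and Barto rather than from anything derived in this lecture, the Dyna-Q+ bonus simply adds to the reward used during planning a term that grows with the time $\tau$ since a state-action pair was last tried for real, $r + \kappa\sqrt{\tau}$ for some small $\kappa$, which is what eventually nudges the agent to re-test transitions its model may have gotten wrong.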
The Dyna algorithm described in the previous section is a beautiful instantiation of a system that integrates learning and planning, model-free and model-based updates, but it is not the only one, and in this section I want to discuss experience replay and its connection to more conventional model-based algorithms such as Dyna. Traditionally, RL algorithms did not explicitly store experience, so it was trivial to place them in one of two groups: model-free methods, which do not attempt to explicitly model the dynamics of the environment and directly update values and policies online with the latest data, and model-based methods, which learn a transition and reward model and then plan value functions using that model. This sharp distinction is blurred not just by algorithms like Dyna that do both, but also by other developments in the space of model-free algorithms: many modern RL systems store transitions in an experience replay buffer and then apply model-free updates not just to the latest transition but also to past transitions in the buffer. This is a fairly common technique for model-free agents, but it can also be seen as instantiating Dyna, with experience replay playing the role of a non-parametric model. And this is not just a superficial similarity, because both systems share a fundamental property: the ability to sink in compute, as I have said many times, to improve values and policies with no additional environment interactions. The experience replay system does so by making many updates on past experience for every new real step in the environment.
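To make the parallel concrete, here is a minimal sketch (interface and names are mine) of Q-learning with an experience replay buffer playing exactly the role the model plays in the Dyna-Q sketch above: each real transition triggers one direct update plus n extra updates on stored past transitions.

```python
import random

def replay_q_learning_step(q, buffer, transition, actions, n_replay=5, alpha=0.1, gamma=0.95):
    """One real step plus n replayed updates.

    q: mapping (state, action) -> value, e.g. a collections.defaultdict(float);
    buffer: list of past transitions; transition: (s, a, r, s_next, done).
    """
    def q_update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    # Direct model-free update from the latest real transition, then store it.
    q_update(*transition)
    buffer.append(transition)

    # "Planning" updates: replay stored past transitions instead of querying a learned model.
    for _ in range(n_replay):
        q_update(*random.choice(buffer))
```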
The plot on this slide shows how this scalability property of planning algorithms is instantiated by both Dyna and a replay-based Q-learning agent on a simple grid world. In this grid world you start, as usual, in the state S and need to reach the goal G as fast as possible, because there is a discount strictly smaller than one. On the right I show the total number of steps required to reach the goal in the top right 25 times, starting each time from the cell S on the left, as a function of the amount of planning, that is, the number of extra updates performed for each real step in the environment. You can see that regardless of whether an entire stored transition is replayed, as in an experience replay system, or a model is used to infer the subsequent reward and state given a state-action pair, the profile looks very similar: the more additional updates per real step, the more performance improves and the faster the goal is reached, which shows up as a lower total number of steps required to reach the goal 25 times. You might even be tempted to take this to the extreme: are experience replay systems and algorithms based on learned parametric models completely equivalent? In the tabular setting we can actually derive exact equivalences between certain model-based and model-free algorithms, and more generally, somewhat trivially, if we had a perfect model that outputs exactly the same reward and successor state as the non-parametric replay system, the two would be equivalent by construction. In general, though, any model we use will not be perfect, so one question you may ask is whether it could ever be better: could the inaccuracies that almost certainly come from using a parametric model provide benefits on top of a non-parametric replay system? This seems unlikely, and I think it is fair to say that if we only use the model to reconstruct the rewards and successor states corresponding to observed state-action pairs, it is hard for a parametric model to improve over simply replaying those transitions.

The reason algorithms like Dyna, and model-based RL with learned parametric models in general, are so important is that a parametric model allows flexibility that goes way beyond that. For instance, we can use the model to plan for action selection, something we will discuss in detail in the next section: if you can query a parametric model for actions you might want to take in the future, you can use it to construct a plan and decide how to act, which you cannot do with a non-parametric replay system that can only tell you which action you took in the past in a given state. With a parametric model you can also do counterfactual planning, querying the model for actions you could have taken in the past but did not. Related to counterfactual planning is what is called backward planning: if, instead of modelling the actual dynamics of the problem, you model the inverse dynamics, so that given a state and an action the model predicts the previous reward and previous state, then you can assign credit to states that could have led to a certain outcome, in addition to assigning credit, as in standard RL, to the states that actually preceded that outcome. Finally, with a parametric model you can train at different time scales: an algorithm like Dyna need not be restricted to models of exact one-step transitions on the native time scale of the environment; you can train a model to make jumpy predictions and therefore support what is called jumpy planning. In conclusion, it is also worth noting that there are computational differences between a parametric model and experience replay, and these may play an important role in choosing between the two. Querying a replay buffer is very cheap: given a state-action pair, it immediately gives you the reward and successor state observed in the past, whereas generating a subsequent reward and state with a learned model can be expensive if the model is parameterized by a big neural network. If you look at memory, things change: the memory requirements of a replay buffer can be quite harsh, because memory scales linearly with its capacity, while a parametric model can achieve good accuracy with a fixed and comparably smaller memory footprint. Overall, the key takeaway is that both are important and powerful ideas that implement the core principle of planning, the capability of sinking in compute to improve our learning algorithms, and regardless of the labels we attach to them, it is more important to think deeply about the problem you want to solve and whether a parametric or a non-parametric model is the best fit for that specific problem.

So far we discussed how planning can be used to improve our estimates of a global value function or a global policy, applicable in all states of our MDP. Now I want to tell you about a different form of planning, where we sink in compute, as usual without requiring additional interactions with the environment, but for the purpose of selecting a single action in a single state. This is sometimes called planning in the now. It may seem a special case of the previous problem, and in a sense it is, because if we could get perfect values and policies everywhere we could just use those in any one state; but the motivation for investigating planning for action selection is that it is sometimes easier to build a very accurate local value function than to fit a global value function. The reason is that when planning a value function for the current state we only need to take into account the distribution of states reachable from the current state, which might be a very small portion of the overall MDP. Planning for action selection also has a few other appealing properties: if your model is inaccurate in some states, that will only affect the local, throwaway values you are using to select an action right now, but it won't pollute a shared global value function that is going to be reused everywhere. It might result in selecting a sub-optimal action in certain poorly understood states, but perhaps that just leads to reasonable exploration, and to behaviour that can help you correct the model itself.
The simplest form of planning for the purpose of action selection is what is called forward search. This approach selects the best action in a given state by building the entire search tree that has the current state as its root and follows all possible trajectories from the current state onwards until episode termination; it amounts to representing, as a tree, the entire sub-MDP of states reachable from the current one. In some cases this sub-MDP might be fairly tiny, and then you can perhaps solve it in full every time you need to select an action. In general, of course, this will not be the case: the number of states in the tree grows exponentially with the depth, so even with a fairly small action space forward search can be computationally intractable if the horizon you need to look ahead goes beyond a handful of steps. It is still a reasonable thing to consider from a conceptual perspective, because sure, we have a problem with branching, but we have dealt with branching before: just as we did for learning global value functions, we can again use sampling to solve these local MDPs for the purpose of action selection, which results in what is called simulation-based search. Here we construct a partial tree: we start, as usual, from the current state as the root, but build only a subset of the full tree by simulating multiple episodes of experience that all start from the current state, using the model to roll forward possible sequences of actions. Given such a partial tree, we can simply apply model-free RL to the simulated episodes to estimate the values at the root. For instance, we could instantiate simulation-based search using Monte Carlo prediction to estimate the state value of the root node with just two components, a learned model and a simulation policy: sample k episodes forward using the model and the policy, and average the returns to estimate the state value of the root. Of course, if we are interested in planning for action selection, the state value of the root might not be that useful, so often we will want to construct action values instead. The same principle applies quite naively to local action value functions: we now need k times the number of actions episodes, to ensure each action is sampled k times at the root, but apart from that we use the same mechanism, generating complete episodes from the model and the simulation policy and estimating the value of each action in the root as the average of the returns of the episodes that executed that action first and then followed the simulation policy. Once we have such a local action value function, it is trivial to turn it into a mechanism for action selection: in each state we can, for instance, always pick the action with the highest value according to our local search.
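A minimal sketch of this procedure, assuming a generative model with a step(state, action) -> (reward, next_state, done) interface and a fixed simulation policy (both names are mine): roll out k complete episodes per root action and average the returns.

```python
def simulation_based_search(root_state, actions, model, sim_policy, k=100, gamma=1.0):
    """Estimate local action values at the root by Monte Carlo rollouts, then act greedily."""
    def rollout_return(state, action):
        # One simulated episode: take `action` first, then follow the fixed simulation policy.
        total, discount = 0.0, 1.0
        while True:
            reward, state, done = model.step(state, action)
            total += discount * reward
            if done:
                return total
            discount *= gamma
            action = sim_policy(state)

    q_root = {a: sum(rollout_return(root_state, a) for _ in range(k)) / k for a in actions}
    return max(q_root, key=q_root.get)   # pick the action with the highest local value
```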
In the simulation-based search algorithm I just described, each simulation is independent of the previous ones. This means we might not be making the best use of the available compute: it has some computational advantages, because it is fully parallelizable, but we are not using what we learn by rolling the model forward to the end of an episode to guide our behaviour in the next simulation. In this second part I want to discuss a different approach, where we build the tree incrementally so that we can leverage knowledge from previous simulations to focus our computational resources where they matter most. The algorithm is called Monte Carlo Tree Search, and it was at the heart of the famous AlphaGo and AlphaZero systems, and more recently of MuZero, which showed that it could learn to play games like Go, chess, shogi and even video games at the level of world champions. The algorithm is fairly simple, because it just repeats the following four steps until you have exhausted the computational time or resources allocated to selecting the action. For each simulated episode, you start from the root and use your current estimates of the Q-values, based on the previous simulations, to walk all the way from the root to a leaf node of the tree; you then add a node to the tree by expanding the action with the highest value at that leaf; you then roll out until episode termination with a fixed simulation policy to get a complete evaluation of that path in the tree; and finally you walk backwards all the way to the root, updating the Q-values of all the ancestor nodes in the tree. Once you have exhausted the available time and resources, you select, just as in the previous algorithm, the action at the root with the highest value. The important feature of this approach is that we effectively have two policies. We are not always expanding nodes and building the tree with a fixed simulation policy; instead we have a tree policy, used to navigate from the root to a leaf, that is based on all the previous simulations, so it is guided by everything we have learned so far in this state, and a rollout policy, which is indeed fixed, used to get the Monte Carlo estimate of the return from the leaf onwards. The advantage is that the rollout policy can be very cheap, for instance even a random policy can be surprisingly effective, but because we keep iterating on the tree policy used to navigate to the leaf, the system allocates resources and compute quite effectively and can produce very good value estimates.
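A minimal sketch of these four steps, under a few simplifying assumptions of mine: the model exposes step and is_terminal, the tree policy is purely greedy over current value estimates (practical implementations usually add an exploration term such as UCT), the very first simulation already expands a child of the root rather than evaluating the bare root as in the walkthrough below, and rewards collected along the already-built part of the tree are ignored, which is harmless when, as in the example, the return only arrives at episode termination.

```python
import random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # running mean of the returns of simulations through this node

def mcts(root_state, actions, model, rollout_policy, num_simulations=100):
    root = Node(root_state)

    def rollout(state):
        # Monte Carlo evaluation: follow the fixed rollout policy in the model until termination.
        ret, done = 0.0, False
        while not done:
            r, state, done = model.step(state, rollout_policy(state))
            ret += r
        return ret

    for _ in range(num_simulations):
        node, path = root, [root]
        # 1) Selection: while the node is fully expanded, follow the highest-valued child.
        while node.children and len(node.children) == len(actions):
            a = max(node.children, key=lambda b: node.children[b].value)
            node = node.children[a]
            path.append(node)
        # 2) Expansion: add one previously untried action at this leaf.
        ret = 0.0
        if not model.is_terminal(node.state):
            a = random.choice([b for b in actions if b not in node.children])
            r, s_next, done = model.step(node.state, a)
            child = Node(s_next)
            node.children[a] = child
            path.append(child)
            # 3) Evaluation: immediate reward plus a rollout from the newly added leaf.
            ret = r + (0.0 if done else rollout(s_next))
        # 4) Backup: update the running mean return of every node on the path to the root.
        for n in path:
            n.visits += 1
            n.value += (ret - n.value) / n.visits

    # When the budget is exhausted, act greedily with respect to the root's action values.
    return max(root.children, key=lambda a: root.children[a].value)
```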
Of course, at this stage it might still feel a bit fuzzy, so as usual let's walk through a concrete example to understand how the algorithm works in practice. Consider a situation where we want to select an action in some state, and there are always two actions available in every state. At the beginning the tree is empty, so the initial state is itself a leaf, and we simply use the default policy to roll out in the model until it tells us the episode has terminated; suppose this gives us a return of one. We then update the value of the root node to one, since that is the average of the existing simulations. Next we expand one node: having no additional knowledge, say we simply pick the right action at random. We have again reached a leaf, so we use the default rollout policy to get a full-episode evaluation; this time we observe a return of zero. Now we go back and update not just the node we just expanded but all of its ancestors, which now include the root: the value of the newly expanded node is zero, and the value of the root is updated to one half. Since the value at the root is higher than the value of the action we selected, let's expand the other action: again, after reaching this leaf node we use the rollout policy to get an evaluation, and this time we observe a return of one, so we update the value of the node we just expanded and also the root. The root now has a value of two thirds, and the two actions at the root have estimates of one and zero respectively. Since the value on the left side is higher, we again navigate to a leaf, expand one node and do an evaluation; this time we observe a zero, and we back up all the updates: the newly added node has a value of zero, its parent now has a value of one half, and the root also has a value of one half, because two of the four episodes that started at the root had a return of one and the other two had no reward at all. Again we navigate to a leaf by picking the action with the highest value, expand that node, roll out until the end, and get an evaluation of one; again we update all the values of the parent nodes to include the latest information generated by this rollout. You start at the root, follow the Q-values to reach a leaf by always selecting the highest value, and then expand again.
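Recapping the arithmetic of this walkthrough: every node simply tracks the mean of the returns of the simulations that passed through it,

$$
V(n) \;=\; \frac{1}{N(n)} \sum_{i=1}^{N(n)} G_i ,
$$

so after the five simulated returns $1, 0, 1, 0, 1$ the root estimate evolves as $1 \rightarrow \tfrac{1}{2} \rightarrow \tfrac{2}{3} \rightarrow \tfrac{1}{2} \rightarrow \tfrac{3}{5}$.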
If you iterate this process, you get a highly selective best-first search in which states are evaluated dynamically and sampling is used to break the curse of dimensionality, while remaining computationally efficient because the algorithm is anytime: you can keep iterating for as long as you have computational resources and time to think, and at any point you get the best available estimate of the values at the root, given the policy and the model the agent had available. An important thing to realize, though, is that Monte Carlo Tree Search, and in fact all simulation-based search and even forward search, are essentially table lookup approaches; they just instantiate not the table of all possible states but a partial table that only includes the states that are most likely, or directly reachable, from the current state. We discussed quite extensively that for model-free RL table lookup is a bit naive: you cannot possibly store values for all states, and because you get no generalization you will likely not have trained a value or a policy for the new states you encounter when interacting with the environment. For simulation-based search, table lookup is actually less naive, because you only need to store values for the easily reachable states that are likely under the policy and under your Q-value estimates, so even without extensive generalization this can be a system that works quite well in practice. At the same time there are limits: it is still a partial instantiation of a table, and that table can grow very large, so for very big search spaces it is still useful to combine these ideas from MCTS with, for instance, value function approximation. This is what the AlphaGo system did: it used value functions and policies to guide the search, and in its later iterations it even replaced the rollouts used to get a Monte Carlo evaluation, rolling from a leaf all the way to the end of the episode, with just a learned value function, to make the process more efficient.
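In terms of the MCTS sketch above, this later change amounts to swapping the rollout in the evaluation step for a learned value function, roughly as follows; value_fn stands in for whatever function approximator has been trained, and this is a sketch of the idea rather than AlphaGo's actual architecture.

```python
def evaluate_leaf(r, s_next, done, value_fn):
    # Bootstrap on a learned value function instead of rolling out to the end of the episode.
    return r + (0.0 if done else value_fn(s_next))
```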