Creating a Connect 4 Environment with Kaggle Environments and Running MCTS Search
We create a standard Connect 4 environment by setting `env` equal to the result of `kaggle_environments.make("connectx")`. The environment also stays configurable, so we can tweak its parameters as needed.
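As a minimal sketch, assuming the `kaggle-environments` package is installed, this step looks like the following:

```python
# Minimal sketch using the kaggle-environments package;
# "connectx" is Kaggle's built-in Connect 4 environment.
import kaggle_environments

env = kaggle_environments.make("connectx")
```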
To give the environment its players, we pass them in as a list of agents. In this case we want two players, so we will create two instances of our Kaggle agent class. The agent class pre-processes the observed state and then, depending on the arguments, either runs an MCTS search or gets a prediction directly from our model.
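Based on this description, here is a hedged sketch of what such an agent wrapper could look like. The class name `KaggleAgent` and the interfaces of the `game`, `MCTS`, and model objects (including the `get_encoded_state` helper) are assumptions based on the components built earlier in the course, not a fixed API; Kaggle calls each agent with an observation and a configuration object.

```python
import numpy as np
import torch

class KaggleAgent:
    def __init__(self, model, game, args):
        self.model = model
        self.game = game
        self.args = args
        if self.args['search']:
            # MCTS is the search class from earlier in the course
            # (its constructor signature here is an assumption).
            self.mcts = MCTS(game, args, model)

    def run(self, obs, conf):
        # Kaggle encodes the two players as 1 and 2; map them to 1 and -1.
        player = 1 if obs['mark'] == 1 else -1
        state = np.array(obs['board'], dtype=np.float32).reshape(
            self.game.row_count, self.game.column_count)
        state[state == 2] = -1
        state = self.game.change_perspective(state, player)

        if self.args['search']:
            # MCTS returns the visit-count distribution over actions.
            policy = self.mcts.search(state)
        else:
            # Query the model directly for its raw policy
            # (get_encoded_state is an assumed helper from the course).
            device = next(self.model.parameters()).device
            encoded = torch.tensor(
                self.game.get_encoded_state(state), device=device).unsqueeze(0)
            with torch.no_grad():
                policy, _ = self.model(encoded)
            policy = torch.softmax(policy, dim=1).squeeze(0).cpu().numpy()

        # Mask out illegal moves and renormalize.
        valid_moves = self.game.get_valid_moves(state)
        policy = policy * valid_moves
        policy /= np.sum(policy)

        if self.args['temperature'] == 0:
            action = int(np.argmax(policy))  # greedy: take the argmax
        else:
            # Sample with a temperature to introduce randomness.
            policy = policy ** (1 / self.args['temperature'])
            policy /= np.sum(policy)
            action = int(np.random.choice(self.game.action_size, p=policy))
        return action
```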
Before creating the players, we need to define our game, model, and arguments. Our game is simply Connect 4, and we don't need to specify a player in this case. The arguments can be adjusted as needed: here we set `search` to True, set `temperature` to zero so that we always take the argmax of the MCTS distribution guided by our model, and set `epsilon` to 1 to introduce some randomness.
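As a sketch, the arguments described here might look like this; the exact key names are assumptions based on the summary above:

```python
# Evaluation arguments as described above; key names are illustrative.
args = {
    'search': True,          # run an MCTS search guided by the model
    'temperature': 0,        # 0 -> always take the argmax of the distribution
    'dirichlet_epsilon': 1,  # the "epsilon" noise parameter mentioned above
}
```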
Next, we create our device and model, loading the weights from the provided path. The model was trained previously and saved as a checkpoint, and we want our players to use this saved model.
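A minimal sketch of this step, assuming the `ResNet` class from earlier in the course (its hyperparameters and the checkpoint path are placeholders):

```python
import torch

# Pick a device and load the trained checkpoint into the network
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNet(game, num_resBlocks=9, num_hidden=128, device=device)
model.load_state_dict(torch.load("path/to/checkpoint.pt", map_location=device))
model.eval()
```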
Now that we have defined our game, model, arguments, device, and Kaggle agent class, we can create our players. We set `player_1` equal to a Kaggle agent and do the same for `player_2`. We then fill the players list with `player_1.run` followed by `player_2.run` and pass it to the environment, which runs these two agents against each other.
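Putting this together for Connect 4 (again a hedged sketch; `KaggleAgent` is the wrapper class sketched above):

```python
# Create the two players and run them against each other
player_1 = KaggleAgent(model, game, args)
player_2 = KaggleAgent(model, game, args)
players = [player_1.run, player_2.run]

env = kaggle_environments.make("connectx")
env.run(players)
env.render(mode="ipython")  # displays the game as an animation in a notebook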
The result is an animation of our models playing against each other in Connect 4; the game ends in a draw. This shows that our model can defend against its opponent's attacks and play a complete game of Connect 4.
Creating a Tic Tac Toe Environment with Kaggle Environments and Running MCTS Search
To switch to Tic Tac Toe, we simply set the `game` variable to Tic Tac Toe and use the corresponding Kaggle environment (`'tictactoe'`). We also need to adjust our arguments slightly: for example, we set `num_searches` to 100 for the ResNet, and we update the path of our model checkpoint.
Here's a sketch of how we can do this (the `TicTacToe`, `ResNet`, and `KaggleAgent` classes are the ones built earlier in the course):
```python
# Sketch of the full setup; TicTacToe, ResNet and KaggleAgent are the classes
# built earlier in the course, so their exact signatures and hyperparameters
# here are assumptions, and the checkpoint path is a placeholder.
import kaggle_environments
import torch

# Switch the game to Tic Tac Toe
game = TicTacToe()

# Define arguments
args = {
    'num_searches': 100,     # MCTS iterations per move
    'search': True,          # run MCTS instead of using the raw policy
    'temperature': 0,        # 0 -> always take the argmax of the distribution
    'dirichlet_epsilon': 1,  # exploration noise ("epsilon" above)
}

# Create device and model, loading the saved checkpoint
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNet(game, num_resBlocks=4, num_hidden=64, device=device)
model.load_state_dict(torch.load('path/to/model.pt', map_location=device))
model.eval()

# Create one Kaggle agent per player; both use the same trained model
player_1 = KaggleAgent(model, game, args)
player_2 = KaggleAgent(model, game, args)

# Run the agents against each other in Kaggle's Tic Tac Toe environment
env = kaggle_environments.make('tictactoe')
env.run([player_1.run, player_2.run])
env.render(mode='ipython')
```
Finally, we run the Kaggle environment with our two agents and render it (the `env.run` and `env.render` calls at the end of the sketch above), which produces a nice animation of our models playing Tic Tac Toe against each other, making their moves on the board.
This shows that our model can defend against all possible attacks and play a game of Tic Tac Toe to a draw. We can also adjust the arguments as needed, for example setting `search` to False so that our models play directly from the raw policy without running an MCTS search.
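For example, using the `args` dictionary from the sketch above:

```python
# Skip the tree search and act directly on the network's raw policy
args['search'] = False
```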
"WEBVTTKind: captionsLanguage: enAlpha zero is a game playing algorithm that uses artificial intelligence and machine learning techniques to learn to play board games at superhuman levels in this machine learning course from Robert Forester you will learn how to create Alpha Zero from scratch thank you for tuning in in this video we are going to rebuild Alpha zero completely from scratch using Python and The Deep learning framework pytorch Alpha zero was initially developed by Deep minds and it is able to achieve magnificent performance in extremely complex board games such as go where the amount of legal board positions is actually significantly higher than the amount of atoms than in our universe so not only is this very impressive just from an AI standpoint on but additionally to that the machine Learning System of alpha zero learns all of this information just by playing with itself and moreover the various domains in which this algorithm can be used are almost Limitless so not only can this algorithm also play chess and shogi in a very impressive way but the recent Alpha tensor paper that was also published by Deep Mind showed that Alpha zero can even be used to invent novel algorithms inside of mathematics so together we will first of all understand how this algorithm works and then we will build it from ground up and train and evaluate it on the games of Tik tac toe and connect for so that you will also understand how flexible Alpha Zer really is when it comes to adapting it to various domains so let's get started okay great so let's start with a brief overview of the alpha Zer algorithm so first of all it is important that we have two separate components and on one hand we have this self play part right here and yeah during this phase our Alpha zero model basically plays with itself in order to gather some information about the game and while we play with ourselves we also generate data and this data will then be used in The Next Step during training so in this training phase right here which is the second component we like to optimize our model based on the information it gained while playing with itself and then next we basically want to fulfill the cycle and then use this optimized model um to play with itself again so the idea is essentially to repeat the cycle and number of times until we have yeah reached a Nur Network so a model right here that is capable of playing uh a certain game better than any human could because it just play with itself used The Information Gain to optimize uh yeah itself and then play with itself again basically um and then we just like to repeat that so in between these two components we have this alha Zer model right here which is just a n network and now let's look at this n network more closely so let's see how this actually is architectured so basically we take the state SS input and for the case of tikto for example the state would just be board position um so just turn to numbers and then we once we have the state has input we receive a policy and the value as output so the policy here is this distribution and for each action basically it will tell us how promising this action would be based on this state we have received here input so it will basically tell us where to play based on this state right here so this is the policy and then the value is just a float and it will basically be a measure telling us how promising the state itself is for us as a player to be in so maybe a concrete example that make might make all of this easier to 
understand uh is something we have down below here so here for the state we just take in this position and remember that we are player X in this case and yeah if you're familiar with the roads of titico you should see that we should play down here because um yeah then we would have three exis on the lowest row and yeah we would win the game so if our model also um understands how to play this game then we would like to receive a policy looking some word like this where basically the highest bar right here is also this position uh down below uh that would basically help us to win the game so yeah this is how the policies should look like and for the value uh we can see that this St is quite nice for us as a player because we can win here so we would like to have a very high value so yeah in this case the value is just float in the range of negative 1 and positive one so we would like to have a value that is close to positive one because this state is quite optimal okay now we uh yeah looked at how the general architecture is built and how the model uh is built but next we want to understand the safe play and the training part more in depth so before we do that you should also understand what a monal researches because that is vital for selfplay and the general algorithm so the monteal research is this search algorithm right here and it will take a state on our case a Bo position as input and then it will find the action that looked most promising to us and this algorithm gets this information by just first of all declaring a root note at the initial state right here and then by building up this tree into the future so just like we can see right here so basically the idea is that we um create these notes into the future uh right here and we get these notes by just taking actions that lead us to the Future so this note right here for example example is just the place we arrive at when we take this action one uh based on this initial State at our root note so each note first of all stores a state like S1 in this case and um so S3 right here would just be the state we get when we first of all play action one based on this initial State and then when we would play action three based on this state right here so playing these actions um after each other basically then additionally to the state each note also um stores this variable W and W is just the total number of wins that we achieved when we played in this direction into the future so basically when we took action one initially and arrived at this note we were able to win three times when we walked into the future even further so next to this variable we also have this n variable right here and N is just the total visit count so basically the total number of times that we went into this direction so here we can say that we um took action one four times and out of these four times we were able to win three times right so after we have um basically built up a tree like this um we can then find the action that looks most promising to us and the way we do this is by looking at the children of our root node so these two nodes right here and then for each node we can calculate the winning ratio so here we can say that this node won three out of four times where this node one one out of two times then we can say that you know what this node right here has a higher winning ratio and thus it looks more promising to to us and because of that we can also say that action one looks more promising than action two because this action was the one that even 
led us to this note here in the first place so basically the result of this tree right here that we got at this time would be that action one U should be the action we we would like to play then we could just use this information to uh say you know what in our actual game of tic tac to for example we now want to use this action right here so this is great but now the question becomes how can we even build up a tree like this and also how do we get the information whether we win at this given position or not right so the way we come up with all of uh this data right here is basically by working down uh all of these four steps for a given number of iterations so at the beginning we like we are in our selection phase and uh during selection we basically walk down our tree but we walk down our tree until we have reached a so-called Leaf node so a leaf note is a note that could be expanded even further um into the future so uh one example of a leaf note would be this note right here because you can see that we have only one child is this note but actually we could also expand in this direction if we say that for each note there are always two possible ways in which you could expand in this example right here so this is a leaf note and then also this would be a leaf note right here because this note doesn't have any child right and then also we can say that this note also is the Lea note because it only has one sh right here but we could still expand it in this direction so yeah these are Leaf notes for example while these aren't any Leaf notes because yeah this one already has two children and this one also has two children already so yeah this is basically how selection looks like but now the question also becomes um when we walk down uh which direction we should pick right so when we just start at the top we now have to decide whether we want to walk down in this direction or whether we want to walk down in this direction right here basically the way of finding out the direction we choose when walking down is by taking the direction or the child right here that has the highest UCB formula like this so while this might look a bit complex um in reality we just take the child that um tends to win uh more often while also trying to uh take a child that has has been visited just a few number of times relatively right so on one hand uh we when we maximiz this formula we want to take the child with the highest winning ratio and also um want to take the child that has relatively few visits compared to this large n uh which is just the total number of visits for the parent so if we would calculate the UCB formula for both of these nodes right here we can see that the UCB formula here is higher because um this note has been visited uh less often right and because of that we would like to walk in this direction right here okay so during selection we can now say that we walk down in this direction right here and now we have arrived at this note so first of all we have to check if we work down walk down further or if we have in fact reached the leaf note and now we can say see that we have reached a leaf note right here right because we could still expand in this direction so we could still have a child right here and because of that we actually skip to the next phase of the MCTS and that is the expansion phase so right here um in expansion we want to create a new node and depended to our tree so I will just create this node right here and I will call it I give it the state as seven then we can say that 
we took the action A7 to get to this newly created note right here and at the beginning uh this note should have a winning count of zero and also a visit count of zero so I'm just going to declare W7 right here and also N7 like this and both of them should just be equal uh to zero at the beginning right because we have just created uh this note and eded it to our tree but we have never been there actually so now we can say that we further work down here um to the note we have just created and um so now we have also finished this expansion phase right here and next uh we have arrived at this simulation phase and here the idea is basically to play randomly into the future and until we have reached a state uh where the game is over right so we want to play into the future until the game is finished and basically here we can just say we play using random actions so we just play randomly and finally we have just reached this node right here and yeah here the game is over so this is uh the simulation phase and now uh when we have reached this terminal node basically we then have to check whether we have won uh the game or we have lost or a draw has ured so in this concrete uh case we actually have one and with that we can also finish the simulation phase so just finally now uh we have arrived at the back propagation phase and here we like to get the this information from our simulation and then we want to give this information all the way up to our parents uh until we have reached the root note so here because we have won the game actually and during back propagation we can um first of all higher the number of wins for this note right here so we can say that we have one one time actually then we can also hire the total number of visits so here we have have visited this node one time and we have won in this case and because we want to back propagate all the way up we can now also say that here for this note um the number of wins now is two so the number of wins we got when we walk in this direction right and then also the total number of visits should now be three because we have yeah visited it when we selected inside yeah in this direction and now we can also back propagate all the way to our roote and here we can just say that the general visit count is seven and the total number of wins would be five right okay so now this would be a whole iteration of this MCTS process and um yeah then if we would again um choose an action we would just uh check the children of our rout again and again could just choose the action based on the winning um yeah amount of winds right here so this is great uh let's also just have an example where we start from all the beginning so that you yeah might find it easier to understand the search algorithm as well so again we want to perform these steps right here so at the beginning we are just in our selection phase so we want to walk down until we have reached the leaf node but here you can see that this root node already is a leaf node right so basically we can now move on to expansion and create a new node right here so let's say that we just create this Noe right here by taking action one and then we will arrive at State one here as well and then we can say that at this um note that we have just created the total number of wins right now should just be zero and also the visit count should just be zero um so we have W1 and N1 just created right here and yeah both of them can just equal zero like this so now we have also finished the expansion part for the first 
iteration and next comes the simulation part right so we have walked this way basically right here and now we want to simulate into the future just by taking these random actions so let's write this right here and now we have reached a terminate node and now we have to check whether we have won or lost or a draw is secured again so in this case we can just say that we have one again so because of that we actually can back propagate now and in back propagation we can say that the number of wins here is one with the visit count being one as well and then we can do the same for our root node right here so we can just say that the number of wins for our root note is also one and the visit count of our root note is also one so now that we have finished with this iteration I'm just going to clean everything up here and we can move on to the next iteration right so again we start at selection and we walk down until we have reached a leaf note but again we can see that our root note actually still is a leaf note so let's just expand uh this direction now and here we will just use action two and we will arrive at state two and here again we have to set the total number of wins to zero so W2 should be equal to zero so this also S2 and then the visit count so N2 should also equal Z right here so let's write N2 equal Z as well so let's do it like this and yeah now basically we walk in this direction right here and next we can do simulation again so let's simulate all the way down until we have reached a termin Noe and in this case uh we can say that we have actually lost the game so when I back propagate all the way up I will only increase the visit count right here so the physic C should now be one but the total number of wins should still stay the same uh being zero because we have lost here when we simulated so because of that when we back propagate to our root mode we should also only increase the visit count while leaving the uh total number of wins uh the way it was earlier so let's also finish up with this iteration right here and now we can move on to the next iteration so first of all we want to do selection again so we want to walk down until we have reached the leaf note and now our root node is fully expanded so we have to calculate the UCB score for each of our children right here and then we have to select the child that has the highest UCB score right so when we calculate the UCB score for both of them we can see that the UCD score here is higher because we have won the game in this case so we just walk down this direction right here and now uh basically we can move on to expansion so let's expand in this direction here so let's create this new node right here and append it to our tree and we took action three actually when we um expanded in this direction and then the State here should also be S3 because of that and now again we can declare this total number of wi W so W3 and the visit count N3 and both of them should just equal zero at the beginning so let's set them to zero and now again we can perform this um basically can perform Sim simulation again so we walk down in this direction right here and yeah so now let's do simulation and when we do that um we just arrive at this terminal note again and again we can now see that in this case we have just lost uh the game after just uh playing randomly and because that when we back propagate we only want to increase the visit count right here while not changing the total number of wins so let's set the visit count to two years well and 
and then we can set the visit count to three okay so this is this iteration right here great um so now we can just Al clean all of this up right here and then we can move on so next uh when we want to keep doing this process right here we again have to do selection right so now we have to again calculate the UCB formula for both of these values right here and here we can see that the UCB formula actually gives us a higher result for this note just because of the visit count that is lower right here so when we now walk in this direction we again have to check if we have reached the leaf not or not so here we have just reached a leaf note so because of that we move to expansion and create this new node right here and this should just be equal to S4 then so we can say that we took Action Four right here and then again the W and the this account um should be set to zero sorry color um so W4 and and in for here being our number of wins and our visit count and let's set them to zero l so now we also walk in this direction right here and again after um expansion we can now move to simulation so when move here um the game is terminal and in this case we can just say that draw is ured and you know what in case of a draw we just would like to add 0.5 to the total number of wins because uh no player has won here or no one has won so um because of that when we back propagate here first of all we can increase the visit count and then let's set the number of wins to 0.1 and then up here we can set the visit count to two and again the total number of wins should be 0.5 and when we back propagate all the way up to our root node then we have the total number of visits as being four and our um total number of winds should be 1.5 so this is just a an example of how the multicol research might actually look when you just start at the beginning and actually you can repeat uh this cycle for just a set number of iterations and you uh set this number manually at the beginning and alternatively you could also set a time and during this time you would just perform as many iterations as you can but in this case for example you could just stop it for iteration ations but in practice you might run for thousands of iterations okay so now we can also look at the way how our monal research changes when we adapted to this General Alpha Zer algorithm so there are two key changes that have been made here so the first thing is that we also want to incorporate the policy that was yeah gained from our model into the search process right and especially we want to add it to the s ction face right here and we can do this by incorporating it into this updated UCB formula so this way you see this P of I part right here so basically when we select a child and we want to yeah take the yeah CH with the highest UCD formula then we will also look at the policy that was assigned to it from its parent perspective so remember that the policy is just this distribution of likelihoods and basically for each child when we expand it we will also store this policy likelihood at the given position um here for the notes as well and because of that we then also tend to select children more often that were assigned a high policy by its parent right so this way our model can guide us through the selection phase inside of our multical research so yeah this is the first key change here uh so just generally this updated UCB formula right here and then for the next change we also want to use the information of the value that we got from our Nur Network 
and we can use this information by first of all completely getting rid of this simulation phase yeah right here so we don't want to do these random roll outs into the future anymore until we have reached a terminal state but rather we will just use the value that we got from our Nur Network when it basically evaluated a certain State and this value will then be used for back propagation right so this way we use use both the policy for our selection and the value for our back propagation of our Nur Network and because of that uh we know that our Monti car research will improve drastically when we also have a model that understands how to play the game so this way at the end uh we then yeah have a better search with a better model that can even create a much better model right so we can this uh yeah keep the cycle up this way okay so there's also just a small change uh here as well uh just a minor one and remember that when we get policy from our um Nur Network we immediately get this distribution right with the probabilities for each potential child and because of that it's much more convenient now to expand in all POS possible directions during the expansion phase so we won't only create one new note but rather all of the possible notes uh that we can yeah expand on right so this just a minor change and so in order to make this easier to understand we can just do some iterations here on the Whiteboard together so yeah this way it might be easier to fully understand how this multicult research is adapted so first you might also see here that we also store this policy information here instead of our note and yeah this would just be the probability that was assigned to it from its parent right and for the root node we just yeah have no policy probability because we also have no parent but for other notes we will also store this policy here instead of our notes okay so let's start with iteration here so first of all we want to walk down until we have reached the leave mode and here we can already see that our root node is in fact the Lea note because we can stay expanded uh in these two directions here so because of that we will then yeah move to the next step right here which is the expansion phase and here we will actually create our new nodes so now when we expand we also want to get a policy and the value from our Nur Network so because of that I will first of all write p and then value equals just calling our Nur Network it's the state of our root no and let's say this policy right here will just be 0.6 for the first CH and then maybe 0.4 for the second CH and let's say that this value so the estim of how good this state of our roote is is just 0.4 here okay so now when we expand we will create this these two new nodes right here so these two and here we have action one right and here we have action two so because of that we also have state one here and we have state two right here so now we can also add the number of wins the visit count and the policy uh information to our notes right here so first of all I will create this W1 and set that equal to zero then I will create this n one and set that also equal to zero and now we have this policy probability one and this should be equal to the policy for action one and here the policy for action one is just 0.6 so yeah the policy see here should be 0.6 because of that so now we can also add the information here to the other child so let's first the number of Wis or the value later on here zero way for W2 let's set N2 to Z as well and the 
policy probability for the this child right here should now be equal to 0.4 right um like this okay so now we have expanded here and we skip this step here so now we move to the back propagation step and here we just want to back propagate this value right here so we will just do this by um yeah just adding it right here so basically in this case it doesn't matter much for our root note but maybe it's some nice information so we can just add 0.4 here and then because of that also change the visit count to one minut okay so this was one iteration and now we can move on to the next iteration so I will just clean this up okay so now again we want to walk down until we reach a leaf note so now we have to decide between those two children right and for each we have to calculate the UC B formula and then also uh we will just walk down in the direction where the U UCB formula is the highest right so when we calculate this UCB formula we will see that no longer um can we basically get the winning yeah um winning probability here because we have a visit card that is equal to zero So currently the way this is implemented we get we would get a yeah division by zero error so here we have to make sure that we will only check for the winning probability if we have a visit count that is larger than zero so because of that I will just set that in Brackets right here so we will only check for this if we have a visit count that is larger than zero and in all other cases we will just mask it out so we will just check for this part right here okay so now we can find out the child where this value is the highest and we can see that we will just pick this ch right here because it has the higher policy probability and because of that it's UCB formula will also be higher so because of that we would just walk in this direction right here so now we have reached a be note here and we would like to expand this note here so let's expand in these two directions so first of all this should be action three and this should be Action Four right here so when we create these new States we have state three and we have state four like this okay so now again U we also have to calculate a new policy and a new value so we will get this policy and this value by calling our Nur Network f with this State here as so S1 because yeah this is the position we have walked down to and from here we want to find out what the policy would be for these children and also what the value would be for this notes directly so let's say that the policy here is just 05 and 0.5 for both and let's say that this value right here is just 0.1 so now um first of all we can also add this information here to our notes again so let's first of all set W4 equal to Z then we can say N4 equal to Z and then policy probability for should be equal to 0.5 and then here we also have your W3 which should just be zero and we have N3 that should also be zero and then we have also our policy probability three here and this should be equal to 0.5 you can read that here okay so yeah now we have expanded these two children right here and yeah created them and Stor the policy information here inide them so now we want to back propagate again so here we will just use this value of 0.1 so actually we will add 0.1 here and we will raas this visit count by one so let's change that one right here and remember that we want to back propagate all the way up to our root note so here we will change the value so w0 here to 0.5 so also add 0.1 to it and then we will just erase this number 
visit by two is by one as well so it will be two afterwards so yeah this is just this updated tree after just two iterations but yeah again if we would repeat this we would just first of all walk down until we have reach reached the leaf note then we would expand and use the policy um here when we store the policy information uh in the new children and then we would just use the value for back propagating up all the way in our tree to our root mode right so yeah these are just the small changes that have been added to our multi research great so now that you know what a multicol research is and how we adapt our multicol research in during our Alpha zero algorithm we can actually move to the self play part right here so basically during self play we just yeah play uh games with ourselves from the beginning to end and so we start by just getting this initial state right here with this uh which is just a completely blank State and yeah we also are assigned to a player and then based on this state player um yeah position uh we perform a monticola research and yeah this multi research will then give us this distribution as a return uh which is equal to the visit count distribution of the children of our root node right and once we have this distribution we then would like to sample an action out of this distribution so in this case right here uh we for example sampled this action uh which is highlighted here because yeah it turned out to be quite promising during during our multicol research so once we have sampit this action we want to act based on it so we play here basically as player X and then uh yeah we move to the next um State and we also switch the player around so that we are now player Circle or player all and we have this state basically as input at the beginning and then again uh we want to perform this multicolor Tre search right here and we gain this MCTS distribution then we just uh yeah sample uh one action out of it uh we play based on this action and um then we can basically change the perspective again so that we are now play player X and we basically yeah continuously perform a Monti car tree search sample an action out of it play uh this action right here and then change the perspective so that we are now the other player or the opponent and we do that until we actually have reached a terminated state so once a player has won or a draw has occurred and yeah so when this is the case we want to store all of this information we gain well playing um to the training data so let's first of a look how this information is structured so basically for each uh State uh we also want to store the MCTS distribution and then we also need to find out the reward for the given State the reward is equal to the final outcome for the player uh that we are on this given state so basically in this example right here x won the game so that means that for all states when we were player X we want the final uh reward or final outcome be equal to one and in all cases when we were player o we want the reward to be negative one because we lost the game so basically that means that you know what we want this game is play X right here so we might just guess that this state also is quite promising because yeah this state led us to win the game eventually so this is why we turn change this reward to positive one here when we are play X and this also the reason why we change the reward to negative one when we are player um player o here so these um combinations of the state the MCTS distribution and the 
reward will then be stored as two pits to to our training data and then we can later use these um for training right in order to improve our model so this is great but now we have to understand uh how training works so yeah let's look at this right here so at the beginning we just take a sample from our training data and yeah uh you should know now that this sample is the state the MCTS distribution and pi and the reward Z right here then we will use the state S as the input um for our model then we will get this policy and this value um out as a return and now for training The Next Step basically is to minimize the difference between the policy p and the MCTS distribution at the given State Pi on one hand and then we also want to minimize the difference of our value V here and the uh final reward or final outcome z uh we sampled from from our trending data and the way we can minimize um the difference basically in a loss is first of all by having a mean squared error uh between uh the reward and the value here and then by also having this multitarget cross entropy loss between our MCTS distribution pi and our policy P right here then we also have some form of a true regularization at the end but yeah so essentially we want to have this loss right here and then we want to minimize the uh the loss by back propagation and this way we actually update the weights of our model Theta right here and we also thus get a model that better understands how to play the game and and that has been optimized and then we can use this optimized model to again play with itself in order to gain more information in order to train again and so on right um so this is how alpha zero structured and now we can actually get to coding so let's actually start by programming Alpha zero so first of all we're going to build everything inside of a Jupiter notebook since the interactivity might be nice for understanding the algorithm and also this might help you if you want to use Google cab for example to use quid's gpus to train alha zero more efficiently and we will start everything just by creating a simple game of tic tac toe and then we will build the monteal Tre search around it and after we have gone so far we will eventually build Alpha zero on top of the multicol research we had previously and then we will expand our portfolio to connect for as well and not only should this be easier to understand but it should also show how flexible f z really is when it comes to solving different environments or board games in this case so for tic TCT toe we want to use numai as a package so I will just import numi S&P right here and if you are wondering my version currently is this right here so let's just create a class of t t to and a simple init and we first of all want to have a row count which equals three the column count as well should also be three and then we also need an action size and the action size is just the amount of rows um multiplied with the amount of columns so s. row count times s. column count uh so that's great and now we can write a method to get our initial state so this is what we will call at the beginning and at the beginning the states just full of zeros so we will just return np. 
zeros and we AP here is the row count for the number of rows and the column count for the number of columns and next we also want a method that will give us the next state after action has been taken so we write get next state here and as input we want the previous state the action and also the player so the action itself will be just an integer between 0o and8 where zero will be the corner up left and eight will be the corner down right so actually we want to encode this action into a row and a column so that we can use it inside of our number array so the way we do this is that we Define the row is the action divided by the column count but we use an integer division here so we will get no floats and then for the column we can use the mod of the action and the column count as well so if our action would be four for example then uh our row would be for integer division by three which would just return one and our column would be for mod low 3 which is also one so we would have Row one and then column one as well as so the middle of our board which is exactly what we want so then we can set the state at the row at the given column to the player we had um as input and let's also return the state so our code is more readable so that's awesome and next we also want a method that will tell us uh what moves are actually legal so these are just uh um the moves where the field is equal to zero so let's write a method get valid moves and also give a status input and then we can just return um state. reshape 1 so we flatten out the uh State and then we can check if the state is equal to zero and this will give us a Boolean array back so true or false but it's quite helpful if we have integers here so I will turn the type to np. and8 so unsigned integers and yeah that's awesome so now we also want a method that will tell us if a player has won after he um took or he acted in a certain way so let's also write a method for that so just Dev check win and here we want the state and then action so let us first get the row again so we just say action integer division s. column card and let's also get the column here um nice so now we can also get the player uh that played in the given position and the way we do this is just that we check the state at the row at the column so so basically we turn this expression here the other way around um and actually in Tic Tac Toe there are four different ways in which you could win a game so first of all you could have three in a row like this or you could have three in a column like this or three in this diagonal or three in this diagonal right here and we want to check for all of these four conditions and if one of them turns out to be true then we will return true as well and none of them are true uh then we just return for it so let's write out all of these Expressions right here so first of all we can just check if there are three in a row so um we will use np. sum of the state of the given row and all of the columns inside of that row and we want to check if that sum right here is equal to the player times the column count uh yeah and then we can also check whether there are three in a column like this so we just use np. sum of state of all of the rows for a given column and then again we check if that equals player times. column count or row count at this case I mean let's be more flexible here and that's great so now we want to check for this diagonal and the way we can do this is by using the np. 
DI method so I pulled up the documentation right here and if we would have an array like this which we also want to have for Tic Tac top and then we can use this di method right here to get um the values from this diagonal so let's use this to also check the uh someone has one so we use np. sum of np. di of slate and we'll check if that's equal to player times se. row count or se. column count in this case it doesn't matter because you can only get this full diagonal if row count and column count is the same and next we also want to get this opposite diagonal right here and there is no way to do it with the normal di method but we can just flip our state and then use this old di method again and since our state is flipped it would be as if our DIC itself is flipped if that makes sense so we can just use all np. sum of np. DIC of np. flip of the state let's just say x is equal zero so we flip like this and then we can take the opposite di which is what we want so I also check if that equals player time sa. R so awesome um and next we also want to check if the if there has been a draw so um if the game terminated in any other way um since this might also be a scenario where we want to stop the game so let's just write a new method get value and terminated and we want a state and in action and first of all we will check if uh check win is true in this case so if check win or save to check win of state of action then we just return one for the value and true for terminated and then we can also check if there has been a draw and that we know that there has been a draw if the amount of valid moves is zero so we can just take the sum of the valid moves and check if that's zero so I say if NP do sum of save. get valid moves of the given State equal zero then we just return zero for the value since no one has won the game and we will also turn return true for terminated and in all other cases the game must continue so let's just return zero and FSE awesome so now we got a working game of Tik toac to and additionally would also want to write a method to change the player since that might be different for different board games so let's just write method get opponent and yeah take take a players input and then we just return the negative player so if our initial player would be negative one we would return one and if our initial player would be one then we just return negative one so that's great and now we can test our game we build right here so I just say tick tock toe equals Tick T toe then I just say player equals one and I say State equals Tik Tac toe. get initial State awesome so then let's just bu while Loops um just say while true we'll print the state right here then we want to get our valid moves so we know where to play so I just say valid moves equals take T to do get Val moves of the state we have currently and we can also print that so I just say valid moves and you know what let's print the valid moves in a more readable perform since currently we only have zeros and on but we actually want to get the position where we can play so I just create this array right here and I say e for e in range and then we can just say Tic Tac toe. 
action size So currently we get all the indices for um the possible actions and then we want to check if the ini itself is valid so we say if valid moves of I equit one right so uh we want to get all these uh Ines and this can just be printed out like this and then we want to get an action so the action will just be an input that will be casted to an integer and let's just um write out the player like this um and get the action right here so then we want to check if the action is valid so we'll say if valid moves of action equates zero then we just say print um action not valid and we will continue so we'll ask again basically and in all other cases we want to move so we create a new get a new state and the new state will be created by calling Tik Tac toe. get next state and we want to give the Old State the action and the players input um great um so then we can check if the game has been terminated so we'll just say value is terminal equals Tic Tac toe. get value and terminate and then we want to give the state and the actions input and then we say if this terminal print the state so we know where we are and then we will check if value equates one and then we say player has one so uh just say player so one or negative one for the player and in all other cases um we can just print that there has been a draw great and then we also want to break our while loop so uh we can end the game and in all other cases if the game continues we also want to flip the player um so we just say player equals Tick Tac to.get opponent of player so nice this should be working so let's test this out so here are the valid moves so this looking nice so let's just pick zero for starters see nice okay we played here so we are player negative one now so let's just play four at play Zero we can just say eight Play 1 one for example play Zero i' say two and play negative 1 we can just say seven and nice we see here we got three negative ones inside of this column right here and thus we get the result that play negative one has one perfect so now since we have got our game of TK already we can actually build the monticolo research around it so let's just create a new s right here and then we want to have a class for our multical research so I'm just going to call that MCTS for now and then we have our inet here and we want to pass on a game so take to in this case and then also some arguments so these are just hyper parameters for our multicol research so inside of our init we can say s. game equals game and s. ARS equs ARs and then below we'll Define our search method and this is what we'll actually call when we want to uh get our MCTS distribution de so let's just write the search and for the search search we want to pass a state as input and inside here first of all we'd like to define a root Noe and then we want to do all of our search iterations so for now we can write for search in range of safe. 
ARS num searches and then for each search we'd like to do the selection phase then the expansion phase then the simulation phase and then the back propagation phase and then after we have done all of these things for the given number of searches we just want to return the visit count distribution and uh it's the distribution of visit counts for the children of our root not so let's just say return visit counts yeah at the end yeah so that's the structure we have inside of our multicol research and next we can actually define a class for a node as well so let's write class node here and then have our init again and first of all we'd like to pass on the game and the arguments from the MCTS itself and then also we want to have a state as a note and then a parent but not all nodes have parents so for the root node uh we can can just write none here as a placeholder and then we also want to have an action taken but again we have a placeholder of none um for the root n here as well so inside of our inet we can just say save. game equals game safe. arss a safe. state will equal State and then again save. parent equals parent and sa. action taken equates action taken so next we also want to store the children inside of our note so let's write s. children equals just an empty list so these are yeah the children of our note and then also we want to know in which uh ways we could further expand our Noe so for example if the root node would just be an initial State then we could expand in all nine fields for a Tic Tac to board and yeah we want to store this information here so let's just say self do expand moves equals game. get valid moves of State at the beginning then as we keep expanding here we will also remove uh these expandable States from our list here and additionally to all of these attributes we also want to store the visit count and the value sum of our notes so that we can later perform our UCB method and get the dist distribution back at the end so let's say sa. visit count equal zero at the beginning and sa. value sum should also equal zero uh when we start so yeah that's uh just the back or the beginning structure of our node and uh now we can Define our root node here so let's say root equals note and we have save. game save. ARs and then the state we took here as input and yeah parent and action can just uh stay none then inside of our search we just want to say node equates root at the beginning and next we want to do selection so from the video you might remember that we want to keep selecting downwards the tree as long as our nodes are fully expanded themselves so for doing so we should write a method here that would tell us so def is fully expanded and a node is fully expanded if there are no expandable moves so that makes sense and also if the number if there are children right so um if a note would be terminated you obviously can't select past it because the game has ended already so let's say return np. sum of s do expand a moves and that should be zero right and Len of s. 
children should be larger than zero so this way we can check whether node is fully expanded or not so during our selection phase we can just say while Noe is fully expanded and here we just want to select downwards we say node equals node do select so now we can clear this right here and next we want to write a method that will actually select down so I'm going to write that here and the way selection works is that we will Loop over all of our children as a note and for each child we will calculate the UCB score and then additionally at the end we will just pick the child that has the highest UCB score so for the beginning we can just write best child equals none since we haven't checked any of our children yet and then best UCB should just be negative mp. infinity and then we can write for child in safe. children and then UCB equal save. getet UCB of this child right here and then we can check if UCB is larger than best UCB and if that is the case we say best child should be child and best UCB should also equal youb now right and at the end we can just return best child here um so next we should actually write this method right here we use for calculating the UCB scope and yeah we want to take a child here as input and for the UCB scope we first of all have our Q value and the Q Valu should be the likelihood of winning for a given note and then at the end we have our C which is a constant that will tell us whether we want to focus on exploration or exploitation and then at the end we have a math square root and then we have a log at the top with the visit count of the parent and then below we have the visit count of the child so we want to uh first of all select notes that are promising uh from the win uh likelihood uh perceived by the parent and then also we want to select notes that haven't been visited that often compared to the total number of visits so first of all we should Define the Q value and uh that should generally be the visit uh sum of our child divided by its visit count uh sorry its value sum so the value sum of our child divided by its visit count and the way we implemented our game would actually be possible to have the negative value and generally we would like our Q value to be in between the range of 0 and one because then we could turn it into a probability so um currently it's between negative 1 and positive one so let's just add a one here and then at the end we can also divide everything up by two and now there's just one more thing we could we should think about and that is that actually the Q value right here is what um the child thinks of itself and not how it should be perceived as by the parent because uh you should know that in Tik Tac Toe the child and the parent generally are different players so as a parent we would like to select a child that itself has a very negative or very low value because as a player we would like to put our opponent in a bad situation essentially right so we want to switch this around and we do this by saying one minus uh our old Q value here so if our child has a now normalized value that is close to zero then as a parent this actually should give us a q value that is almost one because this is very good for us to uh walk down the tree in a way that our child um is in a bad uh position so now we can just return our UCB right here so I'm just going to write Q Value Plus safe. ARS of c and then that should be multiplicated with math. 
square root and we should also import meth up here so that we oh sorry like that um so that we can use it and inside of our square root we have math.log and here we have the visit count uh as a note ourselves so just save. visit count and then we can divide everything up by child. Vis account yeah so that should be it um perfect so now we actually have a working selection method here and now we can move on so now we have moved down all the selected down um the tree as far as we could go and we have reached a leaf note or a note that is a terminal one so before we would like to expand our Leaf node we still have to check whether our node we have finally selected here is a terminal one or not and we can do this by writing value is terminal equals s. game.get value and terminated and inside so we are referencing this method right here um so here we just want to write State and action uh as the input so we can take the state of O note here we have selected and then also The Last Action that was taken so just node doaction take and we still have to remember one thing and that is um that the note always gets initialized this way so the action taken was action that was taken by the parent and not the not so it was an action that was taken by the opponent from the notes perspective so if we would actually have reached a terminal note and if we would get the result that yeah a player has one here then the player who has one is the opponent and not the player of the note itself so if we want to uh read this value here from the uh noes perspective we would have to change it around so here uh ins inside of our game method we should write a new method def get opponent value and here we can just say negative value um so if um from the parents perspective we have got a one right here is a return and we would like to turn it into a negative one from the child's perspective because uh it's uh the opponent that has one and have the child itself uh yeah so let's use it right here so let's say value equals save. game. getet opponent value of the old value and we should also keep one thing in mind and that is that we're using this no. action taken method right here right but actually at the beginning we just have our simple root node and we can see that the root node is not fully expanded right because it has all possible ways we could expand it on so we will call this right here immediately with our root Noe and the way we initiated our root no um we still have action taken is none right and if we call this method right here with action taken being none um so we call this with action taken being none and we'll move to check one with action being none and then we will uh do this expression right here and this will just give an error so we have to say if action equates none just return forwards right because for the root note we always know that uh no player has won since no player has played yet so we can just uh leave it like this and perfect so now we're finished with uh all of this and next we want to check if uh this note is termined and then we could back propagate immediately and if not then we would do expansion and simulation so I'm writing now uh here if is if not is terminal then we want to do this right here and at the end we can just back propagate right okay so yeah now we can do expansion and inside of this if statement I can just write note equit node. 
From expand we just want to return the node that was created by expanding, so inside our Node class we write a new method, def expand(self). The way expanding works is that we sample one move out of the expandable moves we defined earlier, create a new state for our child by acting inside a copy of our own state, create a new node with that new state, append it to our list of children so we can reference it later inside the select method, and at the end return the child that was newly created. First we get an action by sampling from above with np.random.choice, which picks a random value out of a list or NumPy array. Inside it we use np.where, and the condition is self.expandable_moves == 1, because the way we programmed it, all entries that are one are legal moves we could still expand on; with np.where we always have to take the first element of the tuple it returns. So we first check which moves are legal, use np.where to get the indices of all legal moves, and then use np.random.choice to randomly sample one index, and that legal index is the action we take. We also have to make this move non-expandable now that we have sampled it, so we set self.expandable_moves[action] = 0. Next we create the state for our child: at the beginning we just copy our own state, and then we act inside the child's state by calling child_state = self.game.get_next_state(child_state, action, 1), passing the child state, the action, and player 1. You might have noticed that we never defined a player attribute, which seems odd because Tic Tac Toe obviously has two players, but the way we program our Monte Carlo tree search we never change the player and never give the node any information about the player. Instead we always change the state of the child so that it is perceived from the opponent's point of view, while the opponent still thinks he is player one. So the parent and the child both think they are player one, but instead of changing the player we flip the state around, turning all positive numbers into negative ones and vice versa. This way we can create a child state that effectively belongs to the other player but still thinks it is player one, which makes the game logic much simpler and even keeps the code valid for one-player games, which is very nice. So now we flip the child state around.
We write child_state = self.game.change_perspective(child_state, player=-1), because in Tic Tac Toe, whenever we create a child node we always want it to be the opponent. So let's add a new method to the game, def change_perspective(self, state, player), and for Tic Tac Toe we can simply return the state multiplied by the player. Like I said, when we change the perspective to player -1, so to the opponent, all positive ones become negative ones inside our Tic Tac Toe board and vice versa. Now our child state is ready, so we can create the child itself, which is a new node: child = Node(...), and looking at the constructor we first need the game, so self.game, then self.args, then the state, which is child_state, then the parent, which is ourselves as a node, and then the action taken, which is the action we sampled above. Then we append this child to our children with self.children.append(child), and at the end of the method we return the child we have just created. Awesome, that finishes the code for the expansion part.
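Putting the last two steps together, here is a sketch of the expand method and the perspective helper as dictated (still the vanilla, one-child-at-a-time version; the Node constructor signature is assumed from the earlier steps):

```python
# In the TicTacToe class: flipping the board flips whose turn it "looks like".
def change_perspective(self, state, player):
    return state * player

# In the Node class: expand one randomly chosen untried move.
def expand(self):
    # sample one legal, not-yet-expanded action
    action = np.random.choice(np.where(self.expandable_moves == 1)[0])
    self.expandable_moves[action] = 0

    # play the move as player 1, then flip the board so the child also "is" player 1
    child_state = self.state.copy()
    child_state = self.game.get_next_state(child_state, action, 1)
    child_state = self.game.change_perspective(child_state, player=-1)

    child = Node(self.game, self.args, child_state, self, action)
    self.children.append(child)
    return child
```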
After we have expanded we want to do simulation. From the overview you might remember that we perform rollouts: we keep taking random actions until we reach a terminal state, and then we look at the final outcome of the game, so who won, and we use that information for back-propagation. Nodes where the player of the node itself won after randomly chosen actions are more promising than nodes where the opponent won. So in the search we write value = node.simulate(), and then we write the simulate method in the Node class. First we again check whether the node that has just been expanded is terminal: value, is_terminal = self.game.get_value_and_terminated(self.state, self.action_taken), and again we flip the value to the opponent's perspective with value = self.game.get_opponent_value(value), because whenever a node "has won", it is not because of the node itself but because of its parent, which took that action. If is_terminal is true we simply return this value. In all other cases we perform the rollout: we set rollout_state = self.state.copy() and rollout_player = 1 at the beginning. Here we actually do use a player, because we need to know at the end whether the final value should be passed through get_opponent_value before we use it to back-propagate. Inside a while True loop we again sample an action like we did before: valid_moves = self.game.get_valid_moves(rollout_state), then action = np.random.choice(np.where(valid_moves == 1)[0]), taking the first element of the np.where result, and then we use that action to get the next state, rollout_state = self.game.get_next_state(rollout_state, action, rollout_player). After getting the new state we check again whether it is terminal with value, is_terminal = self.game.get_value_and_terminated(rollout_state, action). If it is, we want to return the value, but first we check which player we were when we hit the terminal state: if rollout_player == -1 we flip the value with self.game.get_opponent_value(value), so that we don't return a positive result when it was actually the opponent who won and not us as the node. If we haven't reached a terminal state yet, we flip the player around with rollout_player = self.game.get_opponent(rollout_player) and keep going. The final step is back-propagation, so in the search we call node.backpropagate(value) with the value we got either from the terminal check or from the simulation, and we write this inside the Node class as def backpropagate(self, value). When we back-propagate, we first add the value to our value sum and count our visit count up by one, so self.value_sum += value and self.visit_count += 1, and then we keep propagating up all the way to the root node. Remember that our parent is a different player than us, so we flip the value with value = self.game.get_opponent_value(value), and then, if self.parent is not None (which is always the case except for the root node, where we used None as a placeholder), we call self.parent.backpropagate(value). So we have this recursive backpropagate method, and that is everything we need for MCTS. Finally, once all searches are done and all visit counts and values have been back-propagated, we want to return the distribution of visit counts. For that we create a new variable action_probs, the probabilities of which actions look most promising, starting as np.zeros with the shape of the action size of our game, self.game.action_size, which is just nine for Tic Tac Toe. Then we loop over the children of the root, for child in root.children, and set action_probs[child.action_taken] = child.visit_count. To turn these counts into probabilities we divide them by their sum, so that everything sums to one, with np.
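A sketch of the remaining node methods and the end of the search as described above (this is still the vanilla MCTS with random rollouts; the simulation part is replaced later once the network is added):

```python
# In the Node class: random rollout from this node until the game ends.
def simulate(self):
    value, is_terminal = self.game.get_value_and_terminated(self.state, self.action_taken)
    value = self.game.get_opponent_value(value)
    if is_terminal:
        return value

    rollout_state = self.state.copy()
    rollout_player = 1
    while True:
        valid_moves = self.game.get_valid_moves(rollout_state)
        action = np.random.choice(np.where(valid_moves == 1)[0])
        rollout_state = self.game.get_next_state(rollout_state, action, rollout_player)
        value, is_terminal = self.game.get_value_and_terminated(rollout_state, action)
        if is_terminal:
            if rollout_player == -1:
                value = self.game.get_opponent_value(value)
            return value
        rollout_player = self.game.get_opponent(rollout_player)

# In the Node class: push the result up to the root, flipping perspective at every level.
def backpropagate(self, value):
    self.value_sum += value
    self.visit_count += 1
    value = self.game.get_opponent_value(value)
    if self.parent is not None:
        self.parent.backpropagate(value)

# At the end of MCTS.search: normalized visit counts of the root's children.
action_probs = np.zeros(self.game.action_size)
for child in root.children:
    action_probs[child.action_taken] = child.visit_count
action_probs /= np.sum(action_probs)
return action_probs
```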
sum of the action probabilities, and then we return them. That is looking promising, so let's test it. There was still some invalid syntax to remove, and now it runs. Inside the test script we used for the game, we create an MCTS object, mcts = MCTS(tictactoe, args). We still have to define the arguments: for C, which we use in the UCB formula, we can roughly take the square root of two, which is what you might use generally, and for the number of searches we set 1000. We also pass the args into the MCTS constructor. Then, in the game loop, we only act ourselves if we are player 1; in all other cases, so when the player is -1 in Tic Tac Toe, we run a Monte Carlo tree search: mcts_probs = mcts.search(...). But remember that we always want to be player 1 when we do the search, so first we write neutral_state = tictactoe.change_perspective(state, player), which always flips the perspective here since the player is always -1 at this point, and then we pass this neutral state into mcts.search. Out of these probabilities we want to pick an action, and to keep things simple we just take the most promising one with np.argmax, which gives us the child that was visited the most. Let's test this; after fixing a couple of small typos in the parent check and in the children loop, it runs. The search plays in the middle, and if we as the human deliberately play a bit stupidly to check it out, the MCTS builds three -1s in a row and wins, so our Monte Carlo tree search is working well. Now that we have our standalone Monte Carlo tree search implemented, we can start building the neural network, so that we can later use the AlphaZero algorithm to train a model that understands how to play these games. Before we start building, let me briefly talk about the architecture of our neural network, using this visualization.
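For reference, a rough sketch of that test loop, assuming the simple human-vs-search driver set up earlier in the walkthrough (the exact printing and input handling there may differ):

```python
tictactoe = TicTacToe()
args = {'C': 1.41, 'num_searches': 1000}
mcts = MCTS(tictactoe, args)

state = tictactoe.get_initial_state()
player = 1

while True:
    if player == 1:
        # human move, e.g. an action index typed in as in the earlier driver
        action = int(input("action: "))
    else:
        # the search always assumes it is player 1, so flip the board first
        neutral_state = tictactoe.change_perspective(state, player)
        mcts_probs = mcts.search(neutral_state)
        action = np.argmax(mcts_probs)

    state = tictactoe.get_next_state(state, action, player)
    value, is_terminal = tictactoe.get_value_and_terminated(state, action)
    if is_terminal:
        print(state, value)
        break
    player = tictactoe.get_opponent(player)
```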
First we have the state that we give as input to the network; in our case this is just a board position, for example a Tic Tac Toe board, and we encode this state so that we end up with three planes next to each other, as in the image. There is one plane for all of the fields in which player -1 has played, where those fields are turned to ones and all other fields are zeros; one plane for all of the empty fields, again turned to ones with zeros everywhere else; and one plane for all of the fields in which player +1 has played, encoded the same way. Having these three planes makes it easier for our neural network to recognize patterns and understand how to play the game. Essentially we encode the board position so that it almost looks like an image: instead of RGB planes we have planes for player +1, for empty fields, and for player -1. Because our state looks like an image, we also want to use a network architecture that works well on images, which is why we use convolutional blocks. First we have the backbone, the main part of the network where most of the computation happens, built from conv blocks using the ResNet architecture, the residual network; this is a research design that turned out to work really well for understanding images. What the ResNet architecture does is add skip connections: when we feed forward through the backbone we don't just keep updating x by moving forward; we also store the residual, so the x value before it is passed through a conv block, and at the end the output is the sum of the x before the conv block and the x after the conv block or blocks. This makes the model much more flexible, because it could technically mask out conv blocks by not changing x at all and just using the residual skip connection. There is also an image from the paper: the main idea is to store x as a residual and add it back through the skip connection later, so the output of the conv blocks is summed with the initial x. That is the ResNet architecture we use inside our backbone. After we have gone through the backbone, the model splits into two heads. First there is the policy head: at the beginning it has a single conv block that takes the output of the backbone as input, then we flatten the result, and then we have a linear (fully connected) layer down to the output neurons. In the case of Tic Tac Toe we want nine neurons at the end, one for each possible action, since there are nine possible actions. Normally these outputs are just logits, but to get a readable distribution telling us where to play we also apply a softmax function.
The softmax turns the outputs of the nine neurons into a distribution of probabilities, where each probability indicates how promising a certain action is. That is why we have these nine neurons, and why in practice we later call the softmax method. The second head is the value head. It also has its own single conv block, separate from the one in the policy head, and it too takes the output of the backbone as input. Again we flatten the output of this conv block and add a linear (fully connected) layer, but this time we only want one neuron at the end, because the value head should tell us how good the state is, as a single float in the range of -1 to +1. The best way to get that range is to apply a tanh activation on this last single neuron; looking at a visualization of the tanh function, it squishes all possible values into the range from -1 to +1, which is exactly what we want for our value head. So ultimately we have a model that takes the encoded state as input and, on one hand, outputs the policy, the distribution telling us where to play, and on the other hand the value, an estimate of how good the state is. Now we can actually build this inside our Jupyter notebook. The first thing is that we want to import PyTorch, since that is the deep learning framework we use. I also print np.__version__ so we always have it, then import torch and print torch.__version__. For the loss we will later use a multi-target cross-entropy loss, and this was added as a feature to PyTorch fairly recently, so I recommend getting this version or any newer one; if you have any problems you can just use the one shown here. I also have CUDA support because of my Nvidia GPU, but you could get the CPU-only standalone PyTorch build depending on your setup. Additionally to torch we also want to import torch.nn as nn and then torch.nn.functional as F, with the capital F. Now we can create the model, and I will create a new cell above our MCTS implementation. I call the class ResNet and inherit from nn.Module. In the init we first have a game, then a number of res blocks, num_resBlocks, which is just the length of our backbone, and then num_hidden, the hidden size for our conv blocks. Inside, we first have to call super().__init__(), and then we can define the start block as an nn.Sequential.
Inside the Sequential we first have the conv block at the beginning: nn.Conv2d with an input channel size of 3, because remember we have these three planes at the start, and num_hidden as the output channel size; for the kernel size we set 3 and for the padding we set 1, because with a kernel size of 3 and padding of 1 this conv block won't change the spatial shape of our game state, so it always stays number of rows times number of columns for these pixel values, if you want to look at it like an image. After the Conv2d we also want a batch norm, nn.BatchNorm2d(num_hidden), which essentially speeds up training for us, and then an nn.ReLU(), which turns all negative values to zeros, clipping them off, which also makes training faster and more stable. So that is our start block: first the Conv2d, then the BatchNorm2d, then the ReLU. Next we define self.backBone as an nn.ModuleList, and inside we want an array of different res blocks, remembering the backbone from the architecture image. To do that we also have to create the ResBlock class: class ResBlock(nn.Module), with an init that takes num_hidden and again calls super().__init__(). For each res block we want a first conv block, self.conv1 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1), then a first batch norm block, self.bn1 = nn.BatchNorm2d(num_hidden), then a second conv block, self.conv2, set up the same way with num_hidden in and out, a kernel size of 3 and a padding of 1, and then the second batch norm, self.bn2 = nn.BatchNorm2d(num_hidden). Then we define the forward method, def forward(self, x), with x as the input. Remember that inside the res block we want this residual so that we can later have the skip connection, so we create residual = x, and then we update x separately: first by calling self.conv1(x), then self.bn1 on top of that, and then also a ReLU, which here is F.relu with the capital F.
That gives us x at this position, and then we also want to feed through the second conv block, updating x by calling self.bn2(self.conv2(x)). Now that we have gone through both conv blocks, we sum this output up with the residual, so x += residual, then call the ReLU again, x = F.relu(x), and return x. This way we have the nice residual connection, and at the end we just return the sum. Now we can actually create the backbone: a ResBlock(num_hidden) for i in range(num_resBlocks), so this gives us our backbone. Next we create the policy head, self.policyHead, which is again an nn.Sequential. Inside it we first want just a conv block, nn.Conv2d with num_hidden as the input channel size and 32 as the output channel size, and we also define a kernel size of 3 and a padding of 1; then again we call a batch norm and a ReLU. After this conv block, remember that we want to flatten the result, so we call nn.Flatten(), and then we want the linear layer, nn.Linear. For its input size we take the 32 planes multiplied by game.row_count and game.column_count, because with a kernel size of 3 and a padding of 1 for all of our conv blocks we never change the actual shape of our state, so to get the size of the flattened output we can just multiply the number of hidden planes, 32, by the row count and the column count. For the output of the linear layer we just want game.action_size. Then we also define the value head, self.valueHead, again as an nn.Sequential: first an nn.Conv2d with num_hidden as the input channel size and 3 as the output channel size, a kernel size of 3 and a padding of 1, then a BatchNorm2d with 3 as the hidden size, then again a ReLU, and then we flatten the result. At the end we add the linear layer with 3 * game.row_count * game.column_count as the input size and just 1 as the output, because we only want one neuron here, and then the tanh activation function, so we just add nn.Tanh(). With that finished, we write the forward method for our ResNet class: def forward(self, x). First we send x through the start block, x = self.startBlock(x), then we loop over all of our res blocks inside the backbone, for resBlock in self.backBone: x = resBlock(x), and then we get the policy and the value back: policy = self.policyHead(x) and value = self.valueHead(x), and we return policy and value.
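Putting the whole model together, here is a consolidated sketch of the ResNet and ResBlock classes exactly as dictated (the game object is assumed to expose row_count, column_count and action_size, as the Tic Tac Toe class from earlier does):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNet(nn.Module):
    def __init__(self, game, num_resBlocks, num_hidden):
        super().__init__()
        self.startBlock = nn.Sequential(
            nn.Conv2d(3, num_hidden, kernel_size=3, padding=1),
            nn.BatchNorm2d(num_hidden),
            nn.ReLU()
        )
        self.backBone = nn.ModuleList(
            [ResBlock(num_hidden) for i in range(num_resBlocks)]
        )
        self.policyHead = nn.Sequential(
            nn.Conv2d(num_hidden, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * game.row_count * game.column_count, game.action_size)
        )
        self.valueHead = nn.Sequential(
            nn.Conv2d(num_hidden, 3, kernel_size=3, padding=1),
            nn.BatchNorm2d(3),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3 * game.row_count * game.column_count, 1),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.startBlock(x)
        for resBlock in self.backBone:
            x = resBlock(x)
        policy = self.policyHead(x)
        value = self.valueHead(x)
        return policy, value

class ResBlock(nn.Module):
    def __init__(self, num_hidden):
        super().__init__()
        self.conv1 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(num_hidden)
        self.conv2 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(num_hidden)

    def forward(self, x):
        residual = x
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
        x += residual
        x = F.relu(x)
        return x
```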
Let's run that for now and then test that it works. I create a new cell and start by creating a game of Tic Tac Toe, so an instance of the TicTacToe class, and get the state by calling tictactoe.get_initial_state(). Then I update the state by calling tictactoe.get_next_state a couple of times, say playing at position 2 for one player and at position 7 for the other, and print the state, which shows the board we just built. Next, remember that we also have to encode the state when we give it to the model, so we write a new method on the game, get_encoded_state(self, state). We want the three planes, and we get them by writing encoded_state = np.stack over (state == -1, state == 0, state == 1), so we stack the plane for all fields equal to -1, the plane for all empty fields (equal to zero), and the plane for all fields equal to +1. These entries would just be booleans, but we want floats, so we call .astype(np.float32), and then we return the encoded state. We test this by writing encoded_state = tictactoe.get_encoded_state(state) and printing it, and we see the plain state first and then the encoded state: the plane for the fields in which player -1 has played, with that field encoded as a one, then the plane of empty fields, and then the plane for the fields where player +1 has played.
This is great, and now we want to get the policy and the value. First we have to turn the state into a tensor, so we write tensor_state = torch.tensor(encoded_state). When we give a tensor to our model as input we always need a batch dimension, but here we just have one state and not a whole batch of states, so we unsqueeze at axis 0; this essentially creates one more set of brackets around our encoded state so that we can pass it through the model. Now we can get the policy and the value by calling our model, but first we need to define it: model = ResNet(tictactoe, 4, 64), so Tic Tac Toe as the game, 4 for the number of res blocks, and 64 for the number of hidden channels. Then we write policy, value = model(tensor_state) and process the outputs. For the value we just want the float, which we get by calling value.item(), since this is a tensor and calling .item() on it gives us the float. For the policy we write policy = torch.softmax(policy, axis=1), choosing axis 1 because we don't want to apply the softmax over the batch axis but over the axis with our nine neurons. Then we squeeze at axis 0 to remove the batch axis again, detach so that there is no gradient attached when we turn this into NumPy, move it to the CPU, and call .numpy(). Printing both, we get a value of about 0.3 for this state, and then our policy distribution telling us where to play, just this list of probabilities. To make this a bit more visual, let's also import matplotlib.pyplot as plt and call plt.bar over the range of tictactoe.action_size with these policy values, then plt.show(). For each action we get a bar telling us how promising it looks; obviously our model was just randomly initialized, so we can't expect too much here, actually nothing at all, but once we have the trained model later we can expect a nice distribution of bars telling us where to play. One small change before we move on: let's set a seed for torch so that we get reproducible results and you can make sure you build something similar to what I build here, so torch.manual_seed(0).
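As a compact sketch of this encoding-and-inference test (which player was placed on which square is not fully clear from the audio, so the two get_next_state calls below are just an assumed example):

```python
import numpy as np
import torch
import matplotlib.pyplot as plt

torch.manual_seed(0)

tictactoe = TicTacToe()
state = tictactoe.get_initial_state()
state = tictactoe.get_next_state(state, 2, -1)   # assumed: player -1 on square 2
state = tictactoe.get_next_state(state, 7, 1)    # assumed: player +1 on square 7

# encode the board as three planes (player -1, empty, player +1) and add a batch axis
encoded_state = tictactoe.get_encoded_state(state)
tensor_state = torch.tensor(encoded_state).unsqueeze(0)

model = ResNet(tictactoe, 4, 64)
policy, value = model(tensor_state)

value = value.item()
policy = torch.softmax(policy, dim=1).squeeze(0).detach().cpu().numpy()

print(value)
plt.bar(range(tictactoe.action_size), policy)
plt.show()
```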
Now we can start incorporating the neural network into our Monte Carlo tree search. The first thing is that the search now takes a model as input, which will later just be our ResNet, so we declare self.model = model. Then there are two key changes we want to make. Basically, after we have reached the leaf node, we want to predict a value and a policy based on the state of that leaf node, and then we want to use this policy when we expand, so that we can incorporate its probabilities into the UCB formula. That way, during the selection phase we are more likely to choose nodes with higher policy values, because those are the ones that seemed more promising to our model; we want to choose them more often and walk down the tree in the directions our model guides us. That is the first thing we use the model for. The second is that we want to use the value we get from the model to back-propagate once we have reached the leaf node, which means we can completely remove the simulation part; we don't do any rollouts with random actions anymore, and instead just use the model's value. So in the search we write policy, value = self.model(...), and for the input we pass the state of our node, but remember that we want to encode it beforehand so that we have the three planes for the two players and the empty fields, as you might remember from the last checkpoint: self.game.get_encoded_state(node.state). We then want to turn this encoded state into a tensor instead of just a NumPy array so that we can give it to the model directly, so torch.tensor(self.game.get_encoded_state(node.state)), and we unsqueeze at axis 0, because we always need an axis for the batch and we aren't batching any states here, so we essentially create a batch containing just this one state. That gives us a policy and a value as the result. Now, remember from the model checkpoint that the policy currently consists of raw logits, just nine floats, and we want to turn it into a distribution of likelihoods, so we write policy = torch.softmax(policy, axis=1), setting axis to 1 because we want to apply the softmax not over the batch axis but over the axis of our nine neurons in the case of Tic Tac Toe. After the softmax we squeeze at axis 0 again so that we remove the batch axis, reversing the unsqueeze we did after feeding our state through the model, and then we call .cpu() in case we use a GPU later.
Then we call .numpy() to get a NumPy array. We should also put the @torch.no_grad() decorator on top of our search method, because we don't want to use this policy and value for training here, only for prediction, so we don't want to store the gradients of these tensors; this also makes the Monte Carlo tree search faster to run. So now we have our policy, converted to NumPy with the softmax activation applied, but next we also want to mask out all of the illegal moves, because when we expand we don't want to expand into positions where a player has already played. We mask the policy on these illegal positions by first getting the valid moves with self.game.get_valid_moves(node.state), using the state of our leaf node, and then multiplying the policy by these valid moves, so that all illegal moves now have a policy of zero, which is great. But then we have to rescale the policy so that we have percentages again, and we do this by dividing it by its own sum, so that the sum of our policy turns back to one, giving us the nice distribution we want. Let's also change the value: we just want the float from the single neuron of our value head, which we get with value = value.item(); in PyTorch, if you call .item() on a tensor that holds only one float, you get that float as the result. So now we have our value and our policy: we want to use the value for back-propagation and the policy for expanding, so we call node.expand(policy) and remove the simulation part, so that this value is what gets back-propagated later whenever we reach a leaf node and expand. There are some smaller things we can clean up as well. First, we can completely delete the simulate method. Then, because we now have the whole policy when we call expand, it no longer makes sense to expand in only one direction once we have reached a leaf node; instead we expand in all directions immediately. That also means we don't need the expandable_moves attribute anymore, because legality is checked through get_valid_moves instead. We can also change the is_fully_expanded method: now we just check whether the length of our children is larger than zero, because when we expand once, we expand in all directions, so as soon as a node has one child we know there aren't any more children it could get.
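Here is a sketch of how the updated, model-guided search method looks with these changes applied (the Node class is assumed to have the expand and backpropagate variants described in this part of the walkthrough):

```python
@torch.no_grad()
def search(self, state):
    root = Node(self.game, self.args, state)

    for search in range(self.args['num_searches']):
        # selection: walk down while every child has already been created
        node = root
        while node.is_fully_expanded():
            node = node.select()

        value, is_terminal = self.game.get_value_and_terminated(node.state, node.action_taken)
        value = self.game.get_opponent_value(value)

        if not is_terminal:
            # replace random rollouts with a single network evaluation of the leaf
            policy, value = self.model(
                torch.tensor(self.game.get_encoded_state(node.state)).unsqueeze(0)
            )
            policy = torch.softmax(policy, dim=1).squeeze(0).cpu().numpy()

            # mask illegal moves and renormalize
            valid_moves = self.game.get_valid_moves(node.state)
            policy *= valid_moves
            policy /= np.sum(policy)

            value = value.item()
            node.expand(policy)

        node.backpropagate(value)

    # visit-count distribution over the root's actions
    action_probs = np.zeros(self.game.action_size)
    for child in root.children:
        action_probs[child.action_taken] = child.visit_count
    action_probs /= np.sum(action_probs)
    return action_probs
```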
Next we have to update the node's expand method. It now takes the policy as input, and we loop over it: for action, prob in enumerate(policy), so just the probability at a certain action, and then we check for each action whether the probability is larger than zero, if prob > 0. We can remove the two lines that sampled a single move and marked it as used, and instead, whenever the probability is positive, we create the new child and append it to the children. But remember that we want to use this probability inside the UCB formula later during selection, so we want to store it on the node object. We do this by adding a new prior argument to the Node constructor, defaulting prior to 0, and defining self.prior = prior. The prior is just the probability that was given when the child was initiated, which equals the policy at the given action from the parent's perspective when the parent was expanded. Now that we have defined this prior, we also want to use it when we create a new node, so we pass prob as the prior argument at the end. With all of that ready, we next have to update the UCB formula, since AlphaZero uses a different formula than standard Monte Carlo tree search; I have brought up an image of it. We remove the math.log, we wrap the square root only around the visit count of the parent, we add a one to the visit count of the child, and we multiply the exploration term by the prior. First of all, though, if we call get_ucb on a child that has a visit count of zero, we can't calculate a Q value, so in that case we just set the Q value to zero as well.
So we write: if child.visit_count == 0, then the Q value should be zero, and else the Q value should be the same normalized, flipped expression as before. This matters because we no longer back-propagate immediately on a node that was just created during expansion, so it is now actually possible to call get_ucb on a child that has never been visited. Then we update the exploration term below, as on the image: we remove the math.log, we add a one to the visit count of our child, and we multiply everything by the prior of our child, so that we also use the policy when we select our way down the tree. That should be everything, so let's see if we have it all ready and run it. First we update C to be equal to 2, then we create a model: a ResNet with Tic Tac Toe as the game, 4 for the number of res blocks, and 64 for the number of hidden planes, and we also call model.eval(). Currently this model has just been randomly initialized, so we can't expect too much, but let's try it out; we were still missing the model as an argument inside the MCTS constructor, and after adding that it runs. With our updated MCTS the search plays its move; we answer on position 1, the search plays again, we play another more or less random square, and the search, as player -1, completes its line and wins the game. So our updated Monte Carlo tree search is ready.
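For reference, a sketch of the two Node methods after these AlphaZero-specific changes (prior-weighted UCB, and expansion over the whole masked policy at once):

```python
# In the Node class: AlphaZero-style UCB with the network prior.
def get_ucb(self, child):
    if child.visit_count == 0:
        q_value = 0
    else:
        q_value = 1 - ((child.value_sum / child.visit_count) + 1) / 2
    return q_value + self.args['C'] * (math.sqrt(self.visit_count) / (child.visit_count + 1)) * child.prior

# In the Node class: expand every action with non-zero (masked) policy probability.
def expand(self, policy):
    for action, prob in enumerate(policy):
        if prob > 0:
            child_state = self.state.copy()
            child_state = self.game.get_next_state(child_state, action, 1)
            child_state = self.game.change_perspective(child_state, player=-1)

            child = Node(self.game, self.args, child_state, self, action, prob)
            self.children.append(child)
```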
Next we can actually start building the main AlphaZero algorithm. We create an AlphaZero class, defined just below our updated MCTS. For the init we take a model, then an optimizer (this will just be a PyTorch optimizer such as Adam, which we will use for training), then a game, and some additional arguments. We assign all of these: self.model = model, self.optimizer = optimizer, self.game = game, and self.args = args, and at the end we also define a Monte Carlo tree search object inside the AlphaZero class, self.mcts = MCTS(game, args, model), passing in the game, the args, and the model when we initiate it. That is just the standard init of our AlphaZero class, and next we define the methods inside. From the overview you might remember that we have the two components, the self-play part and then the training part, with the AlphaZero wrapper around them, so we create a selfPlay method and a train method and just pass in both of them for now. Then we add the learn method: this is the main method we will call when we actually want to start the cycle of continuous learning, where we run self-play, gather data, use that data for training to optimize the model, and then call self-play again with a model that is much smarter. For this cycle we first loop over all of the iterations we have, for iteration in range(self.args['num_iterations']), and for each iteration we create a memory list, essentially the training data for one cycle. Then we loop over all of our self-play games, for selfPlay_iteration in range(self.args['num_selfPlay_iterations']), and for each one we extend the memory with the data we just got out of the selfPlay method, memory += self.selfPlay(). We also want to change the mode of our model so that it is in eval mode during self-play, so that things like batch norm behave correctly while we play. Then we move on to the training part: we call self.model.train(), and for epoch in range(self.args['num_epochs']) we call the train method with the memory as input, self.train(memory), also adding memory as an argument to that method. At the end of an iteration we want to store the weights of our model, so torch.save(self.model.state_dict(), f"model_{iteration}.pt"), and let's also save the state of the optimizer, torch.save(self.optimizer.state_dict(), f"optimizer_{iteration}.pt"), using f-strings for the paths with .pt as the file ending.
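A sketch of the learn method as just described (argument names such as 'num_iterations' follow the args dictionary defined a bit later in the walkthrough):

```python
def learn(self):
    for iteration in range(self.args['num_iterations']):
        memory = []

        # self-play: collect (state, MCTS policy, outcome) samples with the current model
        self.model.eval()
        for selfPlay_iteration in range(self.args['num_selfPlay_iterations']):
            memory += self.selfPlay()

        # training: fit the model to the collected samples
        self.model.train()
        for epoch in range(self.args['num_epochs']):
            self.train(memory)

        # checkpoint model and optimizer after every iteration
        torch.save(self.model.state_dict(), f"model_{iteration}.pt")
        torch.save(self.optimizer.state_dict(), f"optimizer_{iteration}.pt")
```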
Now that the main loop is ready, we can focus on the selfPlay method. Inside it we again define a new memory list, this time just for one self-play game, we set player = 1 at the start, and we create the initial state with state = self.game.get_initial_state(). Then we have a while True loop, and inside it we first run the Monte Carlo tree search based on the current state to get the action probabilities, then we sample an action from that distribution, use the action to play and get a new state, and check whether this state is terminal, so whether the game has ended. If it has, we want to return all of the collected data to the memory, structured as tuples where each instance holds a state, the action probabilities from our MCTS, and the final outcome: whether the player who acted at that instance turned out to be the player who won the game, lost it, or drew. Concretely, inside while True we first remember that when we call mcts.search we always want to be player 1, so we get the neutral state by calling self.game.change_perspective(state, player), taking the current state and the player from above. Then we get the action probabilities back from the search, action_probs = self.mcts.search(neutral_state). Now that we have the neutral state and the action probabilities, we store this information in the memory so that we can later build the training data from it, appending the tuple of the neutral state, the action probabilities, and the player to the local memory list we defined at the top of the method (not an attribute on self). Then we sample an action out of the action probabilities with np.random.choice: for the number of options we use the action size of our game, which is nine in the case of Tic Tac Toe, so self.game.action_size, and for the sampling probabilities we use the action probabilities we got from our MCTS. Now that we have sampled an action, we play it: state = self.game.get_next_state(state, action, player), passing the old state, the action, and the player. Then we check whether the updated state is terminal or not: value, is_terminal = self.game.get_value_and_terminated(state, action), giving the updated state and the action we used to get there as input. If is_terminal is true we want to return the data, so first we create returnMemory as a new list, and then we loop over all of the instances inside our memory.
We write: for hist_neutral_state, hist_action_probs, hist_player in memory, and for each instance we now want the final outcome at that instance, because we need to store that inside our training data. The outcome, hist_outcome, should be equal to the value we just received if hist_player is also the player we were when we achieved that value: if we were player +1 when we played the winning move and indeed won the game, then we receive a value of +1 and we want to set the outcome to +1 for all instances in which player +1 played, and to -1 for all instances in which player -1 played. So hist_outcome = value if hist_player == player, and otherwise the negated value. Then we append this information to returnMemory: again a tuple inside the brackets, with the neutral state, but since we will feed these states to the model later we can already encode them here, so self.game.get_encoded_state(hist_neutral_state) for the state, then the action probabilities from our MCTS distribution, and then the outcome we just computed. We append this to returnMemory rather than returning immediately; only after we have looped through everything do we actually return returnMemory. In all other cases, if we haven't reached a terminal state yet, we want to flip the player around so that we are now player -1 if we were player +1 before: player = self.game.get_opponent(player). There are still some minor things I would like to tidy up. First, to make the code more general, we replace the negated value in the outcome computation with the value as perceived by the opponent according to the game we have inside our AlphaZero implementation, self.game.get_opponent_value(value), so that we could also run this in games with only one player; calling that method is just more general.
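Putting the whole method together, a sketch of selfPlay as described (the tuple layout of encoded state, MCTS policy and outcome is exactly what the train method consumes later):

```python
def selfPlay(self):
    memory = []
    player = 1
    state = self.game.get_initial_state()

    while True:
        # the search always plays from player 1's point of view
        neutral_state = self.game.change_perspective(state, player)
        action_probs = self.mcts.search(neutral_state)
        memory.append((neutral_state, action_probs, player))

        # sample a move from the MCTS visit-count distribution and play it
        action = np.random.choice(self.game.action_size, p=action_probs)
        state = self.game.get_next_state(state, action, player)

        value, is_terminal = self.game.get_value_and_terminated(state, action)
        if is_terminal:
            returnMemory = []
            for hist_neutral_state, hist_action_probs, hist_player in memory:
                # the final result, seen from the player who acted at that instance
                hist_outcome = value if hist_player == player else self.game.get_opponent_value(value)
                returnMemory.append((
                    self.game.get_encoded_state(hist_neutral_state),
                    hist_action_probs,
                    hist_outcome
                ))
            return returnMemory

        player = self.game.get_opponent(player)
```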
The next thing is that we want to visualize these loops, and we can do that with progress bars from the tqdm package: from tqdm.notebook import trange, and then we simply replace the range calls below with trange, which is only a small change. Now we want to check that the AlphaZero implementation works as far as we have built it, so we create an instance of AlphaZero, and for that we need a model, an optimizer, a game, and some arguments. First we create an instance of Tic Tac Toe, tictactoe = TicTacToe(), then the model, model = ResNet(tictactoe, 4, 64), with 4 res blocks and a hidden size of 64. For the optimizer we use Adam as built into PyTorch: optimizer = torch.optim.Adam(model.parameters()), with a learning rate of 0.01 as a standard value for now. Next we define the arguments in another dictionary: for the exploration constant C we again choose 2, for the number of searches 60, for the number of iterations at the highest level just 3 for now, for the number of self-play iterations, so the number of self-play games we play per iteration, 500, and for the number of epochs 4. Actually, let's temporarily lower the self-play count to 10 so that we can quickly confirm the models get saved. Then we create the AlphaZero instance with the model, the optimizer, the Tic Tac Toe game, and the args we just defined, and run alphaZero.learn() to see if this works. Nice: we get a good-looking progress bar and can see the self-play games running.
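A sketch of that driver cell (batch_size is included here for completeness even though it is only added a little later in the walkthrough, when the train method needs it):

```python
from tqdm.notebook import trange   # used inside learn() in place of range

tictactoe = TicTacToe()
model = ResNet(tictactoe, 4, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

args = {
    'C': 2,
    'num_searches': 60,
    'num_iterations': 3,
    'num_selfPlay_iterations': 500,   # temporarily lowered to 10 for a quick smoke test
    'num_epochs': 4,
    'batch_size': 64                  # defined later in the walkthrough, used by train()
}

alphaZero = AlphaZero(model, optimizer, tictactoe, args)
alphaZero.learn()
```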
Now we can actually implement the train method inside our AlphaZero implementation. The first thing we always want to do when this method is called is shuffle the training data, so that we don't get the same batches all the time, with random.shuffle(memory); to use that we have to import random at the top. Next we want to loop over the memory in batches, so that we have a batch index and for each batch index we can take a whole batch of samples and use them for training. We write for batchIdx in range(0, len(memory), self.args['batch_size']), so starting at zero, ending at the length of our memory, and stepping by the batch size. Then we take a sample from the memory by slicing it from batchIdx up to batchIdx + self.args['batch_size'], but remember that we don't want to index past the end of the memory, so we have to be careful here and cap the end of the slice with min(len(memory) - 1, batchIdx + self.args['batch_size']); this way we never exceed the limit. Now we want to get the states, the MCTS distributions, and the final rewards from our sample, and we do this by writing state, policy_targets, value_targets = zip(*sample), with policy_targets for the MCTS distributions and value_targets for the final reward we get from self-play. The way this works is that it basically transposes our sample: currently we have just a list of tuples, where each tuple contains a state, the policy targets, and the value, but by applying the asterisk and zipping we get a list of states, a list of policy targets, and a list of value targets. Each of these is a list full of NumPy arrays, and it is much more helpful to have them as single NumPy arrays immediately, so we change them by setting state, policy_targets and value_targets to np.array of each. For the value targets, remember that currently this is just a flat array of float values, but it is much more helpful if each value sits in its own sub-array, so that it later matches the output shape of our model, so we call .reshape(-1, 1), with -1 for the batch axis and 1 so that each value gets its own sub-array. Then we turn all three of these into tensors: state = torch.tensor(state, dtype=torch.float32) (even though the encoded state might currently be integers rather than floats, so it is nice to set the dtype here), policy_targets = torch.tensor(policy_targets, dtype=torch.float32), and value_targets = torch.tensor(value_targets, dtype=torch.float32). Now everything is ready, and the next step is to get the predicted policy and value from our model by letting it predict on the states we have here.
The next step is to get the predicted policy and value from our model by passing in the states: out_policy, out_value = self.model(state). Now we want a loss for the policy and a loss for the value. The policy loss, as you might remember, is a multi-target cross-entropy loss, so we write policy_loss = F.cross_entropy(out_policy, policy_targets). For the value loss we use the mean squared error: value_loss = F.mse_loss(out_value, value_targets). Then we sum the two to get a single loss: loss = policy_loss + value_loss. To minimize this loss we backpropagate: first optimizer.zero_grad(), then loss.backward(), then optimizer.step(). PyTorch handles all of the backpropagation for us and actually updates the model, so after this step we have a better model that we can use during self-play.
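Putting the rest of the training step together, a minimal sketch (again assuming the names used so far, and using self.optimizer, which the video itself switches to in a later fix) could be:

```python
import torch.nn.functional as F

# Continuing inside train(): forward pass, the two losses, and one optimizer step.
out_policy, out_value = self.model(state)

policy_loss = F.cross_entropy(out_policy, policy_targets)  # multi-target cross-entropy
value_loss = F.mse_loss(out_value, value_targets)          # mean squared error
loss = policy_loss + value_loss

self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
```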
So let's test this out. I'll set num_selfPlay_iterations back to 500 and train for the number of iterations we defined; feel free to train along, but you don't have to — we will make this more efficient later, so you can also just watch it run on my machine. I also forgot to define a batch size, so let's pick 64 as the default. Running the cell gives us these nice progress bars during training; on my machine the whole run took roughly 50 minutes using only the CPU, and we have now trained for three iterations, with the model saved after each one because of the save expression we added.

Now let's check what the trained network actually understands about the game. We move back up to the cell where we tested the randomly initialized model, and when we define the model there we load the trained weights: model.load_state_dict(torch.load('model_2.pt')) — iteration two, since we want the last checkpoint — and we also put the model into eval mode. Running this, for the first test state we get a distribution telling us, as player +1, to play either in the middle or in this corner. Let's also test the second state, where fields 2 and 4 belong to player -1 and fields 6 and 8 belong to player +1: we copy this board state over from the image, encode it, and the network now correctly tells us to play position 7, because that wins the game, and the value it returns is close to +1, since this is a very good position for player +1. Remember that this is just what the neural network outputs as a plain mathematical function — we never ran a Monte Carlo tree search to get this result. So we can see that the network really has internalized the tree search by playing with itself: we can call it with a given state and get a neat distribution of where to play, plus a value telling us how good that state is for the current player.
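A sketch of that evaluation cell, with the board setup mirroring the example just discussed (the get_next_state argument order is an assumption based on how the method is used elsewhere), might look like this:

```python
import torch

# Evaluate the trained Tic Tac Toe network on a hand-built position.
tictactoe = TicTacToe()
model = ResNet(tictactoe, num_resBlocks=4, num_hidden=64)
model.load_state_dict(torch.load('model_2.pt'))
model.eval()

state = tictactoe.get_initial_state()
state = tictactoe.get_next_state(state, 2, -1)
state = tictactoe.get_next_state(state, 4, -1)
state = tictactoe.get_next_state(state, 6, 1)
state = tictactoe.get_next_state(state, 8, 1)

encoded_state = tictactoe.get_encoded_state(state)
tensor_state = torch.tensor(encoded_state, dtype=torch.float32).unsqueeze(0)  # add a batch axis

with torch.no_grad():
    policy, value = model(tensor_state)

policy = torch.softmax(policy, dim=1).squeeze(0).numpy()
print(value.item())
print(policy)  # should peak at action 7 for this position
```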
Great — now we can apply some tweaks to our AlphaZero implementation. The first tweak is GPU support, since a GPU can make training much faster depending on your setup. We declare a device: device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') — an NVIDIA GPU if CUDA is available, otherwise the CPU as before. We pass this device to our model as an argument, which means we also add it to the ResNet class: we store it as self.device and call self.to(device), so the model is moved to the GPU when CUDA is available. Next we use the device during self-play and during training. In self-play we call the model inside the MCTS search when we compute the policy and value, so the tensor we build there has to live on our device: we add device=self.model.device when creating it. During training we do the same: the state is the input to our model, so we move it to self.model.device, and we also move the policy targets and the value targets there, since we compare them against the model's outputs. With that we have GPU support. The next tweak is weight decay, because AlphaZero has L2 regularization in its loss, as you can see on this slide; we simply add weight_decay=0.001 to the Adam optimizer, and that should be fine.
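A compact sketch of these two tweaks (the ResNet device parameter and its internals are assumptions about how the class is extended):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Inside ResNet.__init__ the device is stored and the model is moved onto it:
#     self.device = device
#     self.to(device)
model = ResNet(game, num_resBlocks=4, num_hidden=64, device=device)

# The L2 regularization of the AlphaZero loss enters through the optimizer's weight decay.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.001)

# During self-play and training, tensors are then created on the model's device, e.g.
#     torch.tensor(..., device=self.model.device)
```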
Now that we have weight decay and GPU support, we also want to add a so-called temperature. Remember that when we sample an action from the action probabilities, we currently use the visit-count distribution of the root's children directly as the sampling probabilities. We would like to be more flexible: sometimes we want to explore more, and occasionally pick actions whose children haven't been visited very often, and sometimes we want to exploit more and tend to pick only the actions that currently look most promising. We get that flexibility by computing temperature action probabilities: we take the old action probabilities and raise them to the power of 1 / self.args['temperature'], and we define temperature in our arguments, set to 1.25 here. For a temperature larger than one the exponent becomes smaller than one, which squishes the probabilities closer together, so we explore more — as the temperature goes to infinity it becomes as if we sampled completely random actions. If the temperature moves towards zero (close to zero, not exactly zero), the exponent grows, and in the limit it is as if we just took the argmax of the distribution — pure exploitation. So the temperature gives us a knob to decide whether we'd rather take more random, exploratory actions or greedily exploit the actions that look best right now.
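As a sketch of this step inside selfPlay — note that the renormalization line is my addition so the probabilities sum to one, and the video only switches the actual sampling over to these tempered probabilities in a later fix:

```python
import numpy as np

# Temper the visit-count distribution before sampling an action:
# temperature -> infinity approaches uniform random play (more exploration),
# temperature -> 0 approaches the argmax of the distribution (pure exploitation).
temperature_action_probs = action_probs ** (1 / self.args['temperature'])
temperature_action_probs /= np.sum(temperature_action_probs)  # renormalize so it sums to 1
action = np.random.choice(self.game.action_size, p=temperature_action_probs)
```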
The final tweak is to add some noise to the policy of the root node at the beginning of each Monte Carlo tree search. By perturbing the root policy we explore more and walk down directions we might not have checked before, so we don't miss any potentially promising actions. To do this we compute the root policy up front, separately from the search iterations: we call self.model on torch.tensor(self.game.get_encoded_state(state), device=self.model.device) — using the root's state, and the model's device so that GPU support still works — and unsqueeze at the end to add a batch axis. We get a policy and a value back, but since we don't need the value at the root we assign it to a throwaway variable to keep things readable. Then we process the policy as before: torch.softmax over axis 1 (not the batch axis), squeeze, move the tensor to the CPU and take the NumPy array out of it.

Now we add the noise, following the formula on this slide: the updated root policy is the old policy multiplied by a coefficient (1 - epsilon), plus epsilon times a noise term. Epsilon is a float smaller than one, so we scale all of the old policy values down a bit, and the noise is Dirichlet noise, which we get from np.random.dirichlet — a random distribution over the actions that we mix into our policy. This changes the policy slightly to incorporate some randomness, so we explore more instead of always walking down the tree exactly the way our model suggests — which matters especially early in training, when the model doesn't know much about the game yet. Concretely: policy = (1 - self.args['dirichlet_epsilon']) * policy + self.args['dirichlet_epsilon'] * np.random.dirichlet([self.args['dirichlet_alpha']] * self.game.action_size) — let's put the alpha part on the next line so it stays readable. The alpha parameter controls the shape of the Dirichlet distribution and depends on the number of actions in your game; for Tic Tac Toe a fairly high value of roughly 0.3 turns out to work well, but it may change for other environments. We also have to move the masking below this, so that we mask out the illegal moves after applying the noise and not before: we get the valid moves, multiply the policy by them, and divide the policy by its own sum to turn it back into a probability distribution. Finally we use this policy to expand the root node by calling root.expand(policy).
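A sketch of this root-evaluation step inside the search method — the Node constructor signature is an assumption following the class built earlier in the video, and wrapping the model call in torch.no_grad() is my own framing since no gradients are needed here:

```python
import numpy as np
import torch

# Evaluate the root once, add Dirichlet noise, mask illegal moves, then expand.
with torch.no_grad():
    policy, _ = self.model(
        torch.tensor(self.game.get_encoded_state(state),
                     device=self.model.device).unsqueeze(0)
    )
policy = torch.softmax(policy, dim=1).squeeze(0).cpu().numpy()

# Mix Dirichlet noise into the prior so the root explores beyond the raw network policy.
policy = (1 - self.args['dirichlet_epsilon']) * policy \
    + self.args['dirichlet_epsilon'] \
    * np.random.dirichlet([self.args['dirichlet_alpha']] * self.game.action_size)

# Mask out illegal moves only after adding the noise, then renormalize.
valid_moves = self.game.get_valid_moves(state)
policy *= valid_moves
policy /= np.sum(policy)

root = Node(self.game, self.args, state)  # constructor signature assumed from earlier
root.expand(policy)
```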
There is one more minor change we can make. Currently, when we expand the root node at the beginning, we don't backpropagate immediately, so the root gets children while its own visit count is still zero. That means that when we select a child for the first time, the UCB formula is called with the parent's visit count equal to zero, so the whole term collapses to zero as well — in other words, the prior is ignored on that very first selection. To fix this we simply set the root's visit count to one at the beginning: we add a visit_count parameter to the node, defaulting to zero, and pass visit_count=1 when we create the root, so the prior information is used immediately when we pick a child at the start of the search.

Now we can run all of these cells again and check that everything works. We also have to add dirichlet_epsilon to the arguments, set to 0.25, and dirichlet_alpha, set to 0.3 like I mentioned. Testing this, we still have a CPU device lingering somewhere — right, I hadn't run this cell, which caused the error; we have to pass a device here too, and for this cell we can just use torch.device('cpu'). Running it again, it works. For the current Tic Tac Toe setup it won't be dramatically faster, but later, with more complex games, GPU support will be very welcome, and these other tweaks should make the model more flexible and make it easier to reliably end up with a perfect model. I will train this again — feel free to train it yourself, but you don't have to; I've uploaded all of the weights, and you can find the link in the video description. After training we will briefly check the results again, and then we will extend our agent so that it can also play Connect 4 in addition to Tic Tac Toe. See you soon.
Okay, great — I've trained everything, and now we want to check that the model has learned to play Tic Tac Toe. We move up to the evaluation cell again and also define a device there: device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu'). We use this device when we define the model, we use it for the tensor of the encoded state that we feed into the model to predict the policy and the value (device=device), and we use it when loading the state dict from the checkpoint file by passing map_location=device. Running this again, we get a nice distribution and a nice value, so the model has clearly learned. Notice, by the way, that we aren't even masking the policy here: the model has learned on its own that occupied fields are bad moves — you can't play there because someone already has — without us filtering the illegal moves out. So that's nice, and now we can move on to training AlphaZero on the game of Connect 4.

Before we define a class for Connect 4, let's add a __repr__ method to our TicTacToe game: def __repr__(self) returning the string 'TicTacToe'. This string is what we get whenever the object is used where a string representation is needed, and we can use it down below when we save our model each iteration during learning: we simply include self.game in the file names, so the model and optimizer state dicts carry the name of the game and we won't overwrite our Tic Tac Toe checkpoints when we train on Connect 4.
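A sketch of those save calls inside learn() — the exact f-string pattern is an assumption, chosen to match file names like model_7_ConnectFour.pt that appear later in the video:

```python
# With __repr__ defined on the game classes, each checkpoint includes the game name,
# so Connect 4 runs don't overwrite Tic Tac Toe runs.
torch.save(self.model.state_dict(), f"model_{iteration}_{self.game}.pt")
torch.save(self.optimizer.state_dict(), f"optimizer_{iteration}_{self.game}.pt")
```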
Now let's copy the game class over and use it to define the game of Connect 4. For Connect 4 the row count is 6, the column count is 7, and the action size is just the column count, since a move is simply the column you drop a stone into. Let's also add a variable for the number of stones you need in a row to win: self.in_a_row = 4, and change the __repr__ string to 'ConnectFour'. get_initial_state can stay the same, since we still just want a blank array with row_count rows and column_count columns.

We do have to change get_next_state, because the action is now a column telling us where to drop a stone. First we need the row: we look at the given column, find all of the fields in it that are still zero (empty), and take the deepest empty field, because that is where the stone lands in Connect 4. Since row indices in a NumPy array start at zero at the top and grow downwards, the deepest empty field is the one with the highest index, so we write row = np.max(np.where(state[:, action] == 0)): np.where gives us the indices of the empty fields and np.max picks the largest one. With the row (and the column coming from the action) we then set state[row, action] = player (sketched below). Next, get_valid_moves: here we only need to look at the topmost row — a column is a valid move if its field in row zero is still empty, so we return state[0] == 0 as ones and zeros; anything else is an illegal move for that action. For check_win I'm simply going to copy the method over; it does the same kind of checking as in Tic Tac Toe — vertical, horizontal and the two diagonals through the action that was just played — it's just a bit more involved, because we have to walk in both directions and keep counting stones that match our own player sign. get_value_and_terminated can stay the same for both games, and so can get_opponent, get_opponent_value, change_perspective and get_encoded_state. I also noticed a small mistake: in get_next_state we can't reference column, since we defined the move as action, so let's replace column with action.

Now let's check that this works in practice by making our code more general: we replace the hard-coded TicTacToe references with a game variable and use game everywhere, so there are no more TicTacToe instances left, and then we set the game to ConnectFour. Let's set the number of searches to a small value — say 100 — just so we can validate the game logic, and let's make the model larger for Connect 4: 9 residual blocks and a hidden dimension of 128. We are also missing the device argument, so we add torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') and pass this device into our model class.
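Here is a sketch of the ConnectFour pieces described above; check_win and the methods shared with TicTacToe are omitted for brevity:

```python
import numpy as np

class ConnectFour:
    def __init__(self):
        self.row_count = 6
        self.column_count = 7
        self.action_size = self.column_count
        self.in_a_row = 4

    def __repr__(self):
        return "ConnectFour"

    def get_initial_state(self):
        return np.zeros((self.row_count, self.column_count))

    def get_next_state(self, state, action, player):
        # The action is a column; the stone falls to the deepest empty row.
        row = np.max(np.where(state[:, action] == 0))
        state[row, action] = player
        return state

    def get_valid_moves(self, state):
        # A column is playable as long as its topmost field is still empty.
        return (state[0] == 0).astype(np.uint8)
```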
Let's run this again. Perfect — now we have the game of Connect 4, so let's say we want to play at position 4... sorry, 5. We still have a problem: dirichlet_epsilon isn't set here either, so let's copy these argument values over, but set dirichlet_epsilon to 0 so that we don't use any noise here. Okay: I've played here, the model has played here; I play at 1, the model plays 0 again; I play at 2, the model plays 0 again; and then I can just play at 3 — and I've won. So at least we have checked that the game of Connect 4 works, but obviously the model is terrible: we aren't doing many searches, and we haven't trained anything yet.

Let's change that and also train a model for Connect 4, then test the results here again. We only need some minor changes. First we make the training cell more general too: we introduce a game variable, set it to ConnectFour, and use game instead of the hard-coded instance. Because Connect 4 is harder to learn, we set the number of residual blocks to 9 and the hidden dimension to 128, and we pass game into the AlphaZero initialization. We also tweak the hyperparameters slightly: 600 for the number of searches, something like 8 for the number of iterations for now, and a batch size of 128. Then we run it. This will take a lot longer — we will improve the efficiency later — so feel free to just use the weights I've uploaded in the description below instead of training everything yourself, since on my machine this currently takes a couple of hours. See you soon.

Okay — I've now trained AlphaZero on Connect 4 with these arguments, and frankly I only trained for one iteration, since even that took several hours. So next we want to increase the efficiency, so that training is much faster even for these more complex environments. Still, it was worth training that one iteration, because now we can evaluate the model and find out whether it learned anything at all. Let's move to the evaluation cell: when we create the model there, we also want to load the state dict from the weights we saved after that one iteration.
So in the line where we create our model we write model.load_state_dict(torch.load(...)), referencing the checkpoint path: 'model_0_ConnectFour.pt' — iteration zero, since we start counting at zero and only trained one iteration — with the ConnectFour game name and the .pt file ending. Let's also pass map_location=device, so that if your device changed between training and evaluation you can still load the state dict. Now let's run the cell again and just play at 6. The model answers in the middle; I play 6 again, and the model defends that position. I play at 0 — the model starts building a tower there; I play 0 again, it keeps building, I play 0 once more and the model finishes its tower and wins the game. So the model still isn't playing perfectly (and I was deliberately only playing in the corners to probe its basic intuition), but it is clearly much better than the untrained model, which only ever wanted to play in a single column.

Now we want to increase the efficiency of our AlphaZero implementation. The main part is essentially done, and you should have an intuitive understanding of how the algorithm works, but it would be much nicer to speed up training, and especially self-play. How can we do that? We want to parallelize as much of the implementation as possible, because neural networks are built so that we can batch up states and get parallel predictions for the policy and the value — which is exactly what we need in the self-play loop. One way to parallelize would be a package like Ray, which runs independent workers and lets you harness all of your GPU power that way, but I'd rather build the parallelized version in a more Pythonic way, right inside our Jupyter notebook. The idea is to batch up the states of many games, get policy and value predictions for all of them at once, and then distribute the results back to the individual self-play games, so that we play several self-play games in parallel. On one hand this fully utilizes the GPU, and on the other hand it drastically reduces the number of times we call our model, so we get a much higher speed. The first step is to add a new argument, num_parallel_games, to our dictionary, and I'll set it to 100 for now. Then we copy the AlphaZero class over into a new class that I'll call AlphaZeroParallel.
We also want to update the Monte Carlo tree search, so I'll copy that class over as well and call it MCTSParallel. In these parallel classes we are going to change the implementation of the selfPlay method and, because of that, the implementation of the search method. We also create one more small class that stores the information of a single self-play game, so that we can keep that state outside the loop over all of the games. Let's call it SPG, for self-play game. Its __init__ takes a game — Tic Tac Toe or Connect 4, for example — and sets up a state, which initially is just the initial state of the game (self.state = game.get_initial_state()), a memory, which starts as an empty list, a root, which is None at the beginning, and a node, also set to None at the beginning. That is everything we need for this class; it simply holds the per-game information for us.
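As a sketch, the container class described here is tiny:

```python
# Holds the per-game state during parallel self-play.
class SPG:
    def __init__(self, game):
        self.state = game.get_initial_state()
        self.memory = []
        self.root = None
        self.node = None
```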
Now we update the selfPlay method. First of all, we will call it less often, since every call now plays num_parallel_games games at once, so when we loop over num_selfPlay_iterations we divide it by self.args['num_parallel_games']. (I also forgot the brackets when constructing the self-play game objects, so let's add those.) Inside the method we can remove the single state variable, since each state now lives in its self-play game object, and we rename the old memory to return_memory, because it is no longer the list we append states, action probabilities and players to during a game, but the list we return at the end. The player handling stays exactly the same: even though we play games in parallel, we only flip the player after we have looped over all of the games, so every parallel game is always at the same player — we all start at player one and then flip for all games together.

Next we create our self-play games: spGames = [SPG(self.game) for spg in range(self.args['num_parallel_games'])], so we build one SPG per parallel game and hand each of them the game. Then we change the while True loop, because that would stop after a single game finishes; instead we keep running until all of the self-play games are done. Since we will remove a self-play game from the spGames list once it is finished, we can simply loop while len(spGames) > 0. Inside the loop, the first step is to collect the states of all of our self-play games: we build the list [spg.state for spg in spGames] and turn it into one big NumPy array with np.stack. Then we get the neutral states: we just rename the variable and call change_perspective with states instead of a single state, so the perspective of all states is flipped in one call — itself a small efficiency win, since flipping the perspective works the same for many states (we multiply the values by -1 when the player is -1). Finally we pass these neutral states into our Monte Carlo tree search, and this part of selfPlay is done.
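A sketch of the start of this parallel selfPlay loop — the self.mcts attribute name for the MCTSParallel instance is an assumption:

```python
import numpy as np

# Fragment from AlphaZeroParallel.selfPlay(): play num_parallel_games games at once.
return_memory = []
player = 1
spGames = [SPG(self.game) for spg in range(self.args['num_parallel_games'])]

while len(spGames) > 0:
    # Stack every game's board into one array and flip them all to the
    # neutral perspective in a single call before the batched tree search.
    states = np.stack([spg.state for spg in spGames])
    neutral_states = self.game.change_perspective(states, player)
    self.mcts.search(neutral_states, spGames)
    # ... per-game action selection and bookkeeping follow below ...
```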
Now let's change the search method of MCTSParallel. It takes states as input — the neutral states we just created — and we reorder things a little so that we work on the policy and the value first. We get the policy and the value for all of the batched states at once, so we get several policies and values back from a single call. Two things have to change for this to work. First, we no longer call unsqueeze, because we already have a batch and don't need a fake batch axis. Second, we have to update the get_encoded_state method, because as currently written it only works for one state at a time: it builds three planes — the fields where player -1 has played, the fields that are empty, and the fields where player +1 has played — and puts those planes first. With a batch, though, we want the first axis to be the batch of states and only then, for each state, the three planes, so we have to swap the axes of the encoded state whenever several states are passed in. We can detect that case by checking whether len(state.shape) == 3: a single state has a shape of length two (rows and columns), while a batch has length three (batch, rows, columns). In that case we set encoded_state = np.swapaxes(encoded_state, 0, 1), passing in the old encoded state and swapping axes 0 and 1, so the new shape is (batch size, 3, rows, columns) instead of (3, batch size, rows, columns). I also noticed that we had a typo in np.swapaxes, so let's fix that, and let's copy this change over to the Tic Tac Toe game as well, so both games support batches.
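A sketch of the updated get_encoded_state on the game classes; the float32 cast is an assumption about how the three planes are built:

```python
import numpy as np

def get_encoded_state(self, state):
    # Three planes: fields of player -1, empty fields, fields of player +1.
    encoded_state = np.stack(
        (state == -1, state == 0, state == 1)
    ).astype(np.float32)

    # If a whole batch of states was passed in (batch, rows, cols), move the batch
    # axis in front of the three planes: (3, batch, r, c) -> (batch, 3, r, c).
    if len(state.shape) == 3:
        encoded_state = np.swapaxes(encoded_state, 0, 1)

    return encoded_state
```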
Now we keep updating the MCTSParallel search method. We get the policies (and values) back as a batch right at the beginning; we don't need the values at the root, so we only keep the policies, and we rename state to states. When we post-process the policy, calling softmax the way we did is still fine, but we no longer squeeze, since we want to keep the batch axis as it is. When we add the Dirichlet noise, we also have to make sure the shape of the noise matches the shape of the policy, so we pass size=policy.shape[0] to np.random.dirichlet — one noise vector for each state, i.e. for each self-play game.

Next we need to allocate this batched policy to the individual self-play games, so we also pass spGames into the search method as an argument. Then we loop over them: for i, spg in enumerate(spGames). For each self-play game we first get the valid moves from its state, which is states[i], because the index from enumerate lines up with the index in the stacked states array. We don't want to mask the whole policy batch, just the row that belongs to this game, so we take spg_policy = policy[i], multiply it by the valid moves, and divide it by its own sum so that we end up with a proper distribution of percentages again. Then we create the root for this game, passing states[i] as its state, store it as spg.root, and expand it with the per-game policy: spg.root.expand(spg_policy). Just by doing this we have already saved a lot of model calls, because all of the root states are evaluated in one batch.
Now we want to keep doing this inside the actual search iterations. Inside the num_searches loop we again loop over our self-play games, and for each one we start with node = spg.root and walk down the tree with the selection step as before. When we reach a leaf, we used to call the model right away — but we want to parallelize that call too. The way we do it is by flipping the if statement around: instead of checking whether we have not reached a terminal node, we check whether we have reached a terminal node; in that case we backpropagate directly, just as before. In all other cases we store the node we reached inside the SPG object: spg.node = node. To make sure we can later tell apart the games that hit a terminal node from the games we still want to expand, we also set spg.node = None at the start of each search iteration, before the selection — so we move that assignment out of the inner block. (And I don't think we need the enumerate here, so let's remove it; it reads a bit more easily that way.)

Now all of our expandable nodes are stored inside the SPG objects, and we want to figure out which self-play games are actually expandable. We build a list expandable_spGames that stores the mapping indices — the indices we will later use to allocate the policy and value predictions back to our list of self-play games. So we collect mappingIdx for mappingIdx in range(len(spGames)) if spGames[mappingIdx].node is not None, since those are exactly the games whose leaf can be expanded. Then we check whether there are any expandable games at all — if len(expandable_spGames) > 0 — and in that case we again stack up all of their states so that we can encode them and get a policy and a value for the whole batch.
So we set states = np.stack(...), where the list comprehension is spGames[mappingIdx].node.state for mappingIdx in expandable_spGames — the state of the expandable leaf node of every self-play game we can expand. Then we encode these states and call the model, just like at the root: we get a policy and a value back, we pass in the stacked states, we don't unsqueeze because we already have a batch, we don't squeeze after the softmax because we want to keep the batch axis, and we don't add any noise this time.

Then we loop over the expandable games again — this doesn't strictly have to live inside the if statement, though it could; it doesn't really change anything. Here we need two indices: i, so we can pick the right row out of the policy, and mappingIdx, so we can hand the prediction at index i back to the self-play game at index mappingIdx — they are not the same, because we only looped over a subset of the self-play games. So we write for i, mappingIdx in enumerate(expandable_spGames). Inside the loop we first take spg_policy = policy[i] and spg_value = value[i]. Then we grab the node — let's store it in a variable, node = spGames[mappingIdx].node, since that reads more nicely — and get the valid moves from node.state. We multiply spg_policy by the valid moves and divide it by its own sum. (We considered calling .item() on the value, but it is already effectively an array here, so we don't need that.) Finally we expand the node with spg_policy and backpropagate the node using spg_value.
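A sketch of this batched expansion step inside MCTSParallel.search(); the torch.no_grad() context and the explicit move of the values back to NumPy are my additions:

```python
import numpy as np
import torch

# After selection: gather the games whose leaf can be expanded and evaluate
# all of their states with a single batched model call.
expandable_spGames = [mappingIdx for mappingIdx in range(len(spGames))
                      if spGames[mappingIdx].node is not None]

if len(expandable_spGames) > 0:
    states = np.stack([spGames[mappingIdx].node.state
                       for mappingIdx in expandable_spGames])
    with torch.no_grad():
        policy, value = self.model(
            torch.tensor(self.game.get_encoded_state(states), device=self.model.device))
    policy = torch.softmax(policy, dim=1).cpu().numpy()
    value = value.cpu().numpy()

# Hand each prediction back to the self-play game it belongs to.
for i, mappingIdx in enumerate(expandable_spGames):
    node = spGames[mappingIdx].node
    spg_policy, spg_value = policy[i], value[i]

    valid_moves = self.game.get_valid_moves(node.state)
    spg_policy *= valid_moves
    spg_policy /= np.sum(spg_policy)

    node.expand(spg_policy)
    node.backpropagate(spg_value)
```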
With that, the Monte Carlo tree search works for a whole batch of parallel games. At the end, the old search method returned the action probabilities, but that isn't very useful inside the parallel MCTS class — it would be awkward to parallelize there — so I'll cut that part out and handle the action probabilities inside the selfPlay method instead. That also means the search no longer returns anything; it just mutates the self-play games we pass in. Let's run the cells to make sure there are no immediate errors — "torch is not defined" was just because I hadn't run the cells above — and it works so far.

Now we update the rest of selfPlay, and then the parallel implementation is finished. After the parallel MCTS search we loop over all of our self-play games again. Remember that we want to remove a self-play game from the list once it is terminal, and it is a bad idea to delete elements from a list while iterating over it from index zero upwards, because then the loop index no longer lines up with the game we mean to access. We fix this by flipping the range around and iterating in reverse. Inside this reversed loop we first grab the self-play game at the current position, spg = spGames[i]. When we build the action probabilities, we no longer loop over root.children directly but over spg.root.children, and we append the result to this game's own memory rather than to a shared one — for the neutral state we can just use the state of the root node, spg.root.state. Then, as before, we compute the temperature action probabilities and sample an action, and we update the state on the SPG object: spg.state = self.game.get_next_state(spg.state, action, player), so the new state is stored inside the class. We also get value and is_terminal from get_value_and_terminated, again called on spg.state. If the game is terminal, we no longer need the local memory variable: instead we loop over spg.memory and append each entry, together with the final outcome, to the general return_memory we defined at the top, and then we delete this finished game from the list with del spGames[i] — which is exactly why we loop in reverse, so the remaining indices stay valid and the while loop eventually ends. We can delete the old return statement inside the loop; after looping over all of the self-play games we flip the player, and at the very end of the method we return return_memory. We also have to change MCTS to MCTSParallel in the constructor, and remove the leftover line that returned the action probabilities — that was still part of the Monte Carlo tree search code we cut out.
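Putting this end-of-selfPlay bookkeeping together, a sketch might look like the following; the exact contents of the stored tuples follow the single-game selfPlay built earlier in the video, so treat those details as assumptions:

```python
import numpy as np

for i in range(len(spGames))[::-1]:      # reversed, so finished games can be deleted
    spg = spGames[i]

    action_probs = np.zeros(self.game.action_size)
    for child in spg.root.children:
        action_probs[child.action_taken] = child.visit_count
    action_probs /= np.sum(action_probs)

    spg.memory.append((spg.root.state, action_probs, player))

    temperature_action_probs = action_probs ** (1 / self.args['temperature'])
    temperature_action_probs /= np.sum(temperature_action_probs)
    action = np.random.choice(self.game.action_size, p=temperature_action_probs)

    spg.state = self.game.get_next_state(spg.state, action, player)
    value, is_terminal = self.game.get_value_and_terminated(spg.state, action)

    if is_terminal:
        for hist_neutral_state, hist_action_probs, hist_player in spg.memory:
            hist_outcome = value if hist_player == player \
                else self.game.get_opponent_value(value)
            return_memory.append((
                self.game.get_encoded_state(hist_neutral_state),
                hist_action_probs,
                hist_outcome,
            ))
        del spGames[i]

player = self.game.get_opponent(player)
```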
Let's run this again and try training our Connect 4 model with the parallelized implementation: we just write AlphaZeroParallel here, and everything else stays the same. I believe the speed should now be up by roughly 5x, which is a massive improvement. So I'll run this cell, we get the nice progress bar, and I'll train for these eight iterations. That will still take a few hours on my machine, but afterwards we will have a model with a very good understanding of Connect 4, and then we can evaluate it again — and maybe play against it in a nicer visual way, and maybe even lose to it; we'll see.

Okay, I've now trained a network with the AlphaZeroParallel class for eight iterations on Connect 4. It still took a few hours — it probably could have been even faster if I had further increased the number of parallel games — but that's fine. Now let's test it in the evaluation cell. First we change the path for the weights to model_7_ConnectFour.pt, since we trained for eight iterations, and we also set the number of searches to 600, copying that value from the arguments above. That is still quite a small budget compared to the search algorithms used in practice, so let's see whether we still get good results. I'll run the cell — oh, sorry, I accidentally ran the wrong one — so let's run this one. Now we get the Connect 4 board. I open in the middle, and the model plays right on top of it; I answer on the side at 4, the model responds, and I try to put some pressure on it by playing 5. I think we're doing fine as long as we're careful, so let's play 4 again and keep the pressure up.
We keep trading moves — I play 3, think about defending on 2, and realize that would walk me straight into a trap — and before long the model has built up threats on two sides at once, so whatever I do I lose one of them. I play 3, the model takes the winning square, and it has won the game. Absolutely destroyed. Admittedly I played quite badly, but it is still very nice to see that even with only around 600 searches the model could beat me comfortably.

Before we conclude, there are a couple of small mistakes I noticed. First, when we compute the temperature action probabilities in selfPlay, we currently aren't actually using them as the sampling probabilities, so let's change that and sample from them. Second, in the train method we should call self.optimizer rather than the bare optimizer name, since referring to the global might not work in all situations. Let's apply both fixes to AlphaZeroParallel as well: use the temperature action probabilities there too and call self.optimizer in its train method.

Lastly, let's have our model play against itself with a nicer visualization than just printing out the board state. For that I'll use the kaggle_environments package: we import kaggle_environments and print its version, in case you're interested in which one I'm running. With this package we first create an environment by calling kaggle_environments.make with "connectx", which gives us the standard Connect 4 environment (it's also tweakable, but the default is all we need here). Next we need some players, which is simply a list of our agents.
Now we also want to define our players. Here I have created this KaggleAgent class and copied it over; basically it does some pre-processing on the state and then either calls our MCTS search or gets a prediction straight from our model, depending on our arguments, so something similar to what we do up here in this loop. Before we create the players as Kaggle agents we also need the model, the game and the arguments, so let's copy those over and paste them here. First of all we have our game of Connect 4, and we do not need a player here. Then we add some arguments: we set search to True, and we also need a temperature. The way this is built, when the temperature is equal to zero we skip the division that gives us the updated policy and instead take the argmax directly, so let's set the temperature to zero so that we always get the argmax of our policy and of the MCTS distribution guided by our model. Let's also set the epsilon to 1 so that we still have some randomness. Next we have our device, we create our model like this, and we use this path right here, so this should be fine. Now we can actually define our players: we set player one equal to a KaggleAgent, and as I said it needs the model, the game and the arguments, so we write model, game and args just like this. Let's do the same thing for player two, which also gives us more flexibility if we want to try different players. Then for the players we can just fill the list with player1.run followed by player2.run.
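To illustrate the temperature argument we just set, here is a small sketch of the idea: temperature zero means taking the argmax of the MCTS visit distribution, while a positive temperature rescales the probabilities before sampling. The exact scaling used in the notebook may differ slightly; this is just the general pattern.

```python
import numpy as np

def choose_action(action_probs, temperature):
    # temperature 0: play deterministically, always taking the most visited move
    if temperature == 0:
        return int(np.argmax(action_probs))
    # otherwise sharpen (temperature < 1) or flatten (temperature > 1) the distribution and sample
    probs = action_probs ** (1 / temperature)
    probs /= np.sum(probs)
    return int(np.random.choice(len(action_probs), p=probs))

mcts_probs = np.array([0.05, 0.15, 0.6, 0.2])
print(choose_action(mcts_probs, temperature=0))    # always action 2
print(choose_action(mcts_probs, temperature=1.0))  # sampled from the distribution
```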
This should be working now, so let's run the cell and get these nice visualizations. We get this neat animation of our models playing against each other; we have two players, but really it is just one model playing against itself, and I believe this should come to a draw. Yes, so our model is advanced enough that it can defend against all attacks, and we still get this nice animation.

Now we can briefly do the same for Tic Tac Toe as well. We just have to change a few small things: first we set the game to Tic Tac Toe, then we adjust our arguments, so let's set the number of searches to 100, and for our ResNet we can set the number of res blocks to four and the hidden dimension to 64. Then we want to update the path; I think the last Tic Tac Toe model we trained was model 2, if I am not mistaken, and here we also have to set the environment to tic-tac-toe as well. When we run this we immediately get a nice animation of our models playing against each other, and again we have gotten a draw, because the model is able to defend all possible attacks, which is still very nice. Feel free to do some further experiments on your own; for example, you could set search to False so that your neural networks play against each other directly, without doing these 100 searches.

And with that I think we are finished with this tutorial. This was a lot of fun, and I have created a GitHub repository where there is a Jupyter notebook stored for each checkpoint, along with a weights folder containing the last model for Tic Tac Toe and the last model for Connect 4 that we trained. If there are any questions, feel free to ask them either in the comments or by sending an email, and I might do a follow-up video on MuZero, since I have also built that algorithm from scratch; if you are interested in that, it would be nice to let me know. So thank you.
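For reference, switching the same visualization over to Tic Tac Toe only changes the environment name (plus, in the notebook, the model and arguments); the random agent below is again just a placeholder for the model-backed players:

```python
import random
import kaggle_environments

env = kaggle_environments.make("tictactoe")

def random_agent(observation, configuration):
    # any empty cell (value 0) on the flat 3x3 board is a legal move
    empty_cells = [i for i, cell in enumerate(observation.board) if cell == 0]
    return random.choice(empty_cells)

env.run([random_agent, random_agent])
env.render(mode="ipython")
```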
adapting it to various domains so let's get started okay great so let's start with a brief overview of the alpha Zer algorithm so first of all it is important that we have two separate components and on one hand we have this self play part right here and yeah during this phase our Alpha zero model basically plays with itself in order to gather some information about the game and while we play with ourselves we also generate data and this data will then be used in The Next Step during training so in this training phase right here which is the second component we like to optimize our model based on the information it gained while playing with itself and then next we basically want to fulfill the cycle and then use this optimized model um to play with itself again so the idea is essentially to repeat the cycle and number of times until we have yeah reached a Nur Network so a model right here that is capable of playing uh a certain game better than any human could because it just play with itself used The Information Gain to optimize uh yeah itself and then play with itself again basically um and then we just like to repeat that so in between these two components we have this alha Zer model right here which is just a n network and now let's look at this n network more closely so let's see how this actually is architectured so basically we take the state SS input and for the case of tikto for example the state would just be board position um so just turn to numbers and then we once we have the state has input we receive a policy and the value as output so the policy here is this distribution and for each action basically it will tell us how promising this action would be based on this state we have received here input so it will basically tell us where to play based on this state right here so this is the policy and then the value is just a float and it will basically be a measure telling us how promising the state itself is for us as a player to be in so maybe a concrete example that make might make all of this easier to understand uh is something we have down below here so here for the state we just take in this position and remember that we are player X in this case and yeah if you're familiar with the roads of titico you should see that we should play down here because um yeah then we would have three exis on the lowest row and yeah we would win the game so if our model also um understands how to play this game then we would like to receive a policy looking some word like this where basically the highest bar right here is also this position uh down below uh that would basically help us to win the game so yeah this is how the policies should look like and for the value uh we can see that this St is quite nice for us as a player because we can win here so we would like to have a very high value so yeah in this case the value is just float in the range of negative 1 and positive one so we would like to have a value that is close to positive one because this state is quite optimal okay now we uh yeah looked at how the general architecture is built and how the model uh is built but next we want to understand the safe play and the training part more in depth so before we do that you should also understand what a monal researches because that is vital for selfplay and the general algorithm so the monteal research is this search algorithm right here and it will take a state on our case a Bo position as input and then it will find the action that looked most promising to us and this algorithm gets 
this information by just first of all declaring a root note at the initial state right here and then by building up this tree into the future so just like we can see right here so basically the idea is that we um create these notes into the future uh right here and we get these notes by just taking actions that lead us to the Future so this note right here for example example is just the place we arrive at when we take this action one uh based on this initial State at our root note so each note first of all stores a state like S1 in this case and um so S3 right here would just be the state we get when we first of all play action one based on this initial State and then when we would play action three based on this state right here so playing these actions um after each other basically then additionally to the state each note also um stores this variable W and W is just the total number of wins that we achieved when we played in this direction into the future so basically when we took action one initially and arrived at this note we were able to win three times when we walked into the future even further so next to this variable we also have this n variable right here and N is just the total visit count so basically the total number of times that we went into this direction so here we can say that we um took action one four times and out of these four times we were able to win three times right so after we have um basically built up a tree like this um we can then find the action that looks most promising to us and the way we do this is by looking at the children of our root node so these two nodes right here and then for each node we can calculate the winning ratio so here we can say that this node won three out of four times where this node one one out of two times then we can say that you know what this node right here has a higher winning ratio and thus it looks more promising to to us and because of that we can also say that action one looks more promising than action two because this action was the one that even led us to this note here in the first place so basically the result of this tree right here that we got at this time would be that action one U should be the action we we would like to play then we could just use this information to uh say you know what in our actual game of tic tac to for example we now want to use this action right here so this is great but now the question becomes how can we even build up a tree like this and also how do we get the information whether we win at this given position or not right so the way we come up with all of uh this data right here is basically by working down uh all of these four steps for a given number of iterations so at the beginning we like we are in our selection phase and uh during selection we basically walk down our tree but we walk down our tree until we have reached a so-called Leaf node so a leaf note is a note that could be expanded even further um into the future so uh one example of a leaf note would be this note right here because you can see that we have only one child is this note but actually we could also expand in this direction if we say that for each note there are always two possible ways in which you could expand in this example right here so this is a leaf note and then also this would be a leaf note right here because this note doesn't have any child right and then also we can say that this note also is the Lea note because it only has one sh right here but we could still expand it in this direction so yeah 
these are Leaf notes for example while these aren't any Leaf notes because yeah this one already has two children and this one also has two children already so yeah this is basically how selection looks like but now the question also becomes um when we walk down uh which direction we should pick right so when we just start at the top we now have to decide whether we want to walk down in this direction or whether we want to walk down in this direction right here basically the way of finding out the direction we choose when walking down is by taking the direction or the child right here that has the highest UCB formula like this so while this might look a bit complex um in reality we just take the child that um tends to win uh more often while also trying to uh take a child that has has been visited just a few number of times relatively right so on one hand uh we when we maximiz this formula we want to take the child with the highest winning ratio and also um want to take the child that has relatively few visits compared to this large n uh which is just the total number of visits for the parent so if we would calculate the UCB formula for both of these nodes right here we can see that the UCB formula here is higher because um this note has been visited uh less often right and because of that we would like to walk in this direction right here okay so during selection we can now say that we walk down in this direction right here and now we have arrived at this note so first of all we have to check if we work down walk down further or if we have in fact reached the leaf note and now we can say see that we have reached a leaf note right here right because we could still expand in this direction so we could still have a child right here and because of that we actually skip to the next phase of the MCTS and that is the expansion phase so right here um in expansion we want to create a new node and depended to our tree so I will just create this node right here and I will call it I give it the state as seven then we can say that we took the action A7 to get to this newly created note right here and at the beginning uh this note should have a winning count of zero and also a visit count of zero so I'm just going to declare W7 right here and also N7 like this and both of them should just be equal uh to zero at the beginning right because we have just created uh this note and eded it to our tree but we have never been there actually so now we can say that we further work down here um to the note we have just created and um so now we have also finished this expansion phase right here and next uh we have arrived at this simulation phase and here the idea is basically to play randomly into the future and until we have reached a state uh where the game is over right so we want to play into the future until the game is finished and basically here we can just say we play using random actions so we just play randomly and finally we have just reached this node right here and yeah here the game is over so this is uh the simulation phase and now uh when we have reached this terminal node basically we then have to check whether we have won uh the game or we have lost or a draw has ured so in this concrete uh case we actually have one and with that we can also finish the simulation phase so just finally now uh we have arrived at the back propagation phase and here we like to get the this information from our simulation and then we want to give this information all the way up to our parents uh until we have reached 
the root note so here because we have won the game actually and during back propagation we can um first of all higher the number of wins for this note right here so we can say that we have one one time actually then we can also hire the total number of visits so here we have have visited this node one time and we have won in this case and because we want to back propagate all the way up we can now also say that here for this note um the number of wins now is two so the number of wins we got when we walk in this direction right and then also the total number of visits should now be three because we have yeah visited it when we selected inside yeah in this direction and now we can also back propagate all the way to our roote and here we can just say that the general visit count is seven and the total number of wins would be five right okay so now this would be a whole iteration of this MCTS process and um yeah then if we would again um choose an action we would just uh check the children of our rout again and again could just choose the action based on the winning um yeah amount of winds right here so this is great uh let's also just have an example where we start from all the beginning so that you yeah might find it easier to understand the search algorithm as well so again we want to perform these steps right here so at the beginning we are just in our selection phase so we want to walk down until we have reached the leaf node but here you can see that this root node already is a leaf node right so basically we can now move on to expansion and create a new node right here so let's say that we just create this Noe right here by taking action one and then we will arrive at State one here as well and then we can say that at this um note that we have just created the total number of wins right now should just be zero and also the visit count should just be zero um so we have W1 and N1 just created right here and yeah both of them can just equal zero like this so now we have also finished the expansion part for the first iteration and next comes the simulation part right so we have walked this way basically right here and now we want to simulate into the future just by taking these random actions so let's write this right here and now we have reached a terminate node and now we have to check whether we have won or lost or a draw is secured again so in this case we can just say that we have one again so because of that we actually can back propagate now and in back propagation we can say that the number of wins here is one with the visit count being one as well and then we can do the same for our root node right here so we can just say that the number of wins for our root note is also one and the visit count of our root note is also one so now that we have finished with this iteration I'm just going to clean everything up here and we can move on to the next iteration right so again we start at selection and we walk down until we have reached a leaf note but again we can see that our root note actually still is a leaf note so let's just expand uh this direction now and here we will just use action two and we will arrive at state two and here again we have to set the total number of wins to zero so W2 should be equal to zero so this also S2 and then the visit count so N2 should also equal Z right here so let's write N2 equal Z as well so let's do it like this and yeah now basically we walk in this direction right here and next we can do simulation again so let's simulate all the way down until we 
have reached a termin Noe and in this case uh we can say that we have actually lost the game so when I back propagate all the way up I will only increase the visit count right here so the physic C should now be one but the total number of wins should still stay the same uh being zero because we have lost here when we simulated so because of that when we back propagate to our root mode we should also only increase the visit count while leaving the uh total number of wins uh the way it was earlier so let's also finish up with this iteration right here and now we can move on to the next iteration so first of all we want to do selection again so we want to walk down until we have reached the leaf note and now our root node is fully expanded so we have to calculate the UCB score for each of our children right here and then we have to select the child that has the highest UCB score right so when we calculate the UCB score for both of them we can see that the UCD score here is higher because we have won the game in this case so we just walk down this direction right here and now uh basically we can move on to expansion so let's expand in this direction here so let's create this new node right here and append it to our tree and we took action three actually when we um expanded in this direction and then the State here should also be S3 because of that and now again we can declare this total number of wi W so W3 and the visit count N3 and both of them should just equal zero at the beginning so let's set them to zero and now again we can perform this um basically can perform Sim simulation again so we walk down in this direction right here and yeah so now let's do simulation and when we do that um we just arrive at this terminal note again and again we can now see that in this case we have just lost uh the game after just uh playing randomly and because that when we back propagate we only want to increase the visit count right here while not changing the total number of wins so let's set the visit count to two years well and and then we can set the visit count to three okay so this is this iteration right here great um so now we can just Al clean all of this up right here and then we can move on so next uh when we want to keep doing this process right here we again have to do selection right so now we have to again calculate the UCB formula for both of these values right here and here we can see that the UCB formula actually gives us a higher result for this note just because of the visit count that is lower right here so when we now walk in this direction we again have to check if we have reached the leaf not or not so here we have just reached a leaf note so because of that we move to expansion and create this new node right here and this should just be equal to S4 then so we can say that we took Action Four right here and then again the W and the this account um should be set to zero sorry color um so W4 and and in for here being our number of wins and our visit count and let's set them to zero l so now we also walk in this direction right here and again after um expansion we can now move to simulation so when move here um the game is terminal and in this case we can just say that draw is ured and you know what in case of a draw we just would like to add 0.5 to the total number of wins because uh no player has won here or no one has won so um because of that when we back propagate here first of all we can increase the visit count and then let's set the number of wins to 0.1 and then up here we 
can set the visit count to two and again the total number of wins should be 0.5 and when we back propagate all the way up to our root node then we have the total number of visits as being four and our um total number of winds should be 1.5 so this is just a an example of how the multicol research might actually look when you just start at the beginning and actually you can repeat uh this cycle for just a set number of iterations and you uh set this number manually at the beginning and alternatively you could also set a time and during this time you would just perform as many iterations as you can but in this case for example you could just stop it for iteration ations but in practice you might run for thousands of iterations okay so now we can also look at the way how our monal research changes when we adapted to this General Alpha Zer algorithm so there are two key changes that have been made here so the first thing is that we also want to incorporate the policy that was yeah gained from our model into the search process right and especially we want to add it to the s ction face right here and we can do this by incorporating it into this updated UCB formula so this way you see this P of I part right here so basically when we select a child and we want to yeah take the yeah CH with the highest UCD formula then we will also look at the policy that was assigned to it from its parent perspective so remember that the policy is just this distribution of likelihoods and basically for each child when we expand it we will also store this policy likelihood at the given position um here for the notes as well and because of that we then also tend to select children more often that were assigned a high policy by its parent right so this way our model can guide us through the selection phase inside of our multical research so yeah this is the first key change here uh so just generally this updated UCB formula right here and then for the next change we also want to use the information of the value that we got from our Nur Network and we can use this information by first of all completely getting rid of this simulation phase yeah right here so we don't want to do these random roll outs into the future anymore until we have reached a terminal state but rather we will just use the value that we got from our Nur Network when it basically evaluated a certain State and this value will then be used for back propagation right so this way we use use both the policy for our selection and the value for our back propagation of our Nur Network and because of that uh we know that our Monti car research will improve drastically when we also have a model that understands how to play the game so this way at the end uh we then yeah have a better search with a better model that can even create a much better model right so we can this uh yeah keep the cycle up this way okay so there's also just a small change uh here as well uh just a minor one and remember that when we get policy from our um Nur Network we immediately get this distribution right with the probabilities for each potential child and because of that it's much more convenient now to expand in all POS possible directions during the expansion phase so we won't only create one new note but rather all of the possible notes uh that we can yeah expand on right so this just a minor change and so in order to make this easier to understand we can just do some iterations here on the Whiteboard together so yeah this way it might be easier to fully understand how this 
multicult research is adapted so first you might also see here that we also store this policy information here instead of our note and yeah this would just be the probability that was assigned to it from its parent right and for the root node we just yeah have no policy probability because we also have no parent but for other notes we will also store this policy here instead of our notes okay so let's start with iteration here so first of all we want to walk down until we have reached the leave mode and here we can already see that our root node is in fact the Lea note because we can stay expanded uh in these two directions here so because of that we will then yeah move to the next step right here which is the expansion phase and here we will actually create our new nodes so now when we expand we also want to get a policy and the value from our Nur Network so because of that I will first of all write p and then value equals just calling our Nur Network it's the state of our root no and let's say this policy right here will just be 0.6 for the first CH and then maybe 0.4 for the second CH and let's say that this value so the estim of how good this state of our roote is is just 0.4 here okay so now when we expand we will create this these two new nodes right here so these two and here we have action one right and here we have action two so because of that we also have state one here and we have state two right here so now we can also add the number of wins the visit count and the policy uh information to our notes right here so first of all I will create this W1 and set that equal to zero then I will create this n one and set that also equal to zero and now we have this policy probability one and this should be equal to the policy for action one and here the policy for action one is just 0.6 so yeah the policy see here should be 0.6 because of that so now we can also add the information here to the other child so let's first the number of Wis or the value later on here zero way for W2 let's set N2 to Z as well and the policy probability for the this child right here should now be equal to 0.4 right um like this okay so now we have expanded here and we skip this step here so now we move to the back propagation step and here we just want to back propagate this value right here so we will just do this by um yeah just adding it right here so basically in this case it doesn't matter much for our root note but maybe it's some nice information so we can just add 0.4 here and then because of that also change the visit count to one minut okay so this was one iteration and now we can move on to the next iteration so I will just clean this up okay so now again we want to walk down until we reach a leaf note so now we have to decide between those two children right and for each we have to calculate the UC B formula and then also uh we will just walk down in the direction where the U UCB formula is the highest right so when we calculate this UCB formula we will see that no longer um can we basically get the winning yeah um winning probability here because we have a visit card that is equal to zero So currently the way this is implemented we get we would get a yeah division by zero error so here we have to make sure that we will only check for the winning probability if we have a visit count that is larger than zero so because of that I will just set that in Brackets right here so we will only check for this if we have a visit count that is larger than zero and in all other cases we will just mask it out 
so we will just check for this part right here okay so now we can find out the child where this value is the highest and we can see that we will just pick this ch right here because it has the higher policy probability and because of that it's UCB formula will also be higher so because of that we would just walk in this direction right here so now we have reached a be note here and we would like to expand this note here so let's expand in these two directions so first of all this should be action three and this should be Action Four right here so when we create these new States we have state three and we have state four like this okay so now again U we also have to calculate a new policy and a new value so we will get this policy and this value by calling our Nur Network f with this State here as so S1 because yeah this is the position we have walked down to and from here we want to find out what the policy would be for these children and also what the value would be for this notes directly so let's say that the policy here is just 05 and 0.5 for both and let's say that this value right here is just 0.1 so now um first of all we can also add this information here to our notes again so let's first of all set W4 equal to Z then we can say N4 equal to Z and then policy probability for should be equal to 0.5 and then here we also have your W3 which should just be zero and we have N3 that should also be zero and then we have also our policy probability three here and this should be equal to 0.5 you can read that here okay so yeah now we have expanded these two children right here and yeah created them and Stor the policy information here inide them so now we want to back propagate again so here we will just use this value of 0.1 so actually we will add 0.1 here and we will raas this visit count by one so let's change that one right here and remember that we want to back propagate all the way up to our root note so here we will change the value so w0 here to 0.5 so also add 0.1 to it and then we will just erase this number visit by two is by one as well so it will be two afterwards so yeah this is just this updated tree after just two iterations but yeah again if we would repeat this we would just first of all walk down until we have reach reached the leaf note then we would expand and use the policy um here when we store the policy information uh in the new children and then we would just use the value for back propagating up all the way in our tree to our root mode right so yeah these are just the small changes that have been added to our multi research great so now that you know what a multicol research is and how we adapt our multicol research in during our Alpha zero algorithm we can actually move to the self play part right here so basically during self play we just yeah play uh games with ourselves from the beginning to end and so we start by just getting this initial state right here with this uh which is just a completely blank State and yeah we also are assigned to a player and then based on this state player um yeah position uh we perform a monticola research and yeah this multi research will then give us this distribution as a return uh which is equal to the visit count distribution of the children of our root node right and once we have this distribution we then would like to sample an action out of this distribution so in this case right here uh we for example sampled this action uh which is highlighted here because yeah it turned out to be quite promising during during our 
multicol research so once we have sampit this action we want to act based on it so we play here basically as player X and then uh yeah we move to the next um State and we also switch the player around so that we are now player Circle or player all and we have this state basically as input at the beginning and then again uh we want to perform this multicolor Tre search right here and we gain this MCTS distribution then we just uh yeah sample uh one action out of it uh we play based on this action and um then we can basically change the perspective again so that we are now play player X and we basically yeah continuously perform a Monti car tree search sample an action out of it play uh this action right here and then change the perspective so that we are now the other player or the opponent and we do that until we actually have reached a terminated state so once a player has won or a draw has occurred and yeah so when this is the case we want to store all of this information we gain well playing um to the training data so let's first of a look how this information is structured so basically for each uh State uh we also want to store the MCTS distribution and then we also need to find out the reward for the given State the reward is equal to the final outcome for the player uh that we are on this given state so basically in this example right here x won the game so that means that for all states when we were player X we want the final uh reward or final outcome be equal to one and in all cases when we were player o we want the reward to be negative one because we lost the game so basically that means that you know what we want this game is play X right here so we might just guess that this state also is quite promising because yeah this state led us to win the game eventually so this is why we turn change this reward to positive one here when we are play X and this also the reason why we change the reward to negative one when we are player um player o here so these um combinations of the state the MCTS distribution and the reward will then be stored as two pits to to our training data and then we can later use these um for training right in order to improve our model so this is great but now we have to understand uh how training works so yeah let's look at this right here so at the beginning we just take a sample from our training data and yeah uh you should know now that this sample is the state the MCTS distribution and pi and the reward Z right here then we will use the state S as the input um for our model then we will get this policy and this value um out as a return and now for training The Next Step basically is to minimize the difference between the policy p and the MCTS distribution at the given State Pi on one hand and then we also want to minimize the difference of our value V here and the uh final reward or final outcome z uh we sampled from from our trending data and the way we can minimize um the difference basically in a loss is first of all by having a mean squared error uh between uh the reward and the value here and then by also having this multitarget cross entropy loss between our MCTS distribution pi and our policy P right here then we also have some form of a true regularization at the end but yeah so essentially we want to have this loss right here and then we want to minimize the uh the loss by back propagation and this way we actually update the weights of our model Theta right here and we also thus get a model that better understands how to play the game and and 
that has been optimized and then we can use this optimized model to again play with itself in order to gain more information in order to train again and so on right um so this is how alpha zero structured and now we can actually get to coding so let's actually start by programming Alpha zero so first of all we're going to build everything inside of a Jupiter notebook since the interactivity might be nice for understanding the algorithm and also this might help you if you want to use Google cab for example to use quid's gpus to train alha zero more efficiently and we will start everything just by creating a simple game of tic tac toe and then we will build the monteal Tre search around it and after we have gone so far we will eventually build Alpha zero on top of the multicol research we had previously and then we will expand our portfolio to connect for as well and not only should this be easier to understand but it should also show how flexible f z really is when it comes to solving different environments or board games in this case so for tic TCT toe we want to use numai as a package so I will just import numi S&P right here and if you are wondering my version currently is this right here so let's just create a class of t t to and a simple init and we first of all want to have a row count which equals three the column count as well should also be three and then we also need an action size and the action size is just the amount of rows um multiplied with the amount of columns so s. row count times s. column count uh so that's great and now we can write a method to get our initial state so this is what we will call at the beginning and at the beginning the states just full of zeros so we will just return np. zeros and we AP here is the row count for the number of rows and the column count for the number of columns and next we also want a method that will give us the next state after action has been taken so we write get next state here and as input we want the previous state the action and also the player so the action itself will be just an integer between 0o and8 where zero will be the corner up left and eight will be the corner down right so actually we want to encode this action into a row and a column so that we can use it inside of our number array so the way we do this is that we Define the row is the action divided by the column count but we use an integer division here so we will get no floats and then for the column we can use the mod of the action and the column count as well so if our action would be four for example then uh our row would be for integer division by three which would just return one and our column would be for mod low 3 which is also one so we would have Row one and then column one as well as so the middle of our board which is exactly what we want so then we can set the state at the row at the given column to the player we had um as input and let's also return the state so our code is more readable so that's awesome and next we also want a method that will tell us uh what moves are actually legal so these are just uh um the moves where the field is equal to zero so let's write a method get valid moves and also give a status input and then we can just return um state. reshape 1 so we flatten out the uh State and then we can check if the state is equal to zero and this will give us a Boolean array back so true or false but it's quite helpful if we have integers here so I will turn the type to np. 
and8 so unsigned integers and yeah that's awesome so now we also want a method that will tell us if a player has won after he um took or he acted in a certain way so let's also write a method for that so just Dev check win and here we want the state and then action so let us first get the row again so we just say action integer division s. column card and let's also get the column here um nice so now we can also get the player uh that played in the given position and the way we do this is just that we check the state at the row at the column so so basically we turn this expression here the other way around um and actually in Tic Tac Toe there are four different ways in which you could win a game so first of all you could have three in a row like this or you could have three in a column like this or three in this diagonal or three in this diagonal right here and we want to check for all of these four conditions and if one of them turns out to be true then we will return true as well and none of them are true uh then we just return for it so let's write out all of these Expressions right here so first of all we can just check if there are three in a row so um we will use np. sum of the state of the given row and all of the columns inside of that row and we want to check if that sum right here is equal to the player times the column count uh yeah and then we can also check whether there are three in a column like this so we just use np. sum of state of all of the rows for a given column and then again we check if that equals player times. column count or row count at this case I mean let's be more flexible here and that's great so now we want to check for this diagonal and the way we can do this is by using the np. DI method so I pulled up the documentation right here and if we would have an array like this which we also want to have for Tic Tac top and then we can use this di method right here to get um the values from this diagonal so let's use this to also check the uh someone has one so we use np. sum of np. di of slate and we'll check if that's equal to player times se. row count or se. column count in this case it doesn't matter because you can only get this full diagonal if row count and column count is the same and next we also want to get this opposite diagonal right here and there is no way to do it with the normal di method but we can just flip our state and then use this old di method again and since our state is flipped it would be as if our DIC itself is flipped if that makes sense so we can just use all np. sum of np. DIC of np. flip of the state let's just say x is equal zero so we flip like this and then we can take the opposite di which is what we want so I also check if that equals player time sa. R so awesome um and next we also want to check if the if there has been a draw so um if the game terminated in any other way um since this might also be a scenario where we want to stop the game so let's just write a new method get value and terminated and we want a state and in action and first of all we will check if uh check win is true in this case so if check win or save to check win of state of action then we just return one for the value and true for terminated and then we can also check if there has been a draw and that we know that there has been a draw if the amount of valid moves is zero so we can just take the sum of the valid moves and check if that's zero so I say if NP do sum of save. 
get valid moves of the given State equal zero then we just return zero for the value since no one has won the game and we will also turn return true for terminated and in all other cases the game must continue so let's just return zero and FSE awesome so now we got a working game of Tik toac to and additionally would also want to write a method to change the player since that might be different for different board games so let's just write method get opponent and yeah take take a players input and then we just return the negative player so if our initial player would be negative one we would return one and if our initial player would be one then we just return negative one so that's great and now we can test our game we build right here so I just say tick tock toe equals Tick T toe then I just say player equals one and I say State equals Tik Tac toe. get initial State awesome so then let's just bu while Loops um just say while true we'll print the state right here then we want to get our valid moves so we know where to play so I just say valid moves equals take T to do get Val moves of the state we have currently and we can also print that so I just say valid moves and you know what let's print the valid moves in a more readable perform since currently we only have zeros and on but we actually want to get the position where we can play so I just create this array right here and I say e for e in range and then we can just say Tic Tac toe. action size So currently we get all the indices for um the possible actions and then we want to check if the ini itself is valid so we say if valid moves of I equit one right so uh we want to get all these uh Ines and this can just be printed out like this and then we want to get an action so the action will just be an input that will be casted to an integer and let's just um write out the player like this um and get the action right here so then we want to check if the action is valid so we'll say if valid moves of action equates zero then we just say print um action not valid and we will continue so we'll ask again basically and in all other cases we want to move so we create a new get a new state and the new state will be created by calling Tik Tac toe. get next state and we want to give the Old State the action and the players input um great um so then we can check if the game has been terminated so we'll just say value is terminal equals Tic Tac toe. 
get value and terminate and then we want to give the state and the actions input and then we say if this terminal print the state so we know where we are and then we will check if value equates one and then we say player has one so uh just say player so one or negative one for the player and in all other cases um we can just print that there has been a draw great and then we also want to break our while loop so uh we can end the game and in all other cases if the game continues we also want to flip the player um so we just say player equals Tick Tac to.get opponent of player so nice this should be working so let's test this out so here are the valid moves so this looking nice so let's just pick zero for starters see nice okay we played here so we are player negative one now so let's just play four at play Zero we can just say eight Play 1 one for example play Zero i' say two and play negative 1 we can just say seven and nice we see here we got three negative ones inside of this column right here and thus we get the result that play negative one has one perfect so now since we have got our game of TK already we can actually build the monticolo research around it so let's just create a new s right here and then we want to have a class for our multical research so I'm just going to call that MCTS for now and then we have our inet here and we want to pass on a game so take to in this case and then also some arguments so these are just hyper parameters for our multicol research so inside of our init we can say s. game equals game and s. ARS equs ARs and then below we'll Define our search method and this is what we'll actually call when we want to uh get our MCTS distribution de so let's just write the search and for the search search we want to pass a state as input and inside here first of all we'd like to define a root Noe and then we want to do all of our search iterations so for now we can write for search in range of safe. ARS num searches and then for each search we'd like to do the selection phase then the expansion phase then the simulation phase and then the back propagation phase and then after we have done all of these things for the given number of searches we just want to return the visit count distribution and uh it's the distribution of visit counts for the children of our root not so let's just say return visit counts yeah at the end yeah so that's the structure we have inside of our multicol research and next we can actually define a class for a node as well so let's write class node here and then have our init again and first of all we'd like to pass on the game and the arguments from the MCTS itself and then also we want to have a state as a note and then a parent but not all nodes have parents so for the root node uh we can can just write none here as a placeholder and then we also want to have an action taken but again we have a placeholder of none um for the root n here as well so inside of our inet we can just say save. game equals game safe. arss a safe. state will equal State and then again save. parent equals parent and sa. action taken equates action taken so next we also want to store the children inside of our note so let's write s. 
children equals just an empty list so these are yeah the children of our note and then also we want to know in which uh ways we could further expand our Noe so for example if the root node would just be an initial State then we could expand in all nine fields for a Tic Tac to board and yeah we want to store this information here so let's just say self do expand moves equals game. get valid moves of State at the beginning then as we keep expanding here we will also remove uh these expandable States from our list here and additionally to all of these attributes we also want to store the visit count and the value sum of our notes so that we can later perform our UCB method and get the dist distribution back at the end so let's say sa. visit count equal zero at the beginning and sa. value sum should also equal zero uh when we start so yeah that's uh just the back or the beginning structure of our node and uh now we can Define our root node here so let's say root equals note and we have save. game save. ARs and then the state we took here as input and yeah parent and action can just uh stay none then inside of our search we just want to say node equates root at the beginning and next we want to do selection so from the video you might remember that we want to keep selecting downwards the tree as long as our nodes are fully expanded themselves so for doing so we should write a method here that would tell us so def is fully expanded and a node is fully expanded if there are no expandable moves so that makes sense and also if the number if there are children right so um if a note would be terminated you obviously can't select past it because the game has ended already so let's say return np. sum of s do expand a moves and that should be zero right and Len of s. children should be larger than zero so this way we can check whether node is fully expanded or not so during our selection phase we can just say while Noe is fully expanded and here we just want to select downwards we say node equals node do select so now we can clear this right here and next we want to write a method that will actually select down so I'm going to write that here and the way selection works is that we will Loop over all of our children as a note and for each child we will calculate the UCB score and then additionally at the end we will just pick the child that has the highest UCB score so for the beginning we can just write best child equals none since we haven't checked any of our children yet and then best UCB should just be negative mp. infinity and then we can write for child in safe. children and then UCB equal save. 
getet UCB of this child right here and then we can check if UCB is larger than best UCB and if that is the case we say best child should be child and best UCB should also equal youb now right and at the end we can just return best child here um so next we should actually write this method right here we use for calculating the UCB scope and yeah we want to take a child here as input and for the UCB scope we first of all have our Q value and the Q Valu should be the likelihood of winning for a given note and then at the end we have our C which is a constant that will tell us whether we want to focus on exploration or exploitation and then at the end we have a math square root and then we have a log at the top with the visit count of the parent and then below we have the visit count of the child so we want to uh first of all select notes that are promising uh from the win uh likelihood uh perceived by the parent and then also we want to select notes that haven't been visited that often compared to the total number of visits so first of all we should Define the Q value and uh that should generally be the visit uh sum of our child divided by its visit count uh sorry its value sum so the value sum of our child divided by its visit count and the way we implemented our game would actually be possible to have the negative value and generally we would like our Q value to be in between the range of 0 and one because then we could turn it into a probability so um currently it's between negative 1 and positive one so let's just add a one here and then at the end we can also divide everything up by two and now there's just one more thing we could we should think about and that is that actually the Q value right here is what um the child thinks of itself and not how it should be perceived as by the parent because uh you should know that in Tik Tac Toe the child and the parent generally are different players so as a parent we would like to select a child that itself has a very negative or very low value because as a player we would like to put our opponent in a bad situation essentially right so we want to switch this around and we do this by saying one minus uh our old Q value here so if our child has a now normalized value that is close to zero then as a parent this actually should give us a q value that is almost one because this is very good for us to uh walk down the tree in a way that our child um is in a bad uh position so now we can just return our UCB right here so I'm just going to write Q Value Plus safe. ARS of c and then that should be multiplicated with math. square root and we should also import meth up here so that we oh sorry like that um so that we can use it and inside of our square root we have math.log and here we have the visit count uh as a note ourselves so just save. visit count and then we can divide everything up by child. Vis account yeah so that should be it um perfect so now we actually have a working selection method here and now we can move on so now we have moved down all the selected down um the tree as far as we could go and we have reached a leaf note or a note that is a terminal one so before we would like to expand our Leaf node we still have to check whether our node we have finally selected here is a terminal one or not and we can do this by writing value is terminal equals s. 
game.get value and terminated and inside so we are referencing this method right here um so here we just want to write State and action uh as the input so we can take the state of O note here we have selected and then also The Last Action that was taken so just node doaction take and we still have to remember one thing and that is um that the note always gets initialized this way so the action taken was action that was taken by the parent and not the not so it was an action that was taken by the opponent from the notes perspective so if we would actually have reached a terminal note and if we would get the result that yeah a player has one here then the player who has one is the opponent and not the player of the note itself so if we want to uh read this value here from the uh noes perspective we would have to change it around so here uh ins inside of our game method we should write a new method def get opponent value and here we can just say negative value um so if um from the parents perspective we have got a one right here is a return and we would like to turn it into a negative one from the child's perspective because uh it's uh the opponent that has one and have the child itself uh yeah so let's use it right here so let's say value equals save. game. getet opponent value of the old value and we should also keep one thing in mind and that is that we're using this no. action taken method right here right but actually at the beginning we just have our simple root node and we can see that the root node is not fully expanded right because it has all possible ways we could expand it on so we will call this right here immediately with our root Noe and the way we initiated our root no um we still have action taken is none right and if we call this method right here with action taken being none um so we call this with action taken being none and we'll move to check one with action being none and then we will uh do this expression right here and this will just give an error so we have to say if action equates none just return forwards right because for the root note we always know that uh no player has won since no player has played yet so we can just uh leave it like this and perfect so now we're finished with uh all of this and next we want to check if uh this note is termined and then we could back propagate immediately and if not then we would do expansion and simulation so I'm writing now uh here if is if not is terminal then we want to do this right here and at the end we can just back propagate right okay so yeah now we can do expansion and inside of this if statement I can just write note equit node. expand um so we just like to return the note that was uh created by expanding here and then inside of our node class we can write a new method uh that will do the expansion job so let's say Dev expand and okay nice so the way expanding works is that we uh sample one expandable move out of these expandable moves we have defined up here and then once we have sampled this move or action uh we want to create a new state for our child we want to act inside of this state and then we want to create a new note um with this uh new state we have just created and then we'll append this node to our list of children here so we can later reference it inside of our select method and uh yeah at the end we can just return the child that was created newly so for now let's get an action here and yeah we want to sample it from above so let's use np. random. 
np.random.choice picks a random element out of a list or NumPy array, and inside we pass np.where(self.expandable_moves == 1): the condition is that expandable_moves equals 1, because the way we programmed it, all values that are 1 are legal moves we could still expand on. With np.where you always take the first element of the result, so we index with [0]. In short: we check which moves are legal, np.where gives us the indices of all legal moves, and np.random.choice randomly samples one index, which becomes our action.

Next we have to make this move non-expandable, since we have already sampled it: self.expandable_moves[action] = 0.

Now we can create the state for our child. We start by copying our own state, and then we act inside it: child_state = self.game.get_next_state(child_state, action, 1), passing the child state, the action we sampled, and player 1. You might have noticed that we never defined a player, which seems odd since Tic Tac Toe obviously has two players. The way we program our Monte Carlo tree search, though, we never store the player on a node or change it; instead we always change the state of the child so that it is seen from the opponent's point of view, while the opponent still thinks he is player 1. So the parent and the child both believe they are player 1, but instead of switching the player we flip the state: all positive numbers become negative and vice versa. This makes the logic much simpler, and it even keeps the code valid for one-player games, which is very nice.

So we write child_state = self.game.change_perspective(child_state, player=-1), because when we create a child node we always want it to be the opponent. In the game class we add def change_perspective(self, state, player): return state * player — changing the perspective to player -1 turns all positive ones into negative ones in our Tic Tac Toe game.

Now the child state is ready, so we create the child itself as a new Node: the game is self.game, then self.args, the state is child_state, the parent is ourselves, and action_taken is the action we sampled above. We append this child to self.children, and at the end of the method we return the child we have just created.
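Here is a sketch of the expand method just described, together with the small change_perspective helper on the game class; it assumes the expandable_moves array and the get_next_state method introduced earlier, with numpy imported as np as elsewhere in the notebook:

```python
# In the Node class:
def expand(self):
    # Sample one untried action and mark it as used.
    action = np.random.choice(np.where(self.expandable_moves == 1)[0])
    self.expandable_moves[action] = 0

    # Build the child's state, then flip it so the child can keep
    # treating itself as player 1.
    child_state = self.state.copy()
    child_state = self.game.get_next_state(child_state, action, 1)
    child_state = self.game.change_perspective(child_state, player=-1)

    child = Node(self.game, self.args, child_state, parent=self, action_taken=action)
    self.children.append(child)
    return child

# In the TicTacToe class:
def change_perspective(self, state, player):
    return state * player
```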
Awesome — the expansion part is done. After we have expanded, we want to do simulation. From the explanation earlier you might remember that we perform rollouts: we play random actions until we reach a terminal state, look at the final outcome — who won — and use that information for backpropagation. Nodes where the node's own player ends up winning after random play are more promising than ones where the opponent wins.

So here we write value = node.simulate(), and then we write the simulate method in the Node class. First we again check whether the node that was just expanded is terminal: value, is_terminal = self.game.get_value_and_terminated(self.state, self.action_taken), and again we flip the value to the opponent's perspective — whenever a node "has won", it is not because of the node itself but because of its parent, who took this action — so value = self.game.get_opponent_value(value). If is_terminal is true, we return this value.

In all other cases we perform the rollout. We copy our state, rollout_state = self.state.copy(), and set rollout_player = 1 at the start; here we do track a player, simply because we need to know at the end whether the final value has to be passed through get_opponent_value before we backpropagate it. Then, inside a while True loop, we sample an action exactly as before: valid_moves = self.game.get_valid_moves(rollout_state), then action = np.random.choice(np.where(valid_moves == 1)[0]). We use that action to get the next state, rollout_state = self.game.get_next_state(rollout_state, action, rollout_player), and check whether this new state is terminal: value, is_terminal = self.game.get_value_and_terminated(rollout_state, action). If it is terminal we return the value — but first we check whether we were player 1 when the game ended: if rollout_player == -1 we flip the value with self.game.get_opponent_value(value), so we don't report a positive return when it was actually the opponent who won, and then we return it just like above. In all other cases, when the game hasn't ended yet, we flip the player around with rollout_player = self.game.get_opponent(rollout_player), and the loop continues.
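A sketch of this rollout, again as a method on the Node class and using the game helpers mentioned above:

```python
def simulate(self):
    # If the node itself is terminal, the result belongs to the opponent
    # (the parent made the last move), so flip it before returning.
    value, is_terminal = self.game.get_value_and_terminated(self.state, self.action_taken)
    value = self.game.get_opponent_value(value)
    if is_terminal:
        return value

    # Otherwise play random legal moves until the game ends.
    rollout_state = self.state.copy()
    rollout_player = 1
    while True:
        valid_moves = self.game.get_valid_moves(rollout_state)
        action = np.random.choice(np.where(valid_moves == 1)[0])
        rollout_state = self.game.get_next_state(rollout_state, action, rollout_player)
        value, is_terminal = self.game.get_value_and_terminated(rollout_state, action)
        if is_terminal:
            # Flip the result if it was the opponent who finished the game.
            if rollout_player == -1:
                value = self.game.get_opponent_value(value)
            return value
        rollout_player = self.game.get_opponent(rollout_player)
```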
So now we can do simulation after expanding, and the final step is backpropagation. We call node.backpropagate(value), passing the value we got either from the terminal check or from the simulation, and write the method inside the Node class: def backpropagate(self, value). When we backpropagate, we first add the value to our value sum and count up our visit count by one, and then we keep propagating up all the way to the root. So we write self.value_sum += value and self.visit_count += 1. When we move up, we again have to remember that our parent is a different player, so we write value = self.game.get_opponent_value(value), and then, if self.parent is not None — which is always true except for the root node, where None is our placeholder — we call self.parent.backpropagate(value). So we have this recursive backpropagate method, and that is everything we need for MCTS.

Finally, after all searches have run and all visit counts and values have been backpropagated, we want to return the distribution of visit counts at the root. We create a new variable, action_probs — the probabilities of which actions look most promising — initialized as np.zeros with the action size of our game, self.game.action_size (for Tic Tac Toe, nine). Then we loop over the root's children, for child in root.children, and set action_probs[child.action_taken] = child.visit_count. To turn these counts into probabilities we divide by their sum so everything adds up to one, action_probs /= np.sum(action_probs), and then we return action_probs.
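A sketch of the backpropagation and of the visit-count distribution built at the end of MCTS.search, under the same assumptions as the snippets above:

```python
# In the Node class:
def backpropagate(self, value):
    # Accumulate the result and the visit, then pass the flipped value up,
    # since the parent is the opposing player.
    self.value_sum += value
    self.visit_count += 1
    value = self.game.get_opponent_value(value)
    if self.parent is not None:
        self.parent.backpropagate(value)

# At the end of MCTS.search: turn the root's visit counts into probabilities.
action_probs = np.zeros(self.game.action_size)
for child in root.children:
    action_probs[child.action_taken] = child.visit_count
action_probs /= np.sum(action_probs)
return action_probs
```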
So that's looking promising — let's test it. There was still a bit of invalid syntax left over; after removing it, we can try the search out in the little test script we used for the game. We create an MCTS object, mcts = MCTS(tictactoe, args), with Tic Tac Toe as the game, and we still need to define the arguments: for C, which we use in the UCB formula, we can roughly take the square root of two, which is what you would typically use, and for the number of searches we set 1000. We also pass these args into the MCTS.

Then, in our game loop, we only act ourselves if we are player 1; in all other cases — when the player is -1 in Tic Tac Toe — we run a Monte Carlo tree search: mcts_probs = mcts.search(...). Remember that we always want to be player 1 when we search, so first we write neutral_state = tictactoe.change_perspective(state, player); since the player here is always -1, this flips the perspective, and we pass this neutral state into mcts.search instead of the raw state. Out of the resulting probabilities we then pick an action, and to keep things easy we just take the most promising one with np.argmax(mcts_probs), which corresponds to the child that was visited the most.

Let's test this. After fixing a couple of small typos (self.parent in backpropagate, and looping over root.children rather than self.children when collecting the visit counts), it runs: the search plays in the middle, and if we deliberately play badly — we play, the MCTS responds, we play 7, the MCTS responds again — it lines up three -1s in a row and wins. So the MCTS is working well.
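For reference, a sketch of that test loop; the human-input branch for player 1 follows the simple console script built earlier in the video, so treat the exact prompt as illustrative:

```python
import numpy as np

tictactoe = TicTacToe()
args = {
    'C': 1.41,            # roughly sqrt(2)
    'num_searches': 1000,
}
mcts = MCTS(tictactoe, args)

state = tictactoe.get_initial_state()
player = 1
while True:
    if player == 1:
        valid_moves = tictactoe.get_valid_moves(state)
        action = int(input(f"valid moves: {np.where(valid_moves == 1)[0]} "))
    else:
        # The search always assumes it is player 1, so flip the board first.
        neutral_state = tictactoe.change_perspective(state, player)
        mcts_probs = mcts.search(neutral_state)
        action = np.argmax(mcts_probs)

    state = tictactoe.get_next_state(state, action, player)
    value, is_terminal = tictactoe.get_value_and_terminated(state, action)
    if is_terminal:
        print(state)
        print("draw" if value == 0 else f"player {player} won")
        break
    player = tictactoe.get_opponent(player)
```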
Okay, so now that we have our standalone Monte Carlo tree search implemented, we can start building the neural network, so that later the AlphaZero algorithm can train a model that understands how to play these games. Before we start building, let me briefly talk about the architecture.

First there is the state that we give as input to the network; in our case that is just a board position, for example a Tic Tac Toe board. We encode this state into three separate planes, as in the diagram: one plane for all fields in which player -1 has played (those fields become ones, everything else zeros), one plane for all empty fields, and one plane for all fields in which player +1 has played — again ones for those fields and zeros everywhere else. Having these three planes makes it much easier for the network to recognize patterns and understand how to play. Essentially we encode the board position so that it looks almost like an image: instead of RGB channels we have channels for player +1, the empty fields, and player -1.

Because our state looks like an image, we also want to use a network architecture that works well on images, which is why the network is built from convolutional blocks. The main part is the backbone, where most of the computation happens, and for the backbone we arrange the conv blocks in a ResNet (residual network) architecture — a design from a research paper that turned out to work very well for understanding images. What the ResNet architecture adds are skip connections: during the feed-forward pass through the backbone we don't only update x by pushing it through a conv block, we also store the value of x before the block as a residual, and the output is the sum of that residual and the block's output. This makes the model more flexible, because it can effectively mask out a conv block by leaving x unchanged and relying on the skip connection alone. So, as in the image from the paper, the main idea is to keep the input x as a residual and add it back onto the output of the conv blocks.

After the backbone, the architecture splits into two heads. First there is the policy head: it starts with a single conv block that takes the backbone output as input, then we flatten the result and put a linear (fully connected) layer on top. At the end we want nine neurons in the case of Tic Tac Toe — one neuron per possible action. Normally we would just get logits out of those neurons, so to get a readable distribution telling us where to play we additionally apply a softmax, which turns the nine raw outputs into a distribution of probabilities, each probability indicating how promising a given action is. That is why we have nine neurons here and why, in practice, we later call the softmax on top.
The second head is the value head. It also has its own single conv block (a different one from the policy head's) that takes the backbone output as input; again we flatten the output and add a linear layer, but this time we only want a single neuron, because the value head should tell us how good the given state is, as a float in the range of -1 to +1. The cleanest way to get that range is to apply a tanh activation on this last neuron — tanh squishes all values into the range from -1 to +1, which is exactly what we want. So ultimately we have one model that takes the encoded state as input and outputs, on one hand, the policy (the distribution telling us where to play) and, on the other hand, the value (the estimate of how good the state is).

Now we can build this inside our Jupyter notebook. The first thing is to import PyTorch, since that is the deep learning framework we use. I will also print np.__version__ so we always see it, then import torch and print torch.__version__. This is my current torch version; for the loss we will use later — the multi-target cross-entropy loss — support was added to PyTorch only fairly recently, so I recommend this version or anything newer (if you have problems, just use this one). I also have CUDA support because of my NVIDIA GPU, but you could just as well use the CPU-only PyTorch build, depending on your setup. Additionally we import torch.nn as nn and torch.nn.functional as F (capital F).

Now we can create the model. I will add a new cell above the MCTS implementation and call the class ResNet, inheriting from nn.Module. In __init__ we take a game, num_resBlocks (the length of our backbone) and num_hidden (the hidden size for our conv blocks). Inside, the first thing is to call super().__init__(), and then we define the start block as an nn.Sequential.
The start block begins with a conv layer, nn.Conv2d: the input channel size is 3, because we have the three encoded planes, the output size is num_hidden, the kernel size is 3 and the padding is 1 — with kernel size 3 and padding 1 this conv layer doesn't change the spatial shape of the state, so it stays at row count times column count, as if these were the pixels of an image. After the Conv2d we add a batch norm, nn.BatchNorm2d(num_hidden), which mainly speeds up training, and then an nn.ReLU(), which clips all negative values to zero and also makes training faster and more stable. So the start block is Conv2d, then BatchNorm2d, then ReLU.

Next we define self.backBone as an nn.ModuleList containing a list of residual blocks, as in the architecture diagram. For that we need a ResBlock class, also inheriting from nn.Module. In its __init__ we again take num_hidden and call super().__init__(). Each res block has a first conv layer, self.conv1 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1), a first batch norm, self.bn1 = nn.BatchNorm2d(num_hidden), a second conv layer, self.conv2 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1), and a second batch norm, self.bn2 = nn.BatchNorm2d(num_hidden).

Then we define the forward method, which takes self and the input x. Remember that inside a res block we want to keep the input for the skip connection, so we store residual = x. Then we update x separately: first self.conv1(x), then self.bn1 on top of that, and then a ReLU using the functional API, F.relu.
That gives us x at this point; then we feed it through the second conv block, updating x with self.bn2(self.conv2(x)). Now we add the residual back on, x += residual, apply F.relu once more, and return x. This gives us the nice residual connection: at the end we return the sum of the block's output and its input.

Now we can actually build the backbone: a ResBlock(num_hidden) for each i in range(num_resBlocks) inside the ModuleList. Next comes the policy head, self.policyHead, again an nn.Sequential. Inside we first have a conv block, nn.Conv2d with num_hidden input channels and 32 output channels, kernel size 3 and padding 1, followed by a batch norm and a ReLU. After this conv block we flatten the result with nn.Flatten(), and then we add the linear layer, nn.Linear: the input size is the 32 planes multiplied by game.row_count and game.column_count — remember that with kernel size 3 and padding 1 our conv layers never change the spatial shape of the state, so the flattened size is just the number of hidden planes (32) times rows times columns — and the output size is game.action_size.

Then we define the value head, self.valueHead, also an nn.Sequential: first nn.Conv2d with num_hidden input channels and 3 output channels, kernel size 3 and padding 1, then a BatchNorm2d with 3 channels, then a ReLU, then a Flatten. At the end we add a linear layer with 3 * game.row_count * game.column_count inputs and just one output, because we only want one neuron here, followed by the nn.Tanh() activation.

Finally we write the forward method for the ResNet class: it takes the input x, sends it through the start block, loops over all res blocks in the backbone (x = resBlock(x) for each), then computes policy = self.policyHead(x) and value = self.valueHead(x), and returns both.
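Putting the whole model together, here is a sketch of the ResNet and ResBlock classes as described; it assumes the TicTacToe class exposes row_count, column_count and action_size as used earlier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNet(nn.Module):
    def __init__(self, game, num_resBlocks, num_hidden):
        super().__init__()
        # Input: 3 encoded planes (player -1, empty, player +1).
        self.startBlock = nn.Sequential(
            nn.Conv2d(3, num_hidden, kernel_size=3, padding=1),
            nn.BatchNorm2d(num_hidden),
            nn.ReLU(),
        )
        # Backbone: a stack of residual blocks.
        self.backBone = nn.ModuleList(
            [ResBlock(num_hidden) for i in range(num_resBlocks)]
        )
        # Policy head: one logit per possible action.
        self.policyHead = nn.Sequential(
            nn.Conv2d(num_hidden, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * game.row_count * game.column_count, game.action_size),
        )
        # Value head: a single tanh-squashed neuron.
        self.valueHead = nn.Sequential(
            nn.Conv2d(num_hidden, 3, kernel_size=3, padding=1),
            nn.BatchNorm2d(3),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3 * game.row_count * game.column_count, 1),
            nn.Tanh(),
        )

    def forward(self, x):
        x = self.startBlock(x)
        for resBlock in self.backBone:
            x = resBlock(x)
        policy = self.policyHead(x)
        value = self.valueHead(x)
        return policy, value


class ResBlock(nn.Module):
    def __init__(self, num_hidden):
        super().__init__()
        self.conv1 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(num_hidden)
        self.conv2 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(num_hidden)

    def forward(self, x):
        # Keep the input and add it back after the two conv blocks (skip connection).
        residual = x
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
        x += residual
        x = F.relu(x)
        return x
```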
Let's run that and test that it works. I'll create a new cell: first a game of Tic Tac Toe (an instance of the TicTacToe class), then a state via tictactoe.get_initial_state(). We update the state a couple of times with tictactoe.get_next_state — say we play position 2 as player -1 and position 7 as player +1 — and print the state to see the board.

Next, remember that we have to encode the state before we give it to the model, so we write a new get_encoded_state(self, state) method in the game class. We want the three planes, and we get them with encoded_state = np.stack((state == -1, state == 0, state == 1)): the plane of all fields equal to -1, the plane of all empty fields (equal to 0), and the plane of all fields equal to +1. These comparisons give booleans, but we want floats, so we add .astype(np.float32), and then we return the encoded state.

Let's test it: encoded_state = tictactoe.get_encoded_state(state), and print it. We see the plain state and below it the encoded state: first the plane for the field where player -1 played (encoded as a 1 there), then the plane of empty fields, and then the plane for the field where player +1 played.
Now we want to get the policy and the value for this state. First we turn the state into a tensor: tensor_state = torch.tensor(encoded_state). When we give a tensor to our model we always need a batch dimension, but here we only have one state rather than a whole batch, so we unsqueeze at axis 0, which essentially wraps one more set of brackets around the encoded state so it can be passed through the model.

We also need to define the model itself: model = ResNet(tictactoe, 4, 64) — Tic Tac Toe as the game, 4 res blocks, 64 hidden channels. Now we can write policy, value = model(tensor_state) and process the outputs. For the value we just want the float, which we get with value.item(), since the output is a tensor holding a single number. For the policy we write policy = torch.softmax(policy, dim=1): we apply the softmax on axis 1, the axis of our nine neurons, not on the batch axis. Then we squeeze axis 0 again to remove the batch dimension, detach it so there is no gradient attached when we convert it to NumPy, move it to the CPU, and call .numpy().

Let's print both the value and the policy. We get a value of about 0.3 for this state, plus a policy distribution telling us where to play. To make it a bit more visual, we import matplotlib.pyplot as plt and draw plt.bar(range(tictactoe.action_size), policy), then plt.show(). Each action gets a bar indicating how promising it is; the model is only randomly initialized, so we can't expect much yet, but once we have a trained model we can expect a meaningful distribution of bars telling us where to play.
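As a compact reference, a sketch of this test cell together with the get_encoded_state helper; the two example moves are only illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
import torch

# In the TicTacToe class: encode the board as three binary planes.
def get_encoded_state(self, state):
    encoded_state = np.stack(
        (state == -1, state == 0, state == 1)
    ).astype(np.float32)
    return encoded_state

# Test cell: push one encoded state through the untrained network.
tictactoe = TicTacToe()
state = tictactoe.get_initial_state()
state = tictactoe.get_next_state(state, 2, -1)   # illustrative moves
state = tictactoe.get_next_state(state, 7, 1)

encoded_state = tictactoe.get_encoded_state(state)
tensor_state = torch.tensor(encoded_state).unsqueeze(0)  # add the batch axis

model = ResNet(tictactoe, 4, 64)
policy, value = model(tensor_state)
value = value.item()
policy = torch.softmax(policy, dim=1).squeeze(0).detach().cpu().numpy()

print(value)
print(policy)
plt.bar(range(tictactoe.action_size), policy)
plt.show()
```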
So now we have the model built, and next we can start incorporating the neural network into the Monte Carlo tree search. Before we do, one small change: let's set a seed for torch so we get reproducible results and you can check that what you build matches what I build here — torch.manual_seed(0).

Then we update the MCTS class. The first change is that it now takes a model as input (later this will just be our ResNet), so we store self.model = model. There are two key changes to the search itself. First, after we have reached a leaf node, we predict a policy and a value from the state of that leaf, and we use the policy when we expand, incorporating each action's probability into the UCB formula — so during the selection phase we are more likely to choose children with higher policy values, the ones that seemed promising to our model, and we walk down the tree in the directions the model guides us. Second, we use the value from the model to backpropagate once we reach a leaf, which means we can remove the simulation part entirely: no more rollouts with random actions.

So at the leaf we call policy, value = self.model(...), passing in the node's state — but remember we have to encode it first into the three planes, as in the last checkpoint: self.game.get_encoded_state(node.state). Then we turn the encoded state into a tensor rather than leaving it as a NumPy array, so we can feed it to the model directly: torch.tensor(self.game.get_encoded_state(node.state)), and we unsqueeze at axis 0, because we always need a batch axis and we currently aren't batching any states — we just create a "batch" containing this single state. That gives us a policy and a value as results.

Remember from the model checkpoint that the policy currently consists of raw logits (nine floats), and we want a distribution of likelihoods, so we write policy = torch.softmax(policy, dim=1), applying the softmax on the axis of the nine neurons rather than the batch axis. After the softmax we squeeze axis 0 again to remove the batch dimension — reversing the unsqueeze we did before feeding the state through the model — then call .cpu() (in case we use a GPU later) and .numpy().
We should also put the @torch.no_grad() decorator on top of the search method: we don't use these policies and values for training here, only for prediction, so there is no need to store gradients for these tensors, and skipping them makes the search faster to run.

So now we have our policy, converted to NumPy with the softmax applied, but we still want to mask out all of the illegal moves — when we expand, we don't want to expand towards positions where a player has already played. We get the valid moves with valid_moves = self.game.get_valid_moves(node.state), using the state of our leaf node, and multiply the policy by them, so every illegal move now has a probability of zero. Then we rescale the policy so it is a proper distribution again, dividing it by its own sum so it adds up to one. For the value we just want the float from the single neuron of the value head, which we get with value = value.item() — in PyTorch, calling .item() on a tensor that holds a single number gives you that number back.

Now we use the value for backpropagation and the policy for expanding: we call node.expand(policy), and we delete the simulation call, so that when we reach a leaf and expand, it is the model's value that gets backpropagated. There are a few more things we can remove. The simulate method can be deleted entirely. Also, when we call expand we now immediately expand in all possible directions, because we have the full policy — it no longer makes sense to expand only one child per visit — which means we don't need the expandable_moves array anymore; legality is handled through get_valid_moves instead. And we can simplify is_fully_expanded: it's now enough to check whether len(self.children) > 0, because when we expand once, we expand in every direction, so having any children at all means there is nothing left to expand.
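Here is a sketch of how the updated search method looks with the network plugged in, using the expand and UCB changes described next; it assumes the same Node, game, and args names as before:

```python
class MCTS:
    def __init__(self, game, args, model):
        self.game = game
        self.args = args
        self.model = model

    @torch.no_grad()
    def search(self, state):
        root = Node(self.game, self.args, state)

        for search in range(self.args['num_searches']):
            node = root
            # Selection: walk down while the node is fully expanded.
            while node.is_fully_expanded():
                node = node.select()

            value, is_terminal = self.game.get_value_and_terminated(node.state, node.action_taken)
            value = self.game.get_opponent_value(value)

            if not is_terminal:
                # Let the network evaluate the leaf instead of doing a rollout.
                policy, value = self.model(
                    torch.tensor(self.game.get_encoded_state(node.state)).unsqueeze(0)
                )
                policy = torch.softmax(policy, dim=1).squeeze(0).cpu().numpy()
                # Mask out illegal moves and renormalize to a distribution.
                valid_moves = self.game.get_valid_moves(node.state)
                policy *= valid_moves
                policy /= np.sum(policy)
                value = value.item()

                node.expand(policy)

            node.backpropagate(value)

        # Visit counts at the root become the returned action probabilities.
        action_probs = np.zeros(self.game.action_size)
        for child in root.children:
            action_probs[child.action_taken] = child.visit_count
        action_probs /= np.sum(action_probs)
        return action_probs
```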
Next we update the Node.expand method itself. It now takes the policy as input, and we loop over it: for action, prob in enumerate(policy). For each action we check whether its probability is greater than zero — if prob > 0 — and if so we create the new child exactly as before and append it to our children (the two lines for sampling and marking expandable moves can go). Remember, though, that we want to use this probability inside the UCB formula later, during selection, so we store it on the node: we add a new prior argument to Node's __init__, defaulting to 0, and set self.prior = prior. The prior is simply the policy probability the parent saw for this action when it expanded, so when we create the child we pass prob as the prior at the end.

We also have to update the UCB formula, since AlphaZero uses a different one from standard MCTS, as shown in the image: we remove the math.log, wrap the square root directly around the parent's visit count, add 1 to the child's visit count in the denominator, and multiply the whole exploration term by the child's prior, so the policy also guides us as we select our way down the tree. One more detail: because we no longer backpropagate immediately from a node that was just created during expansion, it is now possible to call get_ucb on a child with a visit count of zero, and then we can't compute a Q value — so in that case we simply set the Q value to zero; otherwise we use the same normalized, flipped Q value as before.
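A sketch of these two updated Node methods, assuming __init__ now accepts prior=0 and stores it as self.prior:

```python
# In the Node class (model-guided version):
def expand(self, policy):
    # Create a child for every action the policy assigns non-zero probability to.
    for action, prob in enumerate(policy):
        if prob > 0:
            child_state = self.state.copy()
            child_state = self.game.get_next_state(child_state, action, 1)
            child_state = self.game.change_perspective(child_state, player=-1)
            child = Node(self.game, self.args, child_state, self, action, prob)
            self.children.append(child)

def get_ucb(self, child):
    # A freshly expanded child has no visits yet, so its Q value defaults to 0.
    if child.visit_count == 0:
        q_value = 0
    else:
        q_value = 1 - ((child.value_sum / child.visit_count) + 1) / 2
    # AlphaZero variant: no log, +1 on the child's visits, prior scales exploration.
    return q_value + self.args['C'] * (
        math.sqrt(self.visit_count) / (child.visit_count + 1)
    ) * child.prior
```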
That should be working now, so let's see if we've got everything ready and run it. First we update C to 2, and then we create a model: model = ResNet(tictactoe, 4, 64) — four res blocks, 64 hidden planes — and we also call model.eval(). The model has only been initialized randomly, so we can't expect too much, but let's try it — oh, we were still missing the model argument when constructing the MCTS, so we pass it in. Now we can see that with the updated MCTS the search plays a move; we respond on 1, the search plays again, we play another (deliberately poor) move, and the search plays once more: player -1 has won the game. So our updated, model-guided Monte Carlo tree search is ready.

Next we can actually start building the main AlphaZero algorithm. We create an AlphaZero class — I'll define it just below the updated MCTS — and in __init__ it takes a model, an optimizer (a PyTorch optimizer such as Adam, which we will use for training), a game, and some additional args. We assign all of these: self.model = model, self.optimizer = optimizer, self.game = game and self.args = args, and at the end we also create a Monte Carlo tree search object inside the AlphaZero class.
When we instantiate that search we pass in the game, the args and the model: self.mcts = MCTS(game, args, model). So that's the standard init of our AlphaZero class.

Next we define the methods inside. From the overview you might remember that there are two components — the self-play part and the training part — with the AlphaZero loop wrapped around them. So we create a selfPlay method and a train method, both just passing for now, and then the learn method, the main method we call to start the cycle of continuous learning: run self-play, gather data, use that data for training to optimize the model, and then run self-play again with a model that is smarter.

In learn we first loop over all iterations: for iteration in range(self.args['num_iterations']). For each iteration we create a memory list — essentially the training data for one cycle. Then we loop over our self-play games, for selfPlay_iteration in range(self.args['num_selfPlay_iterations']), and for each one we extend the memory with the data that comes out of selfPlay: memory += self.selfPlay(). We also switch the model into eval mode before self-play, so that things like batch norms don't behave as if we were training. Then we move on to the training part: self.model.train(), and for epoch in range(self.args['num_epochs']) we call self.train(memory), so train takes memory as an argument. At the end of an iteration we store the weights of our model with torch.save(self.model.state_dict(), ...), using an f-string path like model_{iteration}.pt, and we also save the optimizer's state dict to optimizer_{iteration}.pt.
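A sketch of the class skeleton and the learn loop as described so far (selfPlay and train are filled in below):

```python
class AlphaZero:
    def __init__(self, model, optimizer, game, args):
        self.model = model
        self.optimizer = optimizer
        self.game = game
        self.args = args
        self.mcts = MCTS(game, args, model)

    def selfPlay(self):
        pass  # implemented next

    def train(self, memory):
        pass  # implemented next

    def learn(self):
        for iteration in range(self.args['num_iterations']):
            memory = []

            # Gather training data by letting the current model play itself.
            self.model.eval()
            for selfPlay_iteration in range(self.args['num_selfPlay_iterations']):
                memory += self.selfPlay()

            # Optimize the model on the gathered data.
            self.model.train()
            for epoch in range(self.args['num_epochs']):
                self.train(memory)

            # Checkpoint the model and optimizer after every iteration.
            torch.save(self.model.state_dict(), f"model_{iteration}.pt")
            torch.save(self.optimizer.state_dict(), f"optimizer_{iteration}.pt")
```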
Now that the main loop is ready, let's focus on the selfPlay method. Inside we again define a memory list — the memory of this single self-play game — set player = 1 at the start, and get the initial state with self.game.get_initial_state(). Then we have a while True loop: on every step we first run the Monte Carlo tree search on the current state to get the action probabilities, then sample an action from that distribution, play it to get a new state, and check whether that state is terminal. If the game has ended, we return all of the collected data, structured as tuples where each instance holds a state, the MCTS action probabilities, and the final outcome — whether the player who moved at that instance turned out to be the player who won, lost, or drew the game.

So inside while True, remember that when we call mcts.search we always want to be player 1, so first we get the neutral state with self.game.change_perspective(state, player), and then action_probs = self.mcts.search(neutral_state). We store this information in the memory so we can later build the training data from it: memory.append((neutral_state, action_probs, player)) — a tuple of the neutral state, the action probabilities, and the player.

Next we sample an action from the action probabilities with np.random.choice: the number of options is the action size of our game (nine for Tic Tac Toe), so np.random.choice(self.game.action_size, p=action_probs), using the MCTS probabilities for the sampling. Then we play that action: state = self.game.get_next_state(state, action, player), and check whether the updated state is terminal: value, is_terminal = self.game.get_value_and_terminated(state, action). If it is terminal, we want to return the data: we build a new list, returnMemory, and loop over all of the instances inside our memory (the local memory list, not a self.memory attribute).
So we write for hist_neutral_state, hist_action_probs, hist_player in memory, and for each instance we work out the final outcome at that point, because that is what we need to store in the training data. The outcome, hist_outcome, equals the value we just received if hist_player is the same player we were when we reached that value — if we were player +1 when we played here and indeed won, we get +1, and every instance where player +1 moved gets +1, while every instance where player -1 moved gets -1. So hist_outcome = value if hist_player == player, and otherwise the negated value. Then we append this information to returnMemory, again as a tuple: the neutral state — which we can already encode here, since that is the form the model needs, so self.game.get_encoded_state(hist_neutral_state) — then the action probabilities from the MCTS distribution, and then hist_outcome. We append to returnMemory rather than returning immediately, and only once we have looped through everything do we return returnMemory. In all other cases, when we haven't reached a terminal state yet, we flip the player: player = self.game.get_opponent(player).

There are still a couple of minor things to tidy up. First, let's make the code more general by replacing the plain negated value with the value as perceived by the opponent of whatever game we plug into our AlphaZero implementation, so that it would also work for one-player games: hist_outcome = value if hist_player == player else self.game.get_opponent_value(value). It's simply the more general way to express it.
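Collecting the pieces, a sketch of the selfPlay method as just described:

```python
def selfPlay(self):
    memory = []
    player = 1
    state = self.game.get_initial_state()

    while True:
        # The search always runs from player 1's perspective.
        neutral_state = self.game.change_perspective(state, player)
        action_probs = self.mcts.search(neutral_state)

        memory.append((neutral_state, action_probs, player))

        # Sample a move from the MCTS distribution and play it.
        action = np.random.choice(self.game.action_size, p=action_probs)
        state = self.game.get_next_state(state, action, player)

        value, is_terminal = self.game.get_value_and_terminated(state, action)
        if is_terminal:
            returnMemory = []
            for hist_neutral_state, hist_action_probs, hist_player in memory:
                # Positions played by the eventual winner get +1, the loser's get -1.
                hist_outcome = value if hist_player == player else self.game.get_opponent_value(value)
                returnMemory.append((
                    self.game.get_encoded_state(hist_neutral_state),
                    hist_action_probs,
                    hist_outcome,
                ))
            return returnMemory

        player = self.game.get_opponent(player)
```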
The next thing is that we want to visualize these loops, and we can do that with progress bars from the tqdm package: from tqdm.notebook import trange, and then we simply replace the range calls in learn with trange — just a small difference.

Now let's check that the AlphaZero implementation works as far as we have built it. To create an instance of AlphaZero we need a model, an optimizer, a game and some arguments. First we create the game, tictactoe = TicTacToe(), then the model, model = ResNet(tictactoe, 4, 64) — four res blocks and a hidden size of 64. For the optimizer we use the Adam optimizer built into PyTorch: optimizer = torch.optim.Adam(model.parameters(), lr=0.01), with the model's parameters and a standard learning rate for now. Then we define the args as another dictionary: for the exploration constant C we again choose 2, for the number of searches 60, for the number of iterations at the highest level 3 for now, for the number of self-play iterations — the number of self-play games per iteration — 500 (though let's lower that to 10 for a first run, just to confirm the models get saved), and for the number of epochs 4.

Now we create the AlphaZero instance with the model, the optimizer, the game and the args, and run alphaZero.learn(). Nice — we get a good-looking progress bar and can see the self-play games running.
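A sketch of that setup cell; the batch_size entry is the one the video adds a little later, when the train method first needs it:

```python
import torch
from tqdm.notebook import trange  # swap range for trange in learn to get progress bars

tictactoe = TicTacToe()
model = ResNet(tictactoe, 4, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

args = {
    'C': 2,
    'num_searches': 60,
    'num_iterations': 3,
    'num_selfPlay_iterations': 500,   # lowered to 10 for a quick first run
    'num_epochs': 4,
    'batch_size': 64,                 # used by train, defined a bit later in the video
}

alphaZero = AlphaZero(model, optimizer, tictactoe, args)
alphaZero.learn()
```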
Great — now we can implement the train method inside the AlphaZero class. The first thing we always do when train is called is shuffle the training data, so we don't get the same batches every time: random.shuffle(memory), importing random at the top. Then we loop over the memory in batches, so that for each batch index we can pull out a whole batch of samples and train on them: for batchIdx in range(0, len(memory), self.args['batch_size']). The sample is the slice of memory starting at batchIdx and ending at batchIdx + self.args['batch_size'] — but we have to be careful not to index past the end of the memory, so we cap the upper bound with min(len(memory) - 1, batchIdx + self.args['batch_size']), which keeps us inside the limit.

Now we pull the states, the MCTS distributions and the final rewards out of the sample: state, policy_targets, value_targets = zip(*sample). What this does is transpose the sample: currently we have a list of tuples, each containing a state, a policy target and a value, but by unpacking with the asterisk and zipping we get a list of states, a list of policy targets and a list of value targets instead. Each of these is a list full of NumPy arrays, and it's much more convenient to have them as single NumPy arrays, so we write state, policy_targets, value_targets = np.array(state), np.array(policy_targets), np.array(value_targets). For the value targets, remember that this is currently a flat array of floats; it's more useful to have each value in its own sub-array so it can be compared with the model's output later, so we call .reshape(-1, 1), with -1 for the batch axis and 1 for the value itself.

Then we turn all three into tensors: state = torch.tensor(state, dtype=torch.float32) — even though the encoded state might still be integers at this point, it's good to fix the dtype here — policy_targets = torch.tensor(policy_targets, dtype=torch.float32), and value_targets = torch.tensor(value_targets, dtype=torch.float32). Everything is ready, and the next step is to get out_policy and out_value from the model by letting it predict on these states.
So now all of our targets are ready, and the next step is to get out_policy and out_value from our model by letting it predict on the state, so we write out_policy, out_value = self.model(state). Now we want a loss for the policy and a loss for the value. The policy loss, as you might remember, is a multi-target cross-entropy loss, so we write F.cross_entropy with out_policy first and then policy_targets. For the value loss we use the mean squared error, so value_loss = F.mse_loss with out_value first and then value_targets. Then we take the sum of both losses to get a single loss: loss = policy_loss + value_loss. Now we want to minimize this loss by backpropagating: first optimizer.zero_grad(), then loss.backward(), then optimizer.step(). This way PyTorch does all the backpropagation for us and actually optimizes the model, so we end up with a better model that we can then use during self-play.
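The training step just described, as a sketch that would sit inside the batch loop above (using self.optimizer, which is the corrected form mentioned near the end of the video):

```python
import torch.nn.functional as F

# Policy head: multi-target cross-entropy; value head: mean squared error.
out_policy, out_value = self.model(state)

policy_loss = F.cross_entropy(out_policy, policy_targets)
value_loss = F.mse_loss(out_value, value_targets)
loss = policy_loss + value_loss

self.optimizer.zero_grad()   # clear old gradients
loss.backward()              # backpropagate the combined loss
self.optimizer.step()        # update the network weights
```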
Let's test this out: I'll set num_selfPlay_iterations back to 500 and run it for the number of iterations we have here, and afterwards we can see what the model has learned. Feel free to train along, but you don't have to — we will make it more efficient later on, so we can also just test it on my machine. Oh, I forgot to define a batch size, so let's pick 64 as our default batch size and run the cell. We get these nice progress bars during training; I trained this and it took roughly 50 minutes using just the CPU of my machine. So now we have trained for a total of three iterations, and remember that we save our model at each iteration because of the saving code at the end of learn. Now let's check what the trained neural network understands about the game: we move up to the cell where we tested our randomly initialized model, and there, where we define the model, we load the saved state dict with model.load_state_dict(torch.load('model_2.pt')) — iteration 2, since we want to check the last iteration, with .pt as the file ending — and let's also put the model into eval mode. Running this, for the first state we get a distribution over where to play: the model tells us that as player +1 we should either play in the middle or in the corner field. Let's also test it on another state, where player -1 has played on fields 2 and 4 and player +1 has played on fields 6 and 8. First we copy this board state from the image, then we encode it, and then we get the distribution over where to play. The network now correctly tells us that we should play on position 7, because we would win there, and the value we get is close to +1, because we would simply win by playing there — so the network has a high estimate of this position for us as player +1. This is even better than the distribution we had before, and remember, this is just what the neural network as a mathematical function puts out — we never ran a Monte Carlo tree search to get this result. So we can clearly see that the neural network has internalized the Monte Carlo tree search by playing with itself, and now we can just call it with a given state and get this neat distribution of where to play, plus a float telling us whether the given state is a good one for us as a player. Great — now we can apply some tweaks to our AlphaZero algorithm. The first tweak is that we also want to be able to use a GPU as our device, since that can be much faster for training depending on your setup. We add GPU support by declaring a device: it should be torch.device('cuda') — so an Nvidia GPU — if torch.cuda.is_available(), and otherwise CPU like before. We then pass this device to our model as an argument, which means we also have to add it to our ResNet class: we store it as self.device = device, and we call self.to(device) so that the model is moved onto the GPU if, for example, we have CUDA support on our machine. That's the first part; now we also want to use this device during self-play and during training. During self-play we use our model when we get the action probabilities inside the MCTS search — that is, when we get the policy and the value — so the tensor we feed in there also has to live on the device we declared, which we do by setting its device to self.model.device.
Now we also want GPU support during training, so we move to the train method: first we move the state to our device, since it is the input to our model, by setting device=self.model.device, and we do the same for the policy targets and the value targets, since we compare them to the outputs of our model — so they also get device=self.model.device, both here and below. So now we have GPU support as well. There are several more tweaks: the next one is that we also want to add weight decay, since AlphaZero has an L2 regularization term in its loss, as you can see on the image, so we add weight_decay to the optimizer and set it to 0.001. With that, weight decay and GPU support are added to AlphaZero.
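A small sketch of the device handling and weight decay just described — the ResNet class is assumed here to take the device as its last argument and move itself there with self.to(device):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = ResNet(tictactoe, 4, 64, device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.001)

# Inside the MCTS search and the train method, tensors are then created on the
# model's device, e.g.:
# torch.tensor(game.get_encoded_state(state), device=model.device).unsqueeze(0)
```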
Next we also want to add a so-called temperature. Remember that when we sample an action from the action probabilities, we currently just use the visit-count distribution over the children of our root node as the sampling probabilities. We would like to be more flexible: sometimes we want to explore more, i.e. also take actions that haven't been visited that often, and sometimes we want to exploit more and tend to always pick the actions that look most promising. We get this flexibility with temperature action probabilities: we tweak the visit-count distribution and then use the tweaked distribution as the sampling probabilities. So we write temperature_action_probs, which we get by raising the old action probabilities to the power of 1 divided by self.args['temperature'], and we also have to define temperature in the args — let's set it to 1.25 here. What this does for a temperature larger than one is that the exponent gets smaller, which squishes all of our probabilities closer together, so we do more exploration: the higher the temperature goes towards infinity, the closer we get to sampling completely random actions from a uniform distribution. On the other hand, if the temperature moves down towards zero (close to zero, not exactly zero), the exponent grows, and it becomes as if we just took the argmax of the distribution — pure exploitation rather than exploration. So this temperature gives us the flexibility to decide whether we'd rather take more random actions and explore, or exploit and take the actions that look most promising at the current time.
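As a sketch, the temperature trick looks like this, where action_probs is the visit-count distribution over the root's children; the renormalization before sampling is added here so the probabilities sum to one:

```python
import numpy as np

temperature_action_probs = action_probs ** (1 / self.args['temperature'])
temperature_action_probs /= np.sum(temperature_action_probs)   # renormalize before sampling

action = np.random.choice(self.game.action_size, p=temperature_action_probs)
```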
Now we come to the final tweak we want to add to AlphaZero: we also want to apply some noise to the policy of the root node at the beginning of a Monte Carlo tree search. When we give our root node its policy we want to add some noise to it, so that we explore more and move in some directions we maybe haven't checked before, making sure we don't miss any promising actions. To add this noise to the root we first have to move the relevant statements to the top of the search method, so that we compute the root policy there, separately from the steps inside the loop. First we get a policy and a value by calling self.model on torch.tensor(self.game.get_encoded_state(state)) — the state here is just the state of our root — with device set to self.model.device so that we can use the GPU, and unsqueezed at the end so that we have a batch axis. Since we don't need the value at the top, let's assign it to a throwaway variable to keep this readable. Then we process the policy as before: policy = torch.softmax(policy, axis=1), so that we don't apply the softmax over the batch axis, then squeeze, move the tensor to the CPU and take the NumPy array out of it. Next we get the valid moves with self.game.get_valid_moves(state) and multiply the policy with them to mask out the illegal moves, and afterwards we would normally divide the policy by its own sum to turn it back into a probability distribution. But right here is where we want to add the noise, so that we do some more exploration during the Monte Carlo tree search. The way we add it is shown on this image: the updated root policy is the old policy multiplied by a coefficient smaller than one, plus the opposite coefficient multiplied by the noise. The epsilon value here is a float smaller than one, and the noise is Dirichlet random noise, which we get from np.random.dirichlet — it gives us a distribution of random values that we add to our policy, scaled by that coefficient, so that the policy incorporates some randomness and we explore more instead of always walking down the tree exactly the way our model tells us; at the beginning the model doesn't know much about the game, so we want to make sure we do some exploration as well. So let's write it: policy = (1 - self.args['dirichlet_epsilon']) * policy + self.args['dirichlet_epsilon'] * np.random.dirichlet([self.args['dirichlet_alpha']] * self.game.action_size). This alpha value changes the shape of the random distribution and depends on the number of actions in your game; for Tic-Tac-Toe an alpha of roughly 0.3 turns out to work well, but it may change depending on your environment. Let's also move the alpha onto the next line so it's more readable. And we have to move the masking part below this, so that we mask out the illegal moves after applying the random noise, not before — so let's just copy those lines down. Great, so now we have added the noise to our policy, and we can use this policy to expand our root node.
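A minimal sketch of the root-noise step just described, with the masking moved after the noise:

```python
import numpy as np

# Mix the model's root policy with Dirichlet noise, then mask and renormalize.
policy = (1 - self.args['dirichlet_epsilon']) * policy \
    + self.args['dirichlet_epsilon'] * np.random.dirichlet(
        [self.args['dirichlet_alpha']] * self.game.action_size)

valid_moves = self.game.get_valid_moves(state)   # mask only after adding the noise
policy *= valid_moves
policy /= np.sum(policy)
```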
So we call root.expand(policy). There's one more minor change we can make: currently, when we expand the root node at the beginning, we don't backpropagate immediately, so the root will have children but its visit count will still be zero. That means that when we select, the UCB formula is called with the parent's visit count being zero, and if the parent's visit count is zero the whole exploration term collapses to zero as well, so we wouldn't use our prior at all the first time we select a child. To make this better, we simply set the visit count of the root node to 1 at the beginning. So we pass visit_count=1 when we create the root, which also means we have to add visit_count as a parameter of the node — by default it should be 0, but for the root it should be 1 — so that we immediately use the prior information when we select a child at the start of the Monte Carlo tree search. Great, now we can run all of these cells again and see if it works. We also have to add dirichlet_epsilon to the args, set to 0.25, and dirichlet_alpha, which I set to 0.3 like I told you. Let's test it — okay, we still have a CPU device somewhere; the error is just because I hadn't run this cell, so we add a device there too — for this one let's just say torch.device('cpu') — and run it again. Perfect, now it's running. For the current Tic-Tac-Toe setup it shouldn't be much faster, but later on, when we use more complex games, GPU support will be much nicer, and these other tweaks should make our model more flexible and make it easier to reliably end up with a strong model. So I'll just train this again — feel free to train it as well, but you don't have to; I've uploaded all of the weights, and you can find them at the link in the video description. We'll train it, briefly check it again, and then extend our agent so that it can also play Connect 4 in addition to Tic-Tac-Toe. See you soon. Okay, I've trained everything, and now we want to check that our model has learned to play Tic-Tac-Toe. To do that we move up to the evaluation cell again and also define a device there: device = torch.device('cuda' if torch.cuda.is_available() else 'cpu').
We then use this device when we define our model, and also for the tensor state we feed into the model to predict the policy and the value — so we set device=device there as well — and we also have to use the device when we load the state dict from the file, by passing map_location=device. Now let's run this again: we get the nice distribution and the nice value again, so we can see that our model has learned. And by the way, we aren't even masking out the policy here — the model has learned by itself that those moves aren't great, because someone has already played on those fields, without us masking the illegal moves at all. That's nice, and now we can move on to training our AlphaZero on the game of Connect 4. Let's define a class for Connect 4, but before we do that, let's add a __repr__ method to our TicTacToe game: def __repr__(self), returning the string 'TicTacToe'. That string is also what we get when this object is used where a string representation is needed, which means we can use it below when we save our model at each iteration during learning: we simply add self.game to the filename, so the name of the game becomes part of the model and optimizer state dict files and we won't overwrite our Tic-Tac-Toe weights when we train on Connect 4. Now let's copy the game class over and use it to define Connect 4. For Connect 4 we have a row count of 6, a column count of 7, and the action size is simply the column count; let's also add a variable for the number of stones you need in a row to win, self.in_a_row = 4, and change the __repr__ string to 'ConnectFour'. The initial state can stay the same, since we just want a blank array with row_count rows and column_count columns. Next we have to change the get_next_state method, because here the action is just a column telling us where to play. The first step is to recover the row: we look at the given column, find all of the fields inside it that are equal to zero — i.e. that are currently empty — and take the deepest empty field in that column, because that's where the stone lands in Connect 4, and that becomes the row we play in. We get it with np.max(np.where(...)), since in a NumPy array the row index starts at zero at the top and increases as you move down, so we take np.max of np.where(state[:, action] == 0).
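As a sketch, the Connect 4 get_next_state method just described:

```python
import numpy as np

def get_next_state(self, state, action, player):
    # the action is a column; the stone falls to the deepest empty field in it
    row = np.max(np.where(state[:, action] == 0))
    column = action
    state[row, column] = player
    return state
```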
np.where gives us the indices of all the empty fields in that column, and np.max gives us the highest of those indices, which is where we'd play in Connect 4, because the stone just falls down until it reaches an already non-empty field. So now we have a row, and we effectively have a column as well because of the action, so we can write state[row, column] = player with the player we were given as input. Next we move on to get_valid_moves: here we only need to check the row at the very top and see which fields in that row are equal to zero — those are our valid moves. So we take state[0], the topmost row, and the result should be 1 where the move is valid and 0 otherwise, meaning the move on that action would be illegal. Now we want to check for a win, and frankly I'm just going to copy this method over: it does the same kind of checking as in Tic-Tac-Toe, where we first check the vertical possibility of a win, then the horizontal one, and then the two diagonals, based on the action that was given as input — it's just a bit more involved than Tic-Tac-Toe, because we have to walk in both directions and keep counting whether the stones in those directions are equal to our own player sign. The get_value_and_terminated method can stay the same for both games, get_opponent stays the same, get_opponent_value stays the same, change_perspective stays the same, and get_encoded_state is also the same. Oh, and I noticed a small mistake: in get_next_state we can't reference column before defining it, since we derived it from the action, so let's just replace column with action there.
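And the Connect 4 get_valid_moves check, as a sketch — a column is playable as long as its topmost field (row 0) is still empty:

```python
import numpy as np

def get_valid_moves(self, state):
    return (state[0] == 0).astype(np.uint8)
```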
Now we can check that this actually works in practice, by making our code more general: in the test cell we replace TicTacToe with game everywhere, so there are no remaining references to Tic-Tac-Toe, and then we swap in the game of Connect 4. Let's also set the number of searches to something small — say 100 — just so we can validate things quickly, and let's increase the size of our model to 9 ResBlocks and a hidden dim of 128, so that we have a larger model for Connect 4 when we later train it. We're also missing the device argument, so we add torch.device('cuda' if torch.cuda.is_available() else 'cpu') and use this device inside our model class. Let's run it again — perfect, now we have the game of Connect 4. Say we want to play at position 4 — sorry, 5 — and we still have a problem: dirichlet_epsilon isn't set here either, so let's copy those argument values over, but set dirichlet_epsilon to 0 so that we don't use noise here. Okay, I've played here and the model has played there; let's play at 1 — the model played 0 again; let's play at 2 — the model played 0 again; and then we can just play at 3, and we've won. So at least we've checked that the game of Connect 4 is working now, but obviously our model is terrible: we aren't doing many searches here and we haven't trained anything yet. Let's change that and also train a model for the game of Connect 4, then test the results again. We only need some minor changes: first we make the training cell more general as well, writing game and setting it to ConnectFour, and using game instead of the hard-coded class when creating the ResNet; and because Connect 4 is harder to train, let's change the number of ResBlocks to 9 and the number of hidden dimensions to 128. We also have to pass game into the AlphaZero initialization. Then we can slightly tweak the hyperparameters: 600 for the number of searches, something like 8 for the number of iterations for now, and we can increase the batch size to 128. Then we just run it. This will take a lot longer — we will increase its efficiency later on — so feel free to just use the weights I've uploaded in the description below, since on my machine this currently takes a couple of hours. See you soon. Okay, I've now trained AlphaZero on the game of Connect 4 using these arguments, and frankly I only trained it for one iteration, since even that single iteration took several hours. We'll increase the efficiency next, so that training is much faster even for these rather complex environments, but it was still worth training this one iteration, because now we can evaluate the model and find out whether it learned anything at all. So let's move to the evaluation cell: when we create the model there, we also want to load the state dict from the weights we saved after that one training iteration, so in the line where we create the model we write model.load_state_dict, and inside it torch.load with the path of the saved weights.
For the path we use 'model_0_ConnectFour.pt' — iteration 0, since we start counting at zero and have only trained for one iteration — with ConnectFour as the game and .pt as the file ending, and let's also pass map_location=device, so that if your device changed between training and evaluation you can still load the state dict. Let's run the cell and play 6 — the model played in the middle; let's play 6 again — the model defends that position; let's play 0 — the model builds up a tower in the middle; play 0 again — the model keeps building the tower; play 0 again — and the model finishes its tower and wins the game. So the model still isn't playing perfectly, and obviously I only played in the corners to probe its basic intuition about the game, but it is clearly much better than the initial model, which only ever wanted to play in a single column. Now we want to increase the efficiency of our AlphaZero implementation. The main part is completely done at this point and you should have an intuitive understanding of how the algorithm works, but like I said, it would be nice to increase the training speed, and especially the speed during self-play. How can we do that? We want to parallelize as much of our AlphaZero implementation as possible, because neural networks are built in a way that lets us batch up our states and get parallel predictions for the policy and the value, especially while running the self-play loop. One way you could parallelize this implementation is with a package like Ray, which would run independent workers and let you harness all of your GPU power that way, but I'd rather build the parallelized version in a more Pythonic way, inside our Jupyter notebook again. So we will batch up all of the states, get policy and value predictions, and then distribute them back to the self-play games, so that we play several self-play games in parallel — this drastically increases the speed of the implementation. The first thing we do is add a num_parallel_games argument to our dictionary, and for now I set it to 100. The idea is that we play all of these self-play games in parallel, and whenever we reference our model we batch up all of the states inside the MCTS search, which drastically reduces the number of times we call the model in the first place: on one hand we fully utilize our GPU capacity, and on the other hand we simply call the model far less often, which gives us much higher speed. To update the implementation, we first copy the AlphaZero class over into a new class, which I'll call AlphaZeroParallel, and then
we also want to update the Monte Carlo tree search, so I'll copy that class over as well and call it MCTSParallel. Now we have our parallel AlphaZero and MCTS classes, and we want to change the implementation of the selfPlay method and, because of that, the implementation of the search method too. We also want a new class that stores the information of a single self-play game, so that we can keep that information outside the loop when we iterate over the games. Let's create a class I'll call SPG, for self-play game, with an __init__ that takes a game — Tic-Tac-Toe or Connect 4, for example. First it holds a state, which initially is just the initial state of the game, so self.state = game.get_initial_state(); then a memory, which starts as an empty list; then a root, set to None at the beginning; and a node, also None at the beginning. That's everything we need for this class. Now we update the selfPlay method. First of all we want to call it less often, since every call now plays 100 potential games in parallel, so when we loop over num_selfPlay_iterations we divide it by self.args['num_parallel_games']; also, when constructing the self-play game objects I forgot the brackets, so let's add them. Inside the method we can remove the single state variable, since each self-play game stores its own state, and we rename memory to return_memory, because it is no longer the memory in which we initially collect the states, action probabilities and players, but rather the list we return at the end. The player can stay exactly the same: even though we play games in parallel, we always flip the player only after we've looped over all of our games, so all parallel games are always at the same player — they all start at player one and we flip them all together. Now we also store our self-play games: I'll call the list spGames, and it should contain one self-play game per parallel game, each created from the SPG class we defined below.
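The SPG container just described might look like this:

```python
class SPG:
    def __init__(self, game):
        self.state = game.get_initial_state()
        self.memory = []      # (state, action_probs, player) tuples for this game
        self.root = None      # root node of the current MCTS search
        self.node = None      # node selected for expansion in the current search step
```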
Into each SPG we pass self.game from our AlphaZero class, and we create one self-play game for each index in range(self.args['num_parallel_games']), so we build this list and give each game the game object. Next we change the while True loop, because that would stop after a single game is finished; instead we want to keep running the selfPlay method until all of our self-play games are finished, which we can check with while len(spGames) > 0 — we'll implement it so that a self-play game is removed from spGames once it's finished, so checking whether anything is left in the list is enough. The next step, inside the loop, is to get all of the states of our self-play games: first as a list, [spg.state for spg in spGames], and then we turn that list of NumPy arrays into one big NumPy array with np.stack. Then we also want the neutral states: we rename the variable, and when we change the perspective we just pass states instead of state, so we change the perspective for all states in one function call — this also helps efficiency, since changing the perspective works the same way for several states, because we simply multiply our values by -1 when the player is -1. Then we pass these neutral states to our Monte Carlo tree search, and this part of the method is done. Now we change the search method of the parallel MCTS: it now takes states as input — the neutral states we just created — and we reorder things so that we compute the policy and the value first.
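A sketch of the start of the parallel selfPlay loop just described; SPG is the container class from above, self.mcts is assumed to be an MCTSParallel instance, and player starts at 1 as before:

```python
import numpy as np

return_memory = []
player = 1
spGames = [SPG(self.game) for _ in range(self.args['num_parallel_games'])]

while len(spGames) > 0:
    states = np.stack([spg.state for spg in spGames])               # one big array of boards
    neutral_states = self.game.change_perspective(states, player)   # works on the whole batch
    self.mcts.search(neutral_states, spGames)
    # ... action selection and per-game bookkeeping continue below
```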
We get the policy and the value for all of our batched states at once, so we get several policies and values back, but there are two things to keep in mind. First, we no longer need to call unsqueeze, because we already have a batch, so there's no need to create a fake batch axis. Second, we have to update the get_encoded_state method, because the way it's currently implemented only works if we pass it one state at a time: it creates the three planes — the fields where player -1 has played, the fields that are empty, and the fields where player +1 has played — and then, inside each plane, the fields themselves. But we want to change that order: the first axis should be the batch of different states, and then for each state we want the three planes — so the three planes shouldn't come at the very front, but after the batch axis. In other words, we have to swap the axes of the encoded state so that it also works when we pass several states instead of just one. First we check whether we were given several states, which we can do by checking whether len(state.shape) == 3: normally the shape of a single state has a length of two (rows, then columns), and if there's also a batch axis the length is three (batch, rows, columns). If that's the case, we create the new encoded state with np.swapaxes, passing in the old encoded state and swapping axes 0 and 1, so that the new shape is (batch size, 3, rows, columns) instead of (3, batch size, rows, columns) — we simply swap those two axes. We can copy this over to the Tic-Tac-Toe game as well; oh, and I noticed we wrote np.swapaxes with a typo, so let's fix that small mistake and update it for both games, and now this should work.
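The batched get_encoded_state just described, as a sketch:

```python
import numpy as np

def get_encoded_state(self, state):
    # three planes: player -1, empty fields, player +1
    encoded_state = np.stack(
        (state == -1, state == 0, state == 1)
    ).astype(np.float32)

    if len(state.shape) == 3:                             # a whole batch of states was passed in
        encoded_state = np.swapaxes(encoded_state, 0, 1)  # -> (batch, 3, rows, columns)

    return encoded_state
```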
Now we can keep updating the MCTS search method. We get policies and values back as the return; at the beginning we don't need the values, so we just take the policies, and we rename state to states. When we keep processing the policies it's fine to apply the softmax as before, but again we don't squeeze, since we want to keep the batch axis as it is, and when we add the noise to the policy we have to make sure the shape of the noise matches the shape of the policy, so we add a size argument equal to policy.shape[0] — this way we create noise for every state, i.e. for every self-play game. Now we have this processed policy ready, and the next step is to allocate it to our self-play games, so we first add the self-play games as an argument to the MCTS search (and pass them in where we call it). Now we can work with them: we loop over the games with for i, spg in enumerate(spGames). Inside each self-play game we first get the valid moves, and we get them from the state of that game, which is states[i] — the index from enumerate is the same index we used when we stacked all of our states together. So first we get the valid moves, and then we multiply the policy with them — but not the whole policy, just the policy at our position — so we also pick out the policy for this self-play game: spg_policy = policy[i]. We multiply this spg_policy with the valid moves and then divide it by its own sum, so that we end up with a distribution of percentages again. Then we create the root: we pass states[i] as the state, and we make sure the root is stored as spg.root, because we want to keep it inside the SPG object. Then we can expand it: spg.root.expand(spg_policy). This way we have already saved a lot of model calls, because we batched up all of our states. Now we want to do the same inside the actual search iterations: inside the num_searches loop we again loop over our self-play games, and for each self-play game we first get the node, starting from spg.root, and continue with the selection as before. Then we would call our model again, and we want to parallelize this model call too. The way we do that is by turning the if statement around: instead of checking whether we have not reached a terminal node, we check whether we have in fact reached a terminal one — if is_terminal, we backpropagate directly — and in all other cases we store the node we reached inside the SPG object.
So in the else branch we set spg.node = node. To make sure we can later distinguish the self-play games that reached a terminal node from the ones we still want to expand, we also set spg.node = None at the start of each search iteration, so we move that assignment out of the inner block; and I don't think we need the enumerate here any more, so let's remove it to make this easier to read. Now all of the expandable nodes are stored inside the SPG objects, and the first thing we want to find out is which self-play games are actually expandable. So we write expandable_spGames: a list of the mapping indices — the indices we'll later use when we allocate the policies and values back to our list of self-play games — for every self-play game whose node is expandable. That is, mappingIdx for mappingIdx in range(len(spGames)) if spGames[mappingIdx].node is not None, since those are the games we want to expand. This gives us the indices of all expandable self-play games. Next we check whether there are any at all: if len(expandable_spGames) > 0, and in that case we again stack up all of their states so that we can get the encoded states and then the policy and the value. So we write states = np.stack, taking spGames[mappingIdx].node.
state for mappingIdx in expandable_spGames — so we first get the list of states of the expandable nodes across all of our self-play games. Now that we have these states, we also want to encode them, and we can just copy the model call over from the top: we get the policy and the value back, and inside we use these stacked states. We don't have to unsqueeze, because we already have a batch; we don't squeeze afterwards, because we want to keep the batch as it is; and this time we don't add any noise. Now we loop over our expandable self-play games again — we don't strictly have to do this inside the if statement, but it doesn't change much. We want both the index i, so we can pick out the policy at the right position in the batch, and the mapping index, so we can allocate the policy at index i back to the self-play game at index mappingIdx — the enumerate index is not aligned with the index of our self-play games, because we only loop over a selected subset of them. So we write for i, mappingIdx in enumerate(expandable_spGames). Inside, we first get spg_policy as the policy at position i and spg_value as the value at position i. Then we get the valid moves, and for that we need the state of the node of the given self-play game, which we get through the mapping index: spGames[mappingIdx].node — actually, let's store that in a variable, node = spGames[mappingIdx].node, I think that's nicer — and we use node.state when we get the valid moves. Then we multiply spg_policy with the valid moves at the state of our node and again divide it by its own sum. (We could call .item() on the value, but I don't think it makes a difference here, since it's already a NumPy array, so we don't actually need it.) Finally we expand our node using spg_policy, and then we backpropagate — copying that call over from before — using spg_value.
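Putting the last few steps together, the batched expansion inside MCTSParallel.search might look like this sketch; the Node attributes and the expand/backpropagate methods are the ones built earlier in the video, and torch.no_grad is used here since no gradients are needed during the search:

```python
import numpy as np
import torch

expandable_spGames = [mappingIdx for mappingIdx in range(len(spGames))
                      if spGames[mappingIdx].node is not None]

if len(expandable_spGames) > 0:
    states = np.stack([spGames[mappingIdx].node.state for mappingIdx in expandable_spGames])
    with torch.no_grad():
        policy, value = self.model(
            torch.tensor(self.game.get_encoded_state(states), device=self.model.device)
        )
    policy = torch.softmax(policy, dim=1).cpu().numpy()
    value = value.cpu().numpy()

for i, mappingIdx in enumerate(expandable_spGames):
    node = spGames[mappingIdx].node
    spg_policy, spg_value = policy[i], value[i]

    valid_moves = self.game.get_valid_moves(node.state)
    spg_policy *= valid_moves
    spg_policy /= np.sum(spg_policy)

    node.expand(spg_policy)
    node.backpropagate(spg_value)
```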
Great — with this implementation the Monte Carlo tree search now works for the whole batch of parallel games. At the end we used to compute the action probabilities here, but I don't think that's helpful inside the MCTS class any more, since this part is hard to parallelize; instead we'll move it out and work with the action probabilities inside the selfPlay method. So search no longer returns action probabilities — it just updates the self-play games we pass in. Let's run this to make sure there are no immediate errors — 'torch is not defined' is just because I hadn't run the cells above, so let's quickly do that — okay, this is working for now. Now we want to update the part below the search call, and then the parallel implementation is finished. After the parallelized MCTS search we again loop over all of our self-play games: for i in range(len(spGames))[::-1]. Remember that we want to remove a self-play game from the list once it's terminal, and it's a bad idea to remove an item from a list while iterating over it from zero to its length, because then the index and the actual self-play game we want to reach no longer line up once we mutate the list inside the loop; we fix this by flipping the range around and iterating in reverse. Inside the loop we move all of the existing steps in, and the first thing we do is fetch the self-play game at the given position, spg = spGames[i]. When we compute the action probabilities we no longer loop over root.children directly, but over spg.root.children. We also don't append to a general memory, but to the memory of this self-play game, and for the neutral state we can reference either the stacked states or just the state of the root node, spg.root.state. Then we have the action probabilities, the player, the temperature action probabilities, and the sampled action, as before. Next we update the state: here we again use spg.state, so that we update the state inside the SPG object, passing the old spg.state, the action and the player. Then we get the value and is_terminal, again calling it on spg.state, and we check whether the game is terminal.
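A sketch of this per-game bookkeeping after the search, including the terminal handling and removal described next; child.action_taken, child.visit_count and the game methods are the ones built earlier in the video, so their exact names are assumed:

```python
import numpy as np

for i in range(len(spGames))[::-1]:          # reverse order, so we can delete while looping
    spg = spGames[i]

    action_probs = np.zeros(self.game.action_size)
    for child in spg.root.children:
        action_probs[child.action_taken] = child.visit_count
    action_probs /= np.sum(action_probs)

    spg.memory.append((spg.root.state, action_probs, player))

    temperature_action_probs = action_probs ** (1 / self.args['temperature'])
    temperature_action_probs /= np.sum(temperature_action_probs)
    action = np.random.choice(self.game.action_size, p=temperature_action_probs)

    spg.state = self.game.get_next_state(spg.state, action, player)
    value, is_terminal = self.game.get_value_and_terminated(spg.state, action)

    if is_terminal:
        for hist_neutral_state, hist_action_probs, hist_player in spg.memory:
            hist_outcome = value if hist_player == player else self.game.get_opponent_value(value)
            return_memory.append((
                self.game.get_encoded_state(hist_neutral_state),
                hist_action_probs,
                hist_outcome,
            ))
        del spGames[i]                        # drop the finished game from the list

player = self.game.get_opponent(player)       # flip the player for all remaining games
```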
If it is terminal, we no longer need the separate memory variable; instead we loop over spg.memory and append each entry to the return_memory we defined at the top, and we can delete the old return statement there. After we've looped over all of our self-play games we flip the player, and at the very end we return return_memory. I also noticed that once a self-play game is terminal we want to delete it from our list of self-play games, so that the while loop can terminate: we write del spGames[i], which shortens the list, and because we loop in the reverse direction it's still fine to keep iterating inside the for loop after removing a game. We also have to change MCTS to MCTSParallel here, and I noticed an error where we still return the action probabilities at the end — obviously we want to remove that line, since we only return return_memory and that statement was left over from the Monte Carlo tree search code we removed. Let's run this again, and now we can try to train our Connect 4 model using this parallelized AlphaZero implementation: we write AlphaZeroParallel here, and everything else stays the same. I believe the speed should now be up by about 5x, which is a massive improvement. So let's run the cell again — we get the nice progress bar — and I'll train it for these eight iterations. It will still take a few hours on my machine, but after that we'll have a model with a very good understanding of how to play Connect 4, and then we can evaluate it again and maybe also play against it in a nicer visual way — and maybe we'll lose against it; we'll see. Okay, I've now trained a network using this AlphaZeroParallel class for eight iterations on the game of Connect 4. It still took a few hours; I believe it could have been even faster if we had further increased the number of parallel games, but that's fine. Now we want to test it in our evaluation cell: first we change the path for our weights to model_7_ConnectFour, since we trained for eight iterations starting from zero, and then we also set the number of searches to 600, copying the value from the arguments above — which is still quite a small number compared to other search algorithms used in practice. So let's check whether we get nice results. I run the cell — oh sorry, I accidentally ran the wrong one — so let's run this one. Now we get the Connect 4 board, and I'll play in the middle; the model plays on top of it. I think we want to play on this side, so let's play 4; the model responds. I think we either want to play here or there, so I'll put some pressure on the model by playing 5. Now I think we're doing fine if we play 4 again, we just have to be careful, so let's play 4. Now we could also apply some pressure here, and then
the model can't play there, because then we could defend it. So I think I'd probably want to play 5 again — sorry, I'm very bad at this; it probably would have been much better to play somewhere else. Now we can't play there any more; we could still play here or there, and I think this spot is valuable, so let's play on top of that position and play 3. The model played here; we have some pressure there, so maybe we'd like to play there, but we also need to defend, so let's play 2, I guess — oh no, if we play there the model plays on top and we're trapped, so we can't play there any more. Maybe we could play here, but that's not great either; that one's not too valuable, so I guess we play there. Where did the model play? It played here, so I think we definitely have to respond — I don't know if we still lose — so let's play 5 now. Where did the model play now? There's some pressure here... oh, we've lost this way, and we've also lost that way. Absolutely destroyed. We just have to play 3 now, I guess — and yes, the model took the win; it could even have won over here as well. I played quite badly, I admit, but it was still very nice to see: even with only around 600 searches, our model was able to beat me comfortably in this game. Before we conclude the tutorial, there are still some minor mistakes I noticed. First, when we compute the temperature action probabilities in AlphaZero's selfPlay, we currently aren't actually using them as the sampling probabilities, so let's fix that and use them here. Also, when we use our optimizer in train, we really want to call self.optimizer, so that we use the optimizer stored on the class rather than a global variable, which might not work in all situations. Those are small things — let's apply both of them to AlphaZeroParallel as well: the temperature action probabilities here, and self.optimizer in the train method. Lastly, let's have our model play against itself, and let's get a nicer form of visualization than just printing out the board state. For this I want to use the kaggle_environments package, so I write import kaggle_environments, and let's also print out the version in case you're interested — that's my version right here. Using this package we can create an environment by setting env = kaggle_environments.make('connectx'), which gives us the standard Connect 4 environment (it's also tweakable, but here we just use the default). Next we want some players, which is simply a list of our agents, and then we call env.run with the players we've defined, followed by env.render with mode set to 'ipython', to render inside our Jupyter notebook.
Okay, so lastly we can let our model play against itself, and let's also get a nicer form of visualization than just printing out the board state. That is what we will do next, and for it I want to use the kaggle_environments package. So I will just write import kaggle_environments, and, you know what, let's also print out the version in case you're interested — that's my version right here. Using this package we can first of all create an environment: we set env equal to kaggle_environments.make and pass "connectx" to get the standard Connect 4 environment. It is also tweakable, but here we just take the standard Connect 4 setup. Next we want some players, which should just be a list of our agents, and then we can call env.run with the players we have defined, followed by env.render with mode equal to "ipython", since we are inside a Jupyter notebook. Now we also want to define our players, so here I've created this KaggleAgent class and copied it over. Basically it does some preprocessing on the state and then either calls our MCTS search or gets a prediction directly from our model, depending on our arguments — so something similar to what we do up here in this loop. Before we create these players as KaggleAgents, we also need the model, the game, and the arguments, so let's copy them over from above and paste them here. First of all we have our game of Connect 4; we don't need a player here. Then we add some arguments: first we set search to True, and we also need a temperature. The way this is built, when temperature equals zero we skip the division that would give us the updated policy and instead take the argmax directly, so let's set temperature to zero, meaning we always get the argmax of our policy and of the MCTS distribution guided by our model. Let's also set epsilon to 1 so that we still have some randomness. Next we have our device, we create our model like this, and we use this path right here, so this should be fine. Now we can actually define our players: let's set player one equal to a KaggleAgent, and as I said we need the model, the game, and the arguments, so let's pass model, game, and args just like this; then let's do the same for player two, which gives us more flexibility if we want to try different players. Finally, we fill the players list with player_one.run followed by player_two.run.
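As a sketch of how these pieces fit together in the cell: kaggle_environments.make, env.run, and env.render are real calls from the package, while KaggleAgent, model, game, and args are the objects described above, whose exact signatures are assumptions here:

```python
import kaggle_environments
print(kaggle_environments.__version__)

# Build the standard Connect 4 ("connectx") environment from the package.
env = kaggle_environments.make("connectx")

# KaggleAgent wraps our trained network: its .run method acts as a Kaggle agent,
# preprocessing the observation and asking MCTS (or the raw network) for a move.
player1 = KaggleAgent(model, game, args)
player2 = KaggleAgent(model, game, args)
players = [player1.run, player2.run]

# Play one game between the two agents and render it as an animation in the notebook.
env.run(players)
env.render(mode="ipython")
```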
This should be working now, so let's just run the cell and get these nice visualizations. We get this neat animation of our models playing against each other — there are two players, but it is really one model playing against itself — and I believe this should come to a draw. Yes: our model is advanced enough that it can defend against all attacks, so this is just a nice animation. Now we can briefly do the same for Tic Tac Toe. We only have to change a few small things: first we set the game to Tic Tac Toe, then we adjust our arguments, so let's set the number of searches to 100, and for our ResNet we set the number of res blocks to four and the hidden dimension to 64. Then we want to update the path — I think the last Tic Tac Toe model we trained was model_2.pt, if I'm not mistaken — and we also have to switch the environment here over to Tic Tac Toe as well (a short sketch of these changes follows below). Then we can just run this. We immediately get this nice animation of our models playing against each other, and again it ends in a draw because the model is able to defend against all possible attacks, which is still very nice. Feel free to do some further experiments on your own — for example, you could set search to False so that your neural networks play against each other directly, without doing these 100 searches per move. And with that, I think we're finished with this tutorial. This was a lot of fun. I've created a GitHub repository where a Jupyter notebook is stored for each checkpoint, and there is also a weights folder with the last model for Tic Tac Toe and the last model for Connect 4 that we trained. If there are any questions, feel free to ask them either in the comments or by sending an email, for example. I might also do a follow-up video on MuZero, since I've built that algorithm from scratch as well, so if you're interested in that it would be nice to let me know. Thank you.
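For reference, here is a rough sketch of the Tic Tac Toe variant of that evaluation cell mentioned above. The class names, argument keys, checkpoint filename, and ResNet constructor signature follow the conventions described in this tutorial but are assumptions rather than the exact code:

```python
import torch
import kaggle_environments

# Switch the evaluation cell over to Tic Tac Toe.
game = TicTacToe()

args = {
    'search': True,          # set to False to let the raw networks play without MCTS
    'num_searches': 100,     # fewer searches are enough for the smaller game
    'temperature': 0,        # always take the argmax of the MCTS distribution
    'dirichlet_epsilon': 1,  # keep some randomness, as discussed above
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ResNet(game, 4, 64, device)   # 4 res blocks, 64 hidden units (assumed signature)
model.load_state_dict(torch.load("model_2.pt", map_location=device))  # last Tic Tac Toe checkpoint
model.eval()

env = kaggle_environments.make("tictactoe")
player1 = KaggleAgent(model, game, args)
player2 = KaggleAgent(model, game, args)
env.run([player1.run, player2.run])
env.render(mode="ipython")
```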