Concept Learning with Energy-Based Models (Paper Explained)

**Optimization and Learning with Energy Functions**

The paper under discussion, "Concept Learning with Energy-Based Models" by Igor Mordatch (OpenAI), builds on the idea of energy functions. An energy function takes one or more inputs and outputs a low value (ideally zero) when those inputs are compatible with one another and a high value when they are not. Almost any machine learning problem can be phrased this way: learning means fitting the parameters of the energy function so that training examples receive low energy, while inference means holding some inputs fixed and running gradient descent on the remaining ones until the energy function is satisfied.
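
As a concrete illustration (a toy construction of my own, not code from the paper), consider an energy function that is zero exactly when its two inputs are compatible; inference is then gradient descent on one input while the other is held fixed:

```python
# Toy example: E(x, y) = (y - 2x)^2 is zero exactly when the pair (x, y)
# is "compatible" (y equals 2x) and positive otherwise.

def energy(x, y):
    return (y - 2.0 * x) ** 2

def infer_y(x, y0=0.0, lr=0.1, steps=100):
    """Infer a compatible y by descending the energy with respect to y."""
    y = y0
    for _ in range(steps):
        grad = 2.0 * (y - 2.0 * x)  # analytic dE/dy
        y -= lr * grad
    return y

print(infer_y(3.0))  # converges toward 6.0, where energy(3.0, y) == 0
```

The same descent could instead be run on `x` with `y` fixed, which is exactly the symmetry the paper exploits: any subset of the inputs can be inferred from the rest.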

**Complexity of Optimization**

The complexity of optimization is often underestimated: a single forward pass rarely yields an optimal solution when multiple interdependent variables are involved. The authors argue that models which run an optimization procedure at inference time are more powerful than models that produce their output in a single one-shot forward pass, because iterative optimization lets the model account for dependencies between variables and thereby make more informed decisions.
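
A hypothetical sketch of why iteration helps with coupled variables: here the energy wants two entities to sit at unit distance, and each gradient step updates each variable using the other's current value — a dependency a single independent prediction per variable cannot capture.

```python
# Two coupled variables p and q with energy E(p, q) = ((q - p) - 1)^2,
# i.e. the model is "happy" when q - p == 1. Joint gradient descent lets
# each update depend on the other variable's current position.

def energy(p, q):
    return ((q - p) - 1.0) ** 2

def infer(p=0.0, q=0.0, lr=0.1, steps=100):
    for _ in range(steps):
        d = (q - p) - 1.0
        p -= lr * (-2.0 * d)  # dE/dp = -2((q - p) - 1)
        q -= lr * (2.0 * d)   # dE/dq =  2((q - p) - 1)
    return p, q

p, q = infer()
print(q - p)  # approaches 1.0 as the energy is minimized
```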

**Autoregressive Models**

The authors draw an analogy to autoregressive versus non-autoregressive models in natural language processing (NLP). An autoregressive model produces a sentence one word at a time, so each word can depend on the words before it; a non-autoregressive model produces all words simultaneously, so no word can depend on any other, which often yields incoherent sentences. The analogy underscores how important dependencies between variables are to output quality.
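
The dependency structure can be sketched with a deliberately tiny bigram "language model" (hypothetical, for illustration only): each word is chosen conditioned on the previous one, which is precisely what a one-shot decoder gives up.

```python
# Toy bigram model: each word depends only on its predecessor.
NEXT_WORD = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "down"}

def generate_autoregressive(steps=4, start="<s>"):
    """Generate left to right; each step conditions on the previous word."""
    out, prev = [], start
    for _ in range(steps):
        prev = NEXT_WORD[prev]
        out.append(prev)
    return out

print(generate_autoregressive())  # ['the', 'cat', 'sat', 'down']
```

A non-autoregressive decoder would have to pick all four positions from independent marginals, with no way to guarantee they cohere into one sentence.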

**KL Objective**

The authors also mention a KL objective, a regularization term added to the training procedure of the energy-based model. They do not delve into its specifics, focusing instead on the practical applications and results obtained with these energy functions.

**Demonstration and Transfer Learning**

A demonstration of the power of energy functions is presented in which the model infers concepts such as shapes, distances, and quantities from example events and then transfers them to new situations, including a simulated robot environment. After training on a dataset of such examples, the model can take a single demonstration, infer the underlying concept as a vector, and apply that concept in a novel situation, without the concept ever being labeled explicitly. This demonstrates the potential for transfer learning in energy-based models.
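
A minimal sketch of this transfer (my own toy construction, not the paper's relational-network architecture): infer a concept — here a single scalar, the target distance between attended entities — from a demonstration by gradient descent, then reuse it on a new scene to identify which pair of entities satisfies it.

```python
import itertools

# Energy is low when the attended pair of 1-D entities sits at distance w.
def energy(points, pair, w):
    i, j = pair
    return (abs(points[i] - points[j]) - w) ** 2

# Demo event: infer the concept w from (state, attention) by descending E.
def infer_w(points, pair, w0=0.0, lr=0.1, steps=200):
    i, j = pair
    d = abs(points[i] - points[j])
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * (w - d)  # dE/dw
    return w

# Test event: identify the attention (pair) most compatible with w.
def identify(points, w):
    pairs = itertools.combinations(range(len(points)), 2)
    return min(pairs, key=lambda p: energy(points, p, w))

w = infer_w([0.0, 2.0, 5.0], pair=(0, 1))  # demo pair at distance 2
print(identify([0.0, 1.0, 3.0, 7.0], w))   # -> (1, 2), the pair at distance 2
```

The concept is never named "distance two" anywhere; it exists only as the inferred value `w`, mirroring how the paper transfers unlabeled concept vectors between events.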

**Recognizing Geometric Shapes**

The authors describe the ability of their model to recognize geometric shapes such as squares, circles, and triangles. The model learns to identify these shapes from the spatial arrangement of the entities, allowing it to optimize its behavior accordingly.
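
A toy version of a shape concept (illustrative only, not the paper's learned energy) makes the generation mode concrete: for a "line" concept, the energy can be the squared parallelogram area spanned by three attended points, which is zero exactly when they are collinear, and gradient descent then pushes a point onto the line.

```python
# Toy "line" concept: energy is zero iff the three 2-D points are collinear.

def line_energy(p1, p2, p3):
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    area2 = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    return area2 ** 2

def generate(p1, p2, p3, lr=0.05, steps=200):
    """Move p3 by gradient descent until (p1, p2, p3) form a line."""
    (x1, y1), (x2, y2) = p1, p2
    x3, y3 = p3
    for _ in range(steps):
        area2 = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
        x3 -= lr * 2.0 * area2 * (y1 - y2)  # dE/dx3
        y3 -= lr * 2.0 * area2 * (x2 - x1)  # dE/dy3
    return (x3, y3)

p3 = generate((0.0, 0.0), (1.0, 0.0), (0.5, 1.0))
print(p3)  # the third point is pushed down onto the line y == 0
```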

**Spatial Arrangement and Proximity**

The authors highlight the importance of spatial arrangement in recognizing concepts. For example, the model can determine whether a set of points forms a line, a circle, or a square by analyzing how the points are arranged relative to one another, and it can judge which entities are close to or far from each other. This demonstrates the ability of energy-based models to reason about spatial relationships between variables.

**Optimization Procedure**

The authors illustrate the optimization procedure used by their model: the attention mask initially spreads over many points and then gradually stabilizes on the specific points or regions that satisfy the concept. Running this procedure for multiple steps lets the model converge toward solutions that a single forward pass cannot reach.
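
One way to sketch this stabilization (an assumption on my part about the mechanics, not the paper's exact parameterization) is to represent the attention mask as a softmax over logits and descend an attention-weighted energy; the mask starts uniform and gradually concentrates on the best point.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Attention-weighted energy: each point's cost is its squared distance to a
# target; the energy is low when attention sits on points near the target.
points = [0.0, 4.0, 9.0]
target = 4.2
costs = [(p - target) ** 2 for p in points]

def infer_attention(lr=0.5, steps=200):
    z = [0.0] * len(points)  # uniform attention to start
    for _ in range(steps):
        a = softmax(z)
        e = sum(ai * ci for ai, ci in zip(a, costs))
        # For E = sum_i a_i * c_i with a = softmax(z): dE/dz_k = a_k (c_k - E)
        z = [zk - lr * ak * (ck - e) for zk, ak, ck in zip(z, a, costs)]
    return softmax(z)

a = infer_attention()
print(a)  # attention concentrates on points[1], the point nearest the target
```

Early in the loop the mask is spread over all points, exactly the "attention everywhere, then stabilizing" behavior described above.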

**Counting Quantities**

The authors demonstrate that their model can recognize quantities such as one, two, or more than three attended entities. This is achieved through a combination of recognition and optimization, allowing the model to generalize beyond specific examples and apply the counting concept to novel situations.
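
A hypothetical minimal form of a counting concept: treat the concept as a scalar compared against the total attention mass, infer it from a demo mask, and use it to pick the compatible mask in a new scene. (This is my own toy reduction, not the paper's formulation.)

```python
# Toy counting concept: energy compares the number of attended entities
# (total attention mass) to a scalar concept w.

def count_energy(attention, w):
    return (sum(attention) - w) ** 2

def infer_w(attention, w0=0.0, lr=0.1, steps=200):
    n = sum(attention)
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * (w - n)  # dE/dw
    return w

w = infer_w([1, 1, 1, 0])  # demo: three entities attended
candidates = [[1, 0, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
best = min(candidates, key=lambda a: count_energy(a, w))
print(best)  # the mask attending exactly three entities
```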

**Future Directions**

While the paper only demonstrates the effectiveness of energy functions on toy problems, the authors suggest that this direction holds significant promise for future machine learning research. The main obstacle is computational: training requires backpropagating through the inference-time optimization procedure, so an optimization of fifty steps costs roughly fifty forward passes per training example. If efficient ways of backpropagating through optimization are found, extending energy-based models to real-world applications becomes a promising direction for ongoing research.

**Conclusion**

In conclusion, energy functions have shown great promise in optimizing model performance on various tasks. By leveraging inference-time optimization and demonstrating the ability to recognize geometric shapes, spatial relationships, and quantities, these models offer a powerful toolset for researchers and practitioners alike. While the current applications are limited to toy problems, the potential for expansion into real-world scenarios makes this an exciting area to explore further.

"WEBVTTKind: captionsLanguage: enhi there what you're seeing here is an energy based model that learns the concept of a shape from a demonstration on the left so on the left you can see a demonstration of data point sampled from a shape in these cases circles or squares and then the corresponding energy function that the model in first from that and then it can replicate that shape on the right using that energy function so the paper we're going to analyze today is called concept learning with energy based models by yegor more Dutch of open AI and this is a very cool paper or at least I think it's a very cool paper but it is also a very hard paper so therefore first I want to kind of make a bit of an introduction into the concepts that we are facing in this paper so the first thing you need to know are energy functions or energy based models what is an energy function an energy function sometimes called e is simply a function with one or multiple inputs let's call them X and you can make the if the energy function is happy with X it will be the value 0 and if the energy function is not happy with X it will be a high value like larger than zero so this is happy this is not happy so let's give some examples of this we can formulate almost any machine learning problem in terms of an energy function let's say we have a classifier the classifier is takes as an input image here may be of a cat and the label so if the label is cat then the energy will be zero if the energy function is of course working correctly and if but if we give the energy function the same image but we give it a wrong label dog then it is very high in the case of the classifier of course we can to simply take the loss function as the energy function and we are altom automatically get an energy based model so the loss function here would be something like the negative log probability of the of the sorry of the correct class but in any case it is just going to be a high number let's call it 10 to the 
9 so the energy function says ha this is very this is very bad this thing here is very bad the entire thing you input it won't tell you yet what's bad about it so that also means you can change any of the two things to make the classifier happy now usually we're concerned with changing the label it's like tell me which other label do I need to input to make you happy and if we make the labels differentiable of course we never input the true label we actually input like a distribution softmax distribution over labels and that's a differentiable we can use gradient descent to update the dog label we can use gradient descent to find the label that would make the energy function more happy so we could use gradient descent to get the cat level if we had a good classifier but we can also we can also optimize the image to make it compatible with the dog label right that's things that if you ever saw deep dream or something like this those models do exactly that they optimize the input image for a particular label and there you can view the entire neural network including the loss function as the energy function so what's another example another example is let's say you have a k-means model and the energy function is simply input a data point and for the data point what you're going to do is you're going to find the min cluster index the min Kay over you know you have your multiple clusters here and your data point might be here so you're going to fight the cluster that's closest and then the distance here this since Dee will be the energy of that so the model is very happy when your data point it comes from one of the clusters but your model is not happy when the data point is far away and that would be the cost function of the k-means function so that's an energy based model too now currently energy based models have come into fashion through things like Gans or any conservative noise contrastive estimation so in a jet in a gam what you have is you have a discriminator 
and the discriminator will basically learn a function to differentiate data from non data so that by itself is an energy function so the discriminator will learn a function and that function will be low wherever the discriminator thinks there is a data right so it will usually do this around the data point so the data points form the valleys right here and then the generator will basically take that discriminator function and will try to infer points that are also in these valleys to produce points that are also in the valleys and then you basically have an energy learning competition the discriminator now tries to push down on the energy where the true data is and push up on the energy where the generated data is and that will give you basically a steeper energy based function in the future I hope so in this case the discriminator neural network is the energy function and the D generator just tries to produce data that is compatible with that energy function so I hope that concept of what an energy function is is a bit clearer and in any again any machine learning problem can be formulated in terms of an energy function now what is not done so far is what we alluded to a little bit before in the classifier example and also here so right now when we want to Train again we simply take the generator to whose data now what's the generator skull the generators goal is to hit those valleys in the energy function and we produce a generator into in one shot produce this data but what we could also do is of course we could just start somewhere let's say here we pick a random data point and then we use gradient descent because the energy function in this case is smooth we use gradient descent to just drop down this valley and then find ourselves in this valley so without ever training a generator we can use this methods to produce points that are in the valley of the energy function right and this I don't know if people I can guess people have trained gams like this the 
reason why it doesn't work let's say in the real world is because that procedure will just produce adversarial examples for the discriminator and those usually look like nothing like data because if you keep the discriminator just stable and gradient descent against it what you'll get isn't really qualitatively good but in principle if the discriminator was a good energy function for the date to describe the data we could use gradient descent the same up here in order to find a good label for an image given that we have a good energy function right so this is that we could simply gradient descent on the label in order to find a better in order to find a better label so in this paper we're going to have a situation where we say we're given an energy function and we're given a bunch of inputs they are then called X a and W and if I have my energy function already if I have given my energy function and I have given two of those three things any two right I can infer the last thing simply by gradient descent on my energy function because I know the energy function is zero when all of these when the energy function is happy with the input so when all of these things agree basically the energy function is happy it will output zero otherwise it will output a high value therefore if I given any of those two any two of those three things I can find a compatible third thing by descending and then of course over here in these machine learning problems the task was always actually to learn an energy function right so usually in the training dates that we are given images and labels and we want to learn this energy function which would be parameterized so we want to learn the parameters and the same here in our general case if we are now given three things but we are not given the parameters of the energy function we don't know what those are as long as we're given all of the inputs and our training data set and our training data set guarantees these are actually you know these 
are inputs that are compatible with each other the energy function should below we can simply gradient descent on the parameters of the energy function so in a sense there are four things right there are these three inputs and then there are the parameters of the energy function if we're given any three of those four we can gradient descent on the rest and that's going to be the basis so the X here is going to be the so-called state and the state in this paper is going to be images of entities the entities sorry is not going to be images but the entities are these little circles that you're going to see and each of those entities can have an exposition a Y position and I believe a color so our G and B so each of those can have that and then the concatenation of all of those attributes is a one big vector and that is your X that's your state so state is number of entities and their attributes a is going to be an attention mask over the state so a is going to be here you have four entities so a will have four entries telling you which of these entities you should pay attention to right now and W is going to be a concept vector so called so W is going to be the embedding of a concept now what a concept is in this case is very general I can give you an example one a concept is do any of do the entities that the a pays attention to are they close to each other so in this case you see we have two entities that a has a high value one and this is this all up here and this ball down here now if the concept vector is the embedding for the concept of being close to each other then the energy function would be very happy if those two things are close to each other and it would be very unhappy if those two things aren't close to each other but in the very same situations of the same X the same attention mask but a different concept so a different W vector right here then the the energy function would be maybe very happy if the two things are far apart and maybe unhappy if the 
two things are close so the question is always how are the three things that you put into the energy function compatible with each other and given all but one of these things you can infer the other so let's say you have a perfect energy function for this this all of these for this situation you're just given the energy function you can trust it and you are given let's make an example you are given the X so you're given the state I'm gonna draw the state down here right okay this is the state and you were given the W and the W is the embedding it's a vector but the embedding space but the embedding is for a line right so the geometric the geometric unit of a line now your task is to find a the attention mask that will make the energy function happy and as you can see right here while you would do is you would put a lot of weight on this this this and this ball and no weight on that ball because those make a line and since everything here is differentiable so the state is differentiable the attentions differentiable and the concepts or vectors that are differentiable you can use gradient descent to find that another example if you're given again the same W so line and you were given this following thing and you were given now you're given the attention on these three and you say please find the X please find the X the states that makes this energy function happy now this here you would call the starting state the x0 your ear task is going to be find the x1 find the state how do you have to change this state such that the energy function is happy and of course the answer is going to be is to push this ball here inward until it is in the middle of the two others so the three form a line right these three form a line you don't you don't have to do anything to this ball up here because there is no attention on it and the attention it's only is the concept for the things that you put attention on yeah and the state are those three in agreement and the energy function is 
happy okay we have covered the basics now let's dive into the paper I think this is the longest introduction ever but I think it will pay off once see so they they specifically or this this author I think it's a single author identifies two different things that you can do with an energy function here of course you can do more as we saw but they identify two so here is where you have given the initial state and an attention mask and you want to find the x1 the state that satisfies the concept and the tension the most this the author calls generation as you can see here these four things that you have the attention on are pushed around until they make a square because the concept right now is square and in the other case where you are given this x0 and x1 just call this X right here just call this thing X if you're given those two and you are given the concept Square and you're tasked with finding a the attention mask of course you're going to put the attention on these right here and that is going to happen through gradient descent again we're not learning a model to give you that attention like in again we're learning a generator to just one shot give it to you right now what we're going to do is we're going to gradient descent optimize on our smooth energy function to give us that perfect attention mask that satisfies the energy function all right so this is the difference right here gradient descent is part of the output procedure of the model usually we just use it to learn and we learn a one-shot model but here gradient descent is part of the model so they introduce energy functions here and they say okay we can have a policy on X so if we're given a concept W and if we're given an A we can have a policy over X which basically means we can find X's that are compatible with that by running gradient descent here you see there is an XK minus 1 and we are running gradient descent on the energy function with respect to X to find a better X that satisfies the energy 
function given those inputs and the same if we want to find an attention mask we are running gradient descent on the attention mask again in order to satisfy the same energy function so you see the inputs are both times the same the concept here we can input square here we can input square but the difference is what we're running gradient descent on and what we keep constant and I would get I would add a third line here actually because we can also if we're given an X and an A we can also infer a W and that's going to be an integral part so if I have this right here and this situation and I have say I have a tension on these for now I can ask the model so I'm given X and I'm given a I can ask the model to infer W and the model should ideally output ha this is square now the model isn't going to output square the model is going to output a vector representation of square right so the model is going to output square but as a vector of numbers because that's how we've trained it W isn't it is the embedding but what we can then do later is we can say okay I'm not going to tell you it's a square you just come up with a vector W it describes this situation and now I'm going to take that vector W that you came up with miss mister or missus model and I'm going to take tell you a new situation this situation right here and I'm going to now give you X and I'm going to give you the W that you yourself have output and now please tell me what's the a and then the model is of course supposed to tell you oh these four here or the a so without without ever telling that it should be a square what you can do is you can let the model infer a W from one example situation and then transfer that W to a new situation so it can identify you can just say whatever concept I have up here please apply that same concept which is the W down here and this is the entire paper now this is the concept learning through energy based models okay so that is kind of a third line I would add down here 
you can infer a concept vector if you're given the X and the a so in order to do all this their energy function is going to be a so called relational neural network so what you'll have is you'll have a simple neural network a multi-layer perceptron that always connects two entities to each other with the concept vector and then this is a belief a sigmoid that connects the attention masks of the two and then you simply sum over all pairs of two entries in your model and then you send that through an MLP sorry through an MLP again this I believe is not so important it's just important that they can feed this entire situation the X the a and the W they can basically feed into a neural network in the neural network comes up with a number of how well those three things fit together and then you can transfer these concepts that's pretty cool now the only question is of course and we've always said we're given an energy function or just we just have it but of course this is a neural network and the neural network has parameters and the parameters we don't know what good parameters are at the beginning so we need to train this thing and again the reason why these are toy problems right here is I mean we'll get to why it's computational but this is kind of a new field I believe in machine learning at least I come from classical machine learning and we only ever have used like SGD to train and we only have her have produced models that one shot produce something and here we this is a I believe there's a new concept where you use gradient descent as part of the output and that makes a lot of trouble and so that's why we work in toy problems so what this this here is the situation I described you have a demo event where you're given the X and the a and you're supposed to infer the W so the question here is what's the W and the model will come up with a W and you're not gonna do anything you're not right now you're simply gonna take that W and tell it oh well here is a so 
called test event so please apply the W you came up with in this test event and please find me the the a in this case that satisfies the W and the X I give you here and of course the a right here is as you can see even you don't know that it's a square and the actual concept here is move the grey ball to the middle of the square right that that is it here but no one has told me this I just looked at the picture so the the correct answer here would be to place attention on those four things and then to take this thing and move it to the middle right here in in the in this over here so that would be the correct answer now the question is how do you train something like this and they they show that they so this is the loss function right here the loss function is they give you a concept and an initial situation and you're supposed to infer the x1 and the a and the loss function is simply the negative log likelihood of that but what does that mean so will will make it easier if if you have this this procedure right here where you have demo event this up here this is demo and this is a test event how are you going this entire procedure how are you going to learn the energy function well in this case this entire procedure this entire thing is one training sample sample but usually we have input and label and now here it's much more complicated because so we have input okay that's this X and this a cool but then we have SGD as integral part of the procedure to determine the W and now what we could do is just apply a loss to the W but we don't because we don't know what the embedding space for the concepts is we could maybe train a classifier but in this case we want to train the ability to transfer these concepts so our training sample needs to be one time transferring a concept so SGD for one is part of our process here and not only that but then this this X here of course is also part of our training sample write this up here as X 0 and this here is X 1 and now we need 
to find this a this attention mask and that is an SGD again remember inferring anything through the energy function is a gradient descent process so ultimately our one training example consists of X 0 a at the beginning so let's call that a zero it consists of the SGD procedure to find W it consists of X 1 and they consist of the SGD procedure to find a the a 1 the output a and then that will give us the output a the a 1 so this here is our input in the classical machine and this would be our X and this here would be our label Y and that's what we trained on we trained so it such that the output right here the this is of course sorry this is of course the Y hat this is what we predict and in the training sample we just write a little generator that will you know make this situation that knows what the concept is right it will say okay I'm gonna make an example for a square then it make this will make the attention mask for a square and then it will make the new situation again with a square but not tell us the attention mask there and it will make the attention mask into the true Y so at the end we can compare what our model output the attention mask we output here without ever knowing that it should be a square and we have the true label which comes out of the generator that at the beginning decided that it should be a square and then the loss in the distance between those two that's our loss this is an in this is an enormous procedure to get a loss and most crucially you have to back propagate through optimization procedures and this is something that we just can't do yet in our models if you take an image a resonate 50 right right now we do one forward propagation to get a label in this procedure if you had two back propagate through the optimization procedure for each sample you would need to basically back propagate through 50 forward passes of the resonate if you if your optimization procedure is 50 steps long and that is just not feasible right now so that's 
why we don't do it but I believe maybe once we find a smart way of back propping through optimization procedures a whole lot of these things will become the new and new wave and machine learning I really I'm excited - I'm pretty sure it doesn't work yet and this is very figley fiddly work but I'm excited by the prospect that we can do this so this is the training procedure right you've given X 0 x1 and a and you Optima is in order to infer the concept behind it right the generator that your level generator of your training data it knows the concept it has a concept in mind when it generated this but you're not telling your model what the concept is it needs to infer that and then using the model thing that the model inferred you can either give it x0 and x1 and infer a or you can give it the X and the a and in forex you can do either of those right these are called identification or generation respectively and then you compare the output here to what the generator at the beginning thought again it's not telling you it's that's because that's the label and you compare this to that and that will be your loss to train your energy function parameters so your training samples if you think of this entire thing as one forward pass of the model then it's just classic machine learning right you have a training sample which is one forward pass and you have a corresponding label that you infirm so let's jump to the experiments right here experiments are actually pretty cool so what they've done is for example have taken the concept of being far apart from something now being far apart so that the little X needs to be as far away as possible from the ball that has the attention on it so if you do generation and you start the little X right here and you ask the model where please infer the next state of the world it will push that little X away right here and in color you can see the energy function valleys of the position of the X so it pushes it away from this thing but if 
you take the same concept embedding the concept embedding of being far away but you don't do generation you do identification which means you infer the a then it will simply tell you that this ball right here is the furthest away from the X right so you can do all sorts of things like this and transferring concepts I find this here pretty interesting so they had to have two different concepts one concept is read as an identification you need to identify the red ball but the other concept is you need to turn something red right you need to take a ball that is maybe now blue and of course the color you can gradient descent on the colors you'd need to make it red and since the energy function it just takes three input X a and W it doesn't you you you're not going to tell it right now in which situation you are it has to create create this W embedding space through learning and if you do it with those two concepts then it will put the make something red concept and the is something red concepts in the same places so this is a PCA and then blue I think these blue is the attention codes for identify the red things and in red or the generation code for make something red and they will be put in the same place which is pretty cool it means that the energy function really learns the feature of something being red I find this pretty pretty neat and then here they they have some experiments where they basically show we need that gradient descent optimization procedure because only after many steps will will the energy function basically be aligned with the concept that you want so if you have a zero shot model like just one forward pass as we do here you'll see that the energy function that is supposed to make a circle from samples right this is the example concept right here it if you just have a one shot model it will it cannot but in this case at least it doesn't learn to one shot produce only if you opt in for a few steps will it get this so you optimize at inference time 
Hi there! What you're seeing here is an energy-based model that learns the concept of a shape from a demonstration on the left. On the left you can see data points sampled from a shape, in these cases circles or squares, and then the corresponding energy function that the model infers from that, and then it can replicate that shape on the right using that energy function. The paper we're going to analyze today is called "Concept Learning with Energy-Based Models" by Igor Mordatch of OpenAI, and this is a very cool paper, or at least I think it's a very cool paper, but it is also a very hard paper. Therefore I first want to give a bit of an introduction to the concepts that we are facing in this paper. The first thing you need to know about are energy functions, or energy-based models. What is an energy function? An energy function, sometimes written E, is simply a function with one or multiple inputs, let's call them x.
If the energy function is happy with x, it will output the value 0, and if the energy function is not happy with x, it will output a high value, larger than zero. So zero means happy, high means not happy. Let's give some examples of this. We can formulate almost any machine learning problem in terms of an energy function. Say we have a classifier. The classifier takes as input an image, maybe of a cat, and a label. If the label is "cat", then the energy will be zero, assuming the energy function is working correctly. But if we give the energy function the same image with a wrong label, "dog", then the energy is very high. In the case of the classifier we can simply take the loss function as the energy function, and we automatically get an energy-based model. The loss function here would be something like the negative log probability of the correct class, but in any case it is just going to be a high number, say 10^9, and the energy function says: this input is very bad. Note that it judges the entire input, so it won't tell you yet what's bad about it. That also means you can change either of the two inputs to make the classifier happy. Usually we're concerned with changing the label: tell me which other label I need to input to make you happy. And if we make the labels differentiable — of course we never input the true label, we actually input a softmax distribution over labels, and that's differentiable — we can use gradient descent to update the "dog" label, that is, use gradient descent to find the label that makes the energy function happier. If we had a good classifier we would recover the "cat" label. But we can also optimize the image to make it compatible with the "dog" label. If you ever saw Deep Dream or something like this, those models do exactly that: they optimize the input image for a particular label, and there you can view the entire neural network, including the loss function, as the energy function.
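To make this concrete, here is a minimal numeric sketch of that idea: the classifier's cross-entropy loss used as an energy, with gradient descent run on a softmax label distribution instead of on the network weights. The classifier probabilities are made-up toy numbers, not from the paper, and the gradient is written out by hand.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical classifier output for one image (toy numbers, not from the paper).
# Classes: 0 = cat, 1 = dog, 2 = bird.
log_p = np.log(np.array([0.7, 0.2, 0.1]))

def energy(y):
    # Cross-entropy of a soft label y against the classifier: low iff y says "cat".
    return -np.dot(y, log_p)

z = np.array([0.0, 5.0, 0.0])   # start from the *wrong* label "dog", as logits
e_start = energy(softmax(z))

for _ in range(500):
    s = softmax(z)
    grad = -s * (log_p - np.dot(s, log_p))   # dE/dz for E(z) = -softmax(z) . log_p
    z -= 1.0 * grad

y = softmax(z)
assert y.argmax() == 0       # descent on the label recovered "cat"
assert energy(y) < e_start   # and the energy went down
```

The image stays fixed the whole time; only the label moves. Optimizing the image for a fixed label, Deep-Dream style, would be the same loop with the roles swapped.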
What's another example? Say you have a k-means model. The energy function simply takes a data point, and for that data point you find the closest cluster — the min over k, where you have multiple clusters and your data point might be here — and then the distance d to that closest cluster will be the energy. So the model is very happy when your data point comes from one of the clusters, but it is not happy when the data point is far away from all of them, and that is exactly the cost function of k-means. So that's an energy-based model too. Now, energy-based models have recently come back into fashion through things like GANs and noise contrastive estimation. In a GAN you have a discriminator, and the discriminator learns a function to differentiate data from non-data — that by itself is an energy function. The function the discriminator learns will be low wherever the discriminator thinks there is data, so it will usually be low around the data points: the data points form the valleys. The generator will then basically take that discriminator function and try to produce points that are also in those valleys, and then you have an energy-learning competition: the discriminator tries to push down on the energy where the true data is and push up on the energy where the generated data is, and that will give you a steeper energy function over time. So in this case the discriminator neural network is the energy function, and the generator just tries to produce data that is compatible with that energy function. I hope the concept of what an energy function is, is a bit clearer now.
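The k-means energy just described is simple enough to write down directly. This is a toy sketch with made-up cluster centers, just to show the zero-when-happy, high-when-unhappy behavior.

```python
import numpy as np

# Cluster centers of a (hypothetical) already-trained k-means model.
centers = np.array([[0.0, 0.0],
                    [5.0, 5.0]])

def energy(x):
    # E(x) = distance to the nearest cluster center: the min over k.
    return np.min(np.linalg.norm(centers - x, axis=1))

assert energy(np.array([0.0, 0.0])) == 0.0     # on a cluster: happy
assert energy(np.array([5.1, 5.0])) < 0.2      # near a cluster: almost happy
assert energy(np.array([20.0, 20.0])) > 10.0   # far from all clusters: unhappy
```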
Again, any machine learning problem can be formulated in terms of an energy function. Now, what is not usually done is what we alluded to a little bit before in the classifier example. Right now, when we train a GAN, we take the generator, whose goal is to hit those valleys in the energy function, and the generator produces this data in one shot. But what we could also do is just start somewhere — say we pick a random data point here — and then, because the energy function in this case is smooth, use gradient descent to drop down into this valley. So without ever training a generator we can use this method to produce points that are in the valleys of the energy function. I guess people have trained GANs like this; the reason why it doesn't really work in practice is that this procedure will just produce adversarial examples for the discriminator, and those usually look nothing like data: if you keep the discriminator fixed and run gradient descent against it, what you get isn't qualitatively good. But in principle, if the discriminator were a good energy function describing the data, we could use gradient descent on it. And the same goes up here: in order to find a good label for an image, given that we have a good energy function, we could simply gradient descent on the label to find a better one. So in this paper we're going to have a situation where we say: we're given an energy function, and we're given a bunch of inputs, which are called x, a, and w. If I already have my energy function and I'm given two of those three things — any two — I can infer the last one simply by gradient descent on my energy function.
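The "given any two, descend on the third" idea can be shown on a deliberately tiny energy function. Here E is a toy quadratic of my own choosing, not the paper's model, and the exact same descent loop does inference in one case and learning in the other.

```python
import numpy as np

# A toy energy over two inputs and one parameter: zero exactly when y = theta * x.
def E(x, y, theta):
    return (y - theta * x) ** 2

def descend(grad, v0, lr=0.05, steps=200):
    v = v0
    for _ in range(steps):
        v -= lr * grad(v)
    return v

# 1) Inference: theta and x are known, find the y the energy function is happy with.
theta, x = 3.0, 2.0
y = descend(lambda y: 2 * (y - theta * x), v0=0.0)
assert abs(y - 6.0) < 1e-3

# 2) Learning: x and y are known (training data), find the parameter theta.
x, y = 2.0, 6.0
theta = descend(lambda t: 2 * (y - t * x) * (-x), v0=0.0)
assert abs(theta - 3.0) < 1e-3
```

Which variable you hold fixed and which one you descend on is the only difference between the two cases — that is the whole trick the paper builds on.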
Because I know the energy function outputs zero when it is happy with its inputs — when all of these things agree, basically — and a high value otherwise, then given any two of those three things I can find a compatible third thing by descending on the energy. And of course, over in those classical machine learning problems, the task was actually to learn the energy function, which is parameterized: we want to learn its parameters. The same holds in our general case. If we are given the three inputs but not the parameters of the energy function, and our training data set guarantees that these inputs are compatible with each other — so the energy should be low — we can simply gradient descent on the parameters of the energy function. So in a sense there are four things: the three inputs, and the parameters of the energy function. If we're given any three of those four, we can gradient descent on the rest, and that's going to be the basis of this paper. Now, the x here is going to be the so-called state, and the state in this paper is a set of entities — the little circles that you're going to see. Each of those entities has an x position, a y position, and I believe a color, so R, G and B. The concatenation of all of those attributes is one big vector, and that is your x, your state: the entities and their attributes. Then a is going to be an attention mask over the state: if you have four entities, a will have four entries telling you which of those entities you should pay attention to right now. And w is going to be a so-called concept vector, the embedding of a concept. What a concept is here is very general, but I can give you an example: one concept is "are the entities that a pays attention to close to each other?" So say we have two entities on which a has a high value, this ball up here and this ball down here. If the concept vector is the embedding for the concept of being close to each other, then the energy function will be very happy if those two things are close to each other, and very unhappy if they aren't. But in the very same situation — the same x, the same attention mask, just a different concept, a different w vector — the energy function might be very happy if the two things are far apart and unhappy if they are close. So the question is always: how are the three things that you put into the energy function compatible with each other? And given all but one of them, you can infer the missing one. Let's say you have a perfect energy function for this situation — you're just given the energy function and you can trust it — and you are given the x, the state, which I'll draw down here, and you are given the w, where w is the embedding (a vector in the embedding space) for a line, the geometric concept of a line. Your task is to find a, the attention mask that makes the energy function happy. As you can see right here, what you would do is put a lot of weight on this, this, this, and this ball, and no weight on that ball, because those four make a line. And since everything here is differentiable — the state is differentiable, the attention is differentiable, and the concepts are vectors, so they're differentiable too — you can use gradient descent to find that. Another example: you're given again the same w, the line concept, and this following state, and now you're given the attention on these three entities, and you're asked: please find the x, the state that makes this energy function happy. This here you would call the starting state x0, and your task is to find x1: how do you have to change this state such that the energy function is happy? The answer, of course, is to push this ball inward until it is in the middle of the two others, so that the three form a line. You don't have to do anything to the ball up here, because there is no attention on it — the concept only applies to the things you put attention on. Then the attended state is in agreement with the concept and the energy function is happy. Okay, we have covered the basics, now let's dive into the paper. I think this is the longest introduction ever, but I think it will pay off. So the authors — or this author, I think it's a single author — identifies two different things that you can do with an energy function (of course you can do more, as we saw), and they are these two. First, you are given the initial state x0 and an attention mask, and you want to find x1, the state that satisfies the concept and the attention the most; the author calls this generation. As you can see here, the four things that you have attention on are pushed around until they make a square, because the concept right now is "square". In the other case, you are given this x0 and x1 — just call both of them x right here — together with the concept "square", and you are tasked with finding a, the attention mask. Of course you're going to put the attention on these four right here, and that is going to happen through gradient descent.
Again, we're not learning a model that gives you that attention, like in a GAN, where we learn a generator that just one-shot gives it to you. What we're going to do is gradient descent on our smooth energy function to find the attention mask that satisfies it. So this is the difference right here: gradient descent is part of the output procedure of the model. Usually we just use it to learn, and we learn a one-shot model, but here gradient descent is part of the model itself. So they introduce the energy functions here and say, okay, we can have a policy on x: if we're given a concept w and given an a, we can find x's that are compatible by running gradient descent. You see, there is an x_{k-1}, and we run gradient descent on the energy function with respect to x to find a better x that satisfies the energy function given those inputs. And the same if we want to find an attention mask: we run gradient descent on the attention mask, again in order to satisfy the same energy function. So the inputs are the same both times — the concept here can be "square", and here it can also be "square" — the difference is what we run gradient descent on and what we keep constant. And I would actually add a third line here, because we can also infer a w if we're given an x and an a, and that's going to be an integral part. If I have this situation right here, and say I have attention on these four, I can ask the model: given x and a, infer w. And the model should ideally output "ha, this is a square". Now the model isn't going to output the word "square"; it is going to output a vector representation of square, a vector of numbers, because that's how we've trained it — w is the embedding. But what we can then do later is say: okay, I'm not going to tell you it's a square, you just come up with a vector w that describes this situation. And now I'm going to take that vector w that you came up with, mister or missus model, and show you a new situation, this one right here. I give you the x and I give you the w that you yourself output, and now please tell me what's the a. And the model is then supposed to say: oh, these four here are the a. So without ever telling it that it should be a square, you can let the model infer a w from one example situation and then transfer that w to a new situation. You can just say: whatever concept I had up here, please apply that same concept — which is the w — down here. And this is the entire paper; this is concept learning through energy-based models. So that is the third line I would add down here: you can infer a concept vector if you're given the x and the a. In order to do all this, their energy function is going to be a so-called relational neural network. You have a simple neural network, a multi-layer perceptron, that always connects two entities to each other together with the concept vector, and then, I believe, a sigmoid that connects the attention masks of the two, and you sum over all pairs of entities in your model and send that through an MLP again. The details are not so important; what's important is that they can feed the entire situation — the x, the a, and the w — into a neural network, and the neural network comes up with a number saying how well those three things fit together, and then you can transfer these concepts. That's pretty cool. Now, the only question is — and we've always just said we're given the energy function — of course, this is a neural network.
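A forward pass of such a relational energy function might look roughly like this. The weights here are random and untrained, and the exact architecture details (the gating, the hidden sizes, the nonlinearities) are my guesses rather than the paper's; the point is only the shape of the computation: a pairwise MLP over entity pairs, gated by the attention, summed, then mapped to one scalar.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(z, 0.0)

D, W_DIM, H = 5, 8, 16                            # entity dim (x, y, r, g, b), concept dim, hidden dim
W1 = rng.normal(size=(2 * D + W_DIM, H)) * 0.1    # pairwise MLP (random, untrained)
W2 = rng.normal(size=(H, H)) * 0.1                # output MLP
w3 = rng.normal(size=H) * 0.1

def energy(x, a, w):
    """Relational energy: sum an MLP over all entity pairs, gated by attention."""
    n = len(x)
    total = np.zeros(H)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            pair = np.concatenate([x[i], x[j], w])
            total += sigmoid(a[i]) * sigmoid(a[j]) * relu(pair @ W1)
    return relu(total @ W2) @ w3   # one scalar: how well do x, a, w fit together?

x = rng.normal(size=(4, D))   # 4 entities
a = np.zeros(4)               # attention logits
w = rng.normal(size=W_DIM)    # concept vector
assert np.ndim(energy(x, a, w)) == 0   # the whole situation maps to one number
```

Because every piece is differentiable, the same function can be descended on with respect to x, a, w, or the weights, which is exactly the flexibility the paper exploits.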
The neural network has parameters, and we don't know what good parameters are at the beginning, so we need to train this thing. And again, the reason why these are toy problems — I mean, we'll get to why it's computationally hard — is that this is kind of a new field in machine learning, I believe. At least, I come from classical machine learning, where we only ever use something like SGD to train, and we only produce models that one-shot produce something. Here, I believe, we have a new concept, where you use gradient descent as part of the output, and that makes a lot of trouble, and that's why we work on toy problems. So this here is the situation I described: you have a demo event, where you're given the x and the a and you're supposed to infer the w. The question is: what's the w? The model will come up with a w, and you're not going to do anything with it directly; you simply take that w and say, well, here is a so-called test event, please apply the w you came up with in this test event, and find me the a that satisfies that w together with the x I give you here. And of course, with the a right here, as you can see — you don't even know that it's a square, and the actual concept here is "move the grey ball to the middle of the square". But no one told me this, I just looked at the picture. The correct answer here would be to place attention on those four things and then take this thing and move it to the middle, right here. Now the question is: how do you train something like this? This is the loss function right here: they give you a concept and an initial situation, and you're supposed to infer x1 and a, and the loss is simply the negative log likelihood of that. But what does that mean? We'll make it easier. If you have this procedure right here, where this up here is the demo event and this down here is the test event: how are you going to learn the energy function through this procedure? Well, this entire procedure, this entire thing, is one training sample. Usually we have an input and a label, but here it's much more complicated. We have an input, okay, that's this x and this a, cool. But then we have SGD as an integral part of the procedure to determine the w. Now, we could just apply a loss to the w, but we don't, because we don't know what the embedding space for the concepts should be. We could maybe train a classifier, but in this case we want to train the ability to transfer concepts, so our training sample needs to be one instance of transferring a concept. So SGD, for one, is part of our process here. And not only that: this x here is of course also part of our training sample — write this up here as x0 and this here as x1 — and now we need to find this a, this attention mask, and that is an SGD procedure again. Remember, inferring anything through the energy function is a gradient descent process. So ultimately, our one training example consists of: x0, the a at the beginning (let's call that a0), the SGD procedure to find w, x1, and the SGD procedure to find a1, the output a. And that will give us the output a1. So this here is our input — in classical machine learning this would be our x — and this here would be our label y, and that's what we train on. We train such that the output right here — this is of course the y-hat, what we predict — matches the label. And for the training sample, we just write a little generator that makes this situation and knows what the concept is: it will say, okay, I'm going to make an example for a square, then it will make the attention mask for a square, and then it will make the new situation, again with a square, but it will not tell us the attention mask there; it makes that attention mask into the true y.
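The demo-then-test transfer can be sketched with a drastically simplified stand-in, where the "concept space" is a single number (the target distance of the attended pair) rather than a learned embedding, and the energy is hand-coded rather than learned. The two inner gradient-descent loops mirror the two inference steps inside one training sample: infer w on the demo, then apply that w on the test.

```python
import numpy as np

def attended_dist(x, a):
    # Distance between the two attended entities (a is a hard mask here).
    i, j = np.flatnonzero(a)
    return np.linalg.norm(x[i] - x[j])

# Energy: happy when the attended pair sits at the distance the concept w encodes.
def energy(x, a, w):
    return (attended_dist(x, a) - w) ** 2

# --- Demo event: we see x and a, and infer the concept w by gradient descent ---
x_demo = np.array([[0.0, 0.0], [3.0, 4.0], [9.0, 9.0]])
a_demo = np.array([1.0, 1.0, 0.0])
w = 0.0
for _ in range(200):
    w -= 0.1 * 2 * (w - attended_dist(x_demo, a_demo))   # dE/dw
assert abs(w - 5.0) < 1e-3   # the demo pair is 5 apart, and w picked that up

# --- Test event: transfer w to a new state, infer x1 by descending on positions ---
x = np.array([[0.0, 0.0], [1.0, 0.0]])
a = np.array([1.0, 1.0])
for _ in range(500):
    d = attended_dist(x, a)
    u = (x[0] - x[1]) / d              # unit vector between the attended pair
    g = 2 * (d - w)                    # dE/dd
    x = x - 0.05 * np.stack([g * u, -g * u])

assert energy(x, a, w) < 1e-6          # pair pushed apart to distance ~5
```

Nobody ever names the concept; it travels from demo to test purely as the inferred number w, which is the miniature version of the paper's transfer setup.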
So at the end we can compare the attention mask our model output — without it ever knowing that it should be a square — with the true label, which comes out of the generator that at the beginning decided it should be a square, and the distance between those two is our loss. This is an enormous procedure to get a loss, and most crucially, you have to backpropagate through optimization procedures, and this is something we just can't really do yet in our models. If you take an image and a ResNet-50, right now we do one forward propagation to get a label. In this procedure, if you had to backpropagate through the optimization procedure for each sample, you would basically need to backpropagate through 50 forward passes of the ResNet, if your optimization procedure is 50 steps long, and that is just not feasible right now — that's why we don't do it. But I believe that maybe, once we find a smart way of backpropping through optimization procedures, a whole lot of these things will become the new wave in machine learning. I'm excited about it — I'm pretty sure it doesn't fully work yet, and this is very fiddly work, but I'm excited by the prospect that we can do this. So this is the training procedure: you're given x0, x1 and a, and you optimize in order to infer the concept behind it. The generator of your training data knows the concept — it had a concept in mind when it generated this — but you're not telling your model what the concept is; it needs to infer that. Then, using the w that the model inferred, you can either give it x0 and x1 and infer a, or you can give it the x and the a and infer x; you can do either of those, and these are called identification and generation respectively. Then you compare the output to what the generator at the beginning had in mind — it's not telling you directly, because that's the label.
You compare this to that, and that will be your loss to train your energy function parameters. So for your training samples: if you think of this entire thing as one forward pass of the model, then it's just classic machine learning — you have a training sample, which is one forward pass, and you have a corresponding label that you infer. So let's jump to the experiments, which are actually pretty cool. What they've done is, for example, take the concept of being far apart from something. Being far apart means the little x needs to be as far away as possible from the ball that has the attention on it. If you do generation and you start the little x right here and ask the model, please infer the next state of the world, it will push that little x away, right here — and in color you can see the energy function valleys for the position of the x. So it pushes it away from this thing. But if you take the same concept embedding, the embedding of being far away, and you don't do generation but identification, which means you infer the a, then it will simply tell you that this ball right here is the one furthest away from the x. So you can do all sorts of things like this. And transferring concepts — I find this here pretty interesting. They have two different concepts: one concept is "red" as an identification, you need to identify the red ball; and the other concept is you need to turn something red, you need to take a ball that is maybe blue — and of course you can gradient descent on the colors — and make it red. Since the energy function just takes the three inputs x, a, and w, you're not going to tell it which situation you are in; it has to create this w embedding space through learning. And if you train it with those two concepts, it will put the "make something red" concept and the "is something red" concept in the same place. This is a PCA, and in blue, I think, are the attention codes for identifying the red things, and in red the generation codes for making something red, and they are put in the same place, which is pretty cool — it means that the energy function really learns the feature of something being red. I find this pretty neat. Then they have some experiments where they basically show that we need this gradient descent optimization procedure, because only after many steps will the energy function basically be aligned with the concept that you want. So if you have a zero-shot model, just one forward pass, you'll see that the energy function that is supposed to make a circle from samples — this is the example concept right here — just cannot do it; at least here it doesn't learn to one-shot produce it. Only if you optimize for a few steps will it get there. So you optimize at inference time, and that seems to be very important. You can see again here demonstrations of this: the example is this, and then, as you can see, after 20 steps the model has optimized the points to go to these locations, whereas after only one step it hadn't done that yet. So there are complex things at work here. And this column here is where you don't have a relational neural network, so you can't capture dependencies between things — you have no chance of making a square, because you don't know where the things are in relation to each other. But that's more of an engineering question. Their point is basically that models that do an optimization at inference time are much more powerful than models that just do a one-shot forward pass. It's sort of like an autoregressive model in NLP versus a non-autoregressive model that produces all words at once: if you produce all the words of a sentence at once, no word can depend on any other word, so you can only produce independent things, which will often make the sentence not make any sense.
They also have this KL objective, which is a regularizer — I believe they built it in through trial and error — but it is a regularizer and I don't really want to go into it. And then they do a demonstration re-enacted on a robot. The demonstration here is that there is a situation where two things have attention on them, and you're supposed to move something into the middle of those two things — that's the concept. You don't tell the robot the concept; it needs to learn it from data, infer that this is the concept you want, and then transfer it to the other environment. Now, you look at this and you see there's this robot environment, but ultimately they still encode the positions of these things and the position of that, and really all you have to do differently here is that instead of moving this actuator directly, you need to calculate what to do to the individual joints of the robot. So I think this is maybe because it's OpenAI and it needs to, you know, look robot-y and stuff, but the problem here is not really different — it's not real-world transfer or anything. So let's go through some of the things they can learn with this. You can see here they can learn these geometric shapes and so on; on the left is the example event that the model needs to take the concept from. Now this, I believe, is very much identification: what they did is train with a data set where all of these appear — there are squares, there are lines, there are circles. So this is maybe my criticism here: it is not so much generally inferring a concept, it is more like identifying the concept. The model basically just needs to decide, is this a line, is this a circle, or is this a square, because those things were in the training data set. It would be nice to see how this generalizes to truly novel concepts, or whether we can even have zero-shot concept inference and then transfer those concepts to other things — maybe that's already happening, I don't know.
So here, the spatial arrangement is to either be close to something or to be between two things — if the attention is on two things, you want to go in between. The top ones are the demonstrations; the model needs to recognize the concept and then basically optimize to fulfill it. Then shapes: making shapes — yeah, there's a triangle. Again, I believe this very much relies on recognition, and not on an actual understanding of what a triangle is. Here you have proximity, being closer or being farther apart. What else is cool? Oh yeah, here's the recognition version of the same task: you need to identify the ball that is closest. And here you really see the optimization procedure in action: at the beginning of each sequence you kind of see the attention being everywhere, and then stabilizing to one or two points. If two points are equally close or far apart, you'll see the attention on multiple points, which is pretty cool — it means the model really learns this concept. Here's counting, quantity: you can have one, two, or "larger than three" — it seems like they tried three and four and it didn't work, so they just said, we'll do "larger than three". And here is this robot thing, where it also always needs to move in between — this is the part that I'm not really impressed with, but, you know, whatever you want. Okay, I hope this was a good introduction to energy functions, what you can do with them, what I think of them, and to this paper. It is a pretty cool paper; yes, it only works on toy problems so far, but I believe this is one interesting direction for future machine learning, and something yet to be very much explored. If you like this content, please subscribe, tell all of your friends about it, share, and I'll see you next time. Bye bye!