DeepMind Reinforcement Learning

**The Power of Generative Query Networks**

These features aren't hand-coded; the network learns to detect them. The generation network is then asked to predict, or "imagine", a scene from a previously unobserved viewpoint, given both that viewpoint and the scene representation created by the first network. The generator essentially learns how to fill in the details given the highly compressed, abstract representation created by the first network, inferring likely relationships between objects and regularities in the environment.

**Understanding the Relationship Between Two Networks**

I'd liken the relationship between these two networks to the relationship between a crime-scene witness and a sketch artist. The witness remembers fragments of a criminal: their height, their hair color, their choice of Linux distro. The sketch artist must discern the full picture of the criminal from those few details, inferring the other likely traits based on what the witness provides.

Put more formally, the algorithm first collects a set of different viewpoints from the training scene. Each viewpoint is an image, and each is fed sequentially into the representation network, a convolutional neural network, the architecture best known for image classification tasks. An image is a matrix of numbers, and through a series of matrix operations the convolutional network continually transforms that input matrix; the result is the representation. It creates as many representations as there are viewpoints, then performs a summation over them to produce a single scene representation, r. This representation is then fed to the generation network.

For the generation network they used a recurrent neural network, since recurrent networks are capable of processing sequences of data. During training, recurrent networks aren't just fed the next data point in a dataset; they are also fed the learned state from the previous time step, which is what gives them recurrent knowledge of the past. They learn from what they've learned before, allowing for a contextual understanding that incorporates time into their predictions.
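The encode-then-sum step above can be sketched in a few lines. This is a minimal illustration, not the paper's architecture: the real representation network is a deep CNN, so the linear-plus-tanh `encode` function, the 8x8 image size, and the 16-dimensional feature size here are all stand-in assumptions chosen only to show how per-viewpoint representations are aggregated by summation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(view, weights):
    """Stand-in for the representation CNN: maps a flattened image
    to a small feature vector. The real GQN uses stacked convolutions;
    a single nonlinear layer is enough to show the aggregation step."""
    return np.tanh(weights @ view)

# Three observed viewpoints of one scene, each a flattened 8x8 "image".
views = [rng.standard_normal(64) for _ in range(3)]
weights = rng.standard_normal((16, 64)) * 0.1  # hypothetical encoder weights

# One representation per viewpoint, then an order-invariant summation
# produces the single scene representation r described in the text.
per_view = [encode(v, weights) for v in views]
r = np.sum(per_view, axis=0)

print(r.shape)  # (16,)
```

Because summation is commutative, the resulting r does not depend on the order in which viewpoints are observed, which is a convenient property for an agent exploring a scene freely.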

**The Role of Latent Variables**

Since they wanted an agent that could predict the next frame in a sequence of 3D environment frames, they needed a sequence model. The generator network also used what's called a latent variable to mathematically vary its output. The generator produced a likely image for a given viewpoint, and that generated image was compared to the actual viewpoint; an error value was computed from the mathematical difference between the two images. Both networks were then updated using that error, via the popular backpropagation technique, so that they would be a bit more accurate on the next training iteration. This optimization strategy meant the representation network and the generation network improved together over time as the agent navigated whatever environment it was in, making this an end-to-end approach.
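The compare-and-update loop can be made concrete with a toy example. This is only a sketch of the error-and-gradient-step idea: the real GQN generator is a recurrent latent-variable model trained by backpropagation, whereas here the "generator" is a single linear layer with hand-derived gradients, and the representation, target image, and learning rate are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: r is a 16-dim scene representation and the
# "generator" is one linear layer producing a flattened 8x8 image.
r = rng.standard_normal(16)
target = rng.standard_normal(64)   # the actual viewpoint image
W = np.zeros((64, 16))             # generator weights to learn
lr = 0.05

for step in range(200):
    generated = W @ r                           # "imagined" image for the query view
    diff = generated - target                   # pixel-wise difference
    error = np.mean(diff ** 2)                  # scalar training error
    grad = np.outer(diff, r) * (2 / diff.size)  # gradient of error w.r.t. W
    W -= lr * grad                              # gradient-descent update

print(round(error, 6))  # error shrinks toward zero over the iterations
```

Even in this stripped-down form, the loop mirrors the structure described above: generate, measure the difference against the real viewpoint, and nudge the weights to reduce that difference next time.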

**From Simple Environments to Complex Ones**

They first trained it on a few simple 7x7 square maps with a few objects in them. Over time, it rapidly learned to predict what an entire map looked like. So they gave it a more complex maze instead, and over time it learned how to represent that as well. At first it was a bit uncertain about some parts of the map, but with more observations, and by more I mean only five total, its uncertainty eventually disappeared almost entirely.

**Deep Reinforcement Learning**

They then wanted to use it to control a robotic arm to grab a colored object in a simulated environment. Why? Because YOLO (no, not the algorithm). Deep reinforcement learning is a combination of deep learning, i.e. learning a mapping, and reinforcement learning, i.e. learning from trial and error in an environment. It's been behind some of the big AI successes of the past few years, like AlphaGo, a notorious deep Q-learner. The idea is that the AI agent learns a policy for playing a game directly from the pixels of the game frames, with no hints as to what the objective of the game is or what the controls mean.
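The trial-and-error half of that combination can be shown in miniature with tabular Q-learning. This is a deliberately tiny sketch: a deep Q-learner replaces the table below with a neural network that reads raw pixels, and the five-cell line-world environment, rewards, and hyperparameters here are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# States are cells 0..4 on a line; reaching cell 4 ends the episode
# and pays +1. Actions: 0 = move left, 1 = move right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(200):
    s = 0
    while s != 4:
        # epsilon-greedy: mostly exploit the best-known action,
        # occasionally explore at random (the "trial and error").
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        reward = 1.0 if s_next == 4 else 0.0
        # Q-learning update toward reward + discounted best future value.
        Q[s, a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

# The learned greedy policy moves right from every non-terminal state.
policy = np.argmax(Q, axis=1)
print(policy[:4])
```

No one tells the agent what the goal is or what the actions mean; the policy emerges purely from the reward signal, which is the point the paragraph above makes about learning from game frames.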

**The Power of Data Efficiency**

The problem with this approach is that it requires a very long training time to converge to good results. So they conducted an experiment where they first trained the GQN to learn how to represent observations of the environment, then used its learned representations as input to a policy algorithm that learned how to control the arm. The representation encapsulated what the AI saw, the arm's joint angles, the position and color of the object, the colors of the walls, in a much more compressed way than the raw input pixels. Because of this, the approach was substantially more data efficient, requiring only a quarter of the training time that a raw-pixel version would require. Very impressive indeed. GQN is exciting because a major limiting factor on what it can do is computing power; given enough of it, who knows what kind of amazingly detailed environments it could generate. And that is exciting for anybody, designers, artists, engineers, scientists, who could use a tool to help them visualize and create things.
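A quick back-of-the-envelope calculation makes the compression argument concrete. The specific sizes below are assumptions, not figures from the paper: a 64x64 RGB frame, a 256-dimensional scene representation, and a 9-action control set are plausible but hypothetical numbers.

```python
# Hypothetical sizes to make the compression argument concrete:
# a raw 64x64 RGB frame versus a GQN-style scene representation.
raw_pixels = 64 * 64 * 3        # 12288 numbers per observation
representation = 256            # assumed compressed scene-vector size

n_actions = 9                   # hypothetical arm-control action set

# A linear policy head over each input: the parameter counts differ
# by the same factor as the input sizes, one intuition for why the
# compressed representation needs less training data.
params_raw = raw_pixels * n_actions
params_rep = representation * n_actions

print(params_raw // params_rep)   # the raw-pixel input is 48x larger
```

Fewer inputs and fewer parameters generally mean fewer samples needed to fit them, which is one intuition (though not a proof) for the data-efficiency gain reported above.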

**Three Key Takeaways**

DeepMind's Generative Query Network learned how to perceive and interpret an environment without labels. It consists of a representation network, which encodes image frames, and a generation network, which generates them based on those representations. And it did surprisingly well, requiring only a fourth of the training time for a deep reinforcement learning task that a raw-pixel-focused algorithm would require.

AI is never boring. If you want to stay up to date with the latest advancements in machine learning and artificial intelligence, this is an article worth reading. The Generative Query Network is a powerful tool with the potential to transform many fields, including computer vision, robotics, and more.

"WEBVTTKind: captionsLanguage: enif deepmind hired me i just reveal all their secrets so hello world it's suraj and deepmind just dropped a very impressive paper called neural scene representations and rendering their AI is capable of rendering an entire 3d environment from just one or a few input images and to make it even more impressive it learned how to do this without any labels it just learned from data it obtained itself from exploring different 3d environments I'll explain how it works in this video since there are a lot of applications for this technology for example companies like Google and GM are spending billions of dollars on research and development for self-driving cars if a self-driving car incorporated this technology into its stack it would give it yet another signal of what to expect on the road after training on millions of hours of dashcam footage it could create a reliable 3d map of what's likely to come up on its path adding to its predictive decision-making capability and thus further reducing the risk of an accident also creating a quality game is no simple task often it requires many hours of practice with a whole host of tools to begin the process of creating an intricately detailed 3d world this technology could allow anyone to generate a 3d game world from a simple 2d drawing or even a real-life photo letting them iterate on their ideas much faster and these 3d worlds could be used not just for gaming but for virtual and augmented reality the AR versions could let engineers design different versions of a component much faster acting as a design tool so the researchers had a straightforward goal create an AI that could understand any given scene if you or I were placed into any given scene be that a savanna in Africa or Savannah Georgia we would immediately perceive and interpret everything around us we would observe where the nearest Wi-Fi was whether it was day or night based on where the Sun was positioned what type of animals 
rounded us like Rafiki where the nearest village might be based off of footprints our brain with Lawren to form representations of the environment that would support not only classifications of what we see but also motor control memory planning imagination and rapid skill acquisition all related to the environment all of this without any teacher telling us these things we would do it ourselves David Marr one of the originators of the field of computational neuroscience suggested in his influential book vision that there's likely a generative process in the brain that gives us this incredible ability one that doesn't require supervision in contrast the current most popular computer vision technique is to use deep convolutional neural networks on big labeled image datasets to learn how to interpret images this label centric or supervised approach requires an intensive process of humans manually labeling images then having a neural network learn the mapping between the input data and the output label only then can it classify what it sees in an image but even then this approach doesn't give the AI the ability to really discern what it's seeing and in what ways different objects in a scene relate to each other so the researchers decided to create an AI that internally represented the environment it was placed in more like we would - our constant existential crises the idea then is to give this AI some input data ideally to the images of a scene that it's placed in essentially what it sees along with its position in that scene and use that data without any kind of label has the training data set luckily they had an open-source game engine ready called deep mind lab that they made that could be used as this training data sets it's a collection of 3d environments that can be used to train an AI agent in including a simple square room that you can easily vary floor and wall textures of in a more complex randomly generated maze but it seems like a simple seven by seven 
square room would be a good starting point for training this AI they called their AI the generative query network or gqn it consists of two parts a representation network and a generation network the representation network takes as input what the agent observes inside of the 3d environment essentially a 2d image frame the representation network then outputs a representation of that input the idea is that this representation will capture the most important elements of the scene like the position of objects colors and the room layout in a compressed way it will learn to detect those features they aren't hand coded then the generation network will be asked to predict aka imagine a scene given both they previously unobserved viewpoints and the scene representation created by the first network the generator essentially learns how to fill in the details given the highly compressed abstract representation created by the first Network inferring likely relationships between objects and regularities in the environment I'd like in the relationship between these two networks - the relationship between a crime scene witness and a sketch artist the witness remembers fragments of a criminal their height their hair color their choice of Linux distro and the sketch artist must discern the full picture of the criminal based on a few details inferring the likely other traits based on what they're given by the witness put more formally first the algorithm collects a set of different viewpoints from the training scene each viewpoint is an image each fed sequentially into the representation network which is a convolutional neural network best known for image classification tasks an image is a matrix of numbers and through a series of matrix operations a convolutional network will continually modify that input matrix the result is the representation it will create as many representations as there are viewpoints then they performed a summation operation on the representations to create a 
single representation or R is then fed to the generation network for the generation Network they used a recurrent neural network since they are capable of processing sequences of data during training recurrent networks aren't just continuously fed the next data point in a data set they are also fed the learned state from the previous time step which is what gives them recurrent knowledge of the past they learn from what they've learned before allowing for a contextual understanding that incorporates time into their predictions since they wanted an agent that could predict the next frame in a sequence of 3d environment frames they needed to use a sequence model and the generator network used what's called a latent variable to mathematically vary the output of it the generator then generated a likely image for a given viewpoint and that generated image was compared to the actual viewpoint an error value was computed by computing the difference in these two images mathematically then they updated both networks using that error to be just a bit more accurate the next training iteration via the popular back propagation technique updating the weight values of each this optimization strategy meant both the representation network and the generation network were improved over time at the same time as the agents navigated whatever environment it was in making this an end-to-end approach they first trained it on a few simple 7x7 Square Maps with a few objects in them over time it rapidly learned to predict what an entire map look like so they gave it a more phlex mais instead and over time I learned how to represent that as well at first it was a bit uncertain of some parts of the map but with more observations and by more I mean only five total its uncertainty disappeared almost entirely eventually they wanted to use it to control a robotic arm to grab a colored object in a simulated environment because Yolo no not the algorithm deep reinforcement learning is a combination 
of deep learning aka learning a mapping and reinforcement learning aka learning from trial and error in an environment it's been behind some of the big AI successes of the past few years like alphago a notorious DQ learner the idea is that the AI agent learns a policy for playing a game by learning directly from pixels from the game frames no hints as to what the objective of the game is or what the controls mean the problem with this approach is that it requires a very long training time to converge to good results so they conducted an experiment where they first trained the gqn to learn how to represent observations of the environment then they use its learned representations as input to a policy algorithm that learned how to control the arm the representation encapsulated what da I saw the arms joint angles the position and the color of the object the colors of the walls in a much more compressed way than just using the raw input pixels and because of this they saw that it was substantially more data efficient requiring only a quarter of the training time that a raw pixel version would require very impressive indeed gqn is exciting because a major limiting factor on what it can do is computing power if given enough computing power who knows what kind of amazingly detailed environments it could generate and this is exciting for anybody designers artists engineers scientists who could use a tool to help them visualize and create things three things to remember from this video deep minds generative query network learned how to perceive and interpret an environment without labels it consists of a representation network which encodes image frames and a generation network that generates them based on those representations and it did surprisingly well requiring only a fourth of the training time for a deep reinforcement learning task that a raw pixel focused algorithm would require AI is never boring if you want to stay up-to-date on the field and survive the AI 
apocalypse hit the subscribe button for now I've got to keep reading papers so thanks for watchingif deepmind hired me i just reveal all their secrets so hello world it's suraj and deepmind just dropped a very impressive paper called neural scene representations and rendering their AI is capable of rendering an entire 3d environment from just one or a few input images and to make it even more impressive it learned how to do this without any labels it just learned from data it obtained itself from exploring different 3d environments I'll explain how it works in this video since there are a lot of applications for this technology for example companies like Google and GM are spending billions of dollars on research and development for self-driving cars if a self-driving car incorporated this technology into its stack it would give it yet another signal of what to expect on the road after training on millions of hours of dashcam footage it could create a reliable 3d map of what's likely to come up on its path adding to its predictive decision-making capability and thus further reducing the risk of an accident also creating a quality game is no simple task often it requires many hours of practice with a whole host of tools to begin the process of creating an intricately detailed 3d world this technology could allow anyone to generate a 3d game world from a simple 2d drawing or even a real-life photo letting them iterate on their ideas much faster and these 3d worlds could be used not just for gaming but for virtual and augmented reality the AR versions could let engineers design different versions of a component much faster acting as a design tool so the researchers had a straightforward goal create an AI that could understand any given scene if you or I were placed into any given scene be that a savanna in Africa or Savannah Georgia we would immediately perceive and interpret everything around us we would observe where the nearest Wi-Fi was whether it was day or night 
based on where the Sun was positioned what type of animals rounded us like Rafiki where the nearest village might be based off of footprints our brain with Lawren to form representations of the environment that would support not only classifications of what we see but also motor control memory planning imagination and rapid skill acquisition all related to the environment all of this without any teacher telling us these things we would do it ourselves David Marr one of the originators of the field of computational neuroscience suggested in his influential book vision that there's likely a generative process in the brain that gives us this incredible ability one that doesn't require supervision in contrast the current most popular computer vision technique is to use deep convolutional neural networks on big labeled image datasets to learn how to interpret images this label centric or supervised approach requires an intensive process of humans manually labeling images then having a neural network learn the mapping between the input data and the output label only then can it classify what it sees in an image but even then this approach doesn't give the AI the ability to really discern what it's seeing and in what ways different objects in a scene relate to each other so the researchers decided to create an AI that internally represented the environment it was placed in more like we would - our constant existential crises the idea then is to give this AI some input data ideally to the images of a scene that it's placed in essentially what it sees along with its position in that scene and use that data without any kind of label has the training data set luckily they had an open-source game engine ready called deep mind lab that they made that could be used as this training data sets it's a collection of 3d environments that can be used to train an AI agent in including a simple square room that you can easily vary floor and wall textures of in a more complex randomly 
generated maze but it seems like a simple seven by seven square room would be a good starting point for training this AI they called their AI the generative query network or gqn it consists of two parts a representation network and a generation network the representation network takes as input what the agent observes inside of the 3d environment essentially a 2d image frame the representation network then outputs a representation of that input the idea is that this representation will capture the most important elements of the scene like the position of objects colors and the room layout in a compressed way it will learn to detect those features they aren't hand coded then the generation network will be asked to predict aka imagine a scene given both they previously unobserved viewpoints and the scene representation created by the first network the generator essentially learns how to fill in the details given the highly compressed abstract representation created by the first Network inferring likely relationships between objects and regularities in the environment I'd like in the relationship between these two networks - the relationship between a crime scene witness and a sketch artist the witness remembers fragments of a criminal their height their hair color their choice of Linux distro and the sketch artist must discern the full picture of the criminal based on a few details inferring the likely other traits based on what they're given by the witness put more formally first the algorithm collects a set of different viewpoints from the training scene each viewpoint is an image each fed sequentially into the representation network which is a convolutional neural network best known for image classification tasks an image is a matrix of numbers and through a series of matrix operations a convolutional network will continually modify that input matrix the result is the representation it will create as many representations as there are viewpoints then they performed 
a summation operation on the representations to create a single representation or R is then fed to the generation network for the generation Network they used a recurrent neural network since they are capable of processing sequences of data during training recurrent networks aren't just continuously fed the next data point in a data set they are also fed the learned state from the previous time step which is what gives them recurrent knowledge of the past they learn from what they've learned before allowing for a contextual understanding that incorporates time into their predictions since they wanted an agent that could predict the next frame in a sequence of 3d environment frames they needed to use a sequence model and the generator network used what's called a latent variable to mathematically vary the output of it the generator then generated a likely image for a given viewpoint and that generated image was compared to the actual viewpoint an error value was computed by computing the difference in these two images mathematically then they updated both networks using that error to be just a bit more accurate the next training iteration via the popular back propagation technique updating the weight values of each this optimization strategy meant both the representation network and the generation network were improved over time at the same time as the agents navigated whatever environment it was in making this an end-to-end approach they first trained it on a few simple 7x7 Square Maps with a few objects in them over time it rapidly learned to predict what an entire map look like so they gave it a more phlex mais instead and over time I learned how to represent that as well at first it was a bit uncertain of some parts of the map but with more observations and by more I mean only five total its uncertainty disappeared almost entirely eventually they wanted to use it to control a robotic arm to grab a colored object in a simulated environment because Yolo no not the 
algorithm deep reinforcement learning is a combination of deep learning aka learning a mapping and reinforcement learning aka learning from trial and error in an environment it's been behind some of the big AI successes of the past few years like alphago a notorious DQ learner the idea is that the AI agent learns a policy for playing a game by learning directly from pixels from the game frames no hints as to what the objective of the game is or what the controls mean the problem with this approach is that it requires a very long training time to converge to good results so they conducted an experiment where they first trained the gqn to learn how to represent observations of the environment then they use its learned representations as input to a policy algorithm that learned how to control the arm the representation encapsulated what da I saw the arms joint angles the position and the color of the object the colors of the walls in a much more compressed way than just using the raw input pixels and because of this they saw that it was substantially more data efficient requiring only a quarter of the training time that a raw pixel version would require very impressive indeed gqn is exciting because a major limiting factor on what it can do is computing power if given enough computing power who knows what kind of amazingly detailed environments it could generate and this is exciting for anybody designers artists engineers scientists who could use a tool to help them visualize and create things three things to remember from this video deep minds generative query network learned how to perceive and interpret an environment without labels it consists of a representation network which encodes image frames and a generation network that generates them based on those representations and it did surprisingly well requiring only a fourth of the training time for a deep reinforcement learning task that a raw pixel focused algorithm would require AI is never boring if you want to 
stay up-to-date on the field and survive the AI apocalypse hit the subscribe button for now I've got to keep reading papers so thanks for watching\n"