Deep Learning 4 - Beyond Image Recognition, End-to-End Learning, Embeddings

The Development of Depth Perception in AI Agents

Researchers in artificial intelligence have long worked on agents that can perceive and navigate their environments. One key capability is depth perception, which lets an agent understand the structure of its surroundings and make decisions accordingly. In a recent study, researchers used reinforcement learning to train an agent to find its way out of a maze.

The researchers found that repetition mattered: by having the agent play through the same maze many times, they could develop its depth perception. The process was tedious and time-consuming, however, requiring the agents to repeat the task 20 to 100 times. Even though the game itself was simple, the trained agents navigated the maze impressively well.

To understand this development further, the researchers set up a harder variant in which the agent also had to predict its own position within the environment. This auxiliary prediction worked much better than simply providing the agent with depth information as an input. The reason is that the gradients from the predicted position flow back into the network and help it learn about the structure of the scene, turning raw pixels into a more coherent representation.
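The sketch below illustrates the general pattern of attaching an auxiliary prediction head to a recurrent policy network and mixing its loss with the reinforcement-learning loss. It is a minimal illustration, not the study's architecture: the layer sizes, the GRU memory, the 0.1 loss weight, and the choice of (x, y) position as the auxiliary target are all assumptions.

```python
import torch
import torch.nn as nn

class NavAgent(nn.Module):
    """Recurrent policy with an auxiliary position-prediction head (illustrative only)."""
    def __init__(self, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(            # pixels -> feature vector
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden), nn.ReLU(),
        )
        self.rnn = nn.GRUCell(hidden, hidden)    # memory across the episode
        self.policy = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)
        self.aux_pos = nn.Linear(hidden, 2)      # auxiliary head: predict (x, y) position

    def forward(self, obs, h):
        z = self.encoder(obs)
        h = self.rnn(z, h)
        return self.policy(h), self.value(h), self.aux_pos(h), h

def total_loss(rl_loss, pos_pred, pos_true, aux_weight=0.1):
    """Combine the usual policy-gradient loss with the auxiliary regression loss."""
    aux_loss = nn.functional.mse_loss(pos_pred, pos_true)
    return rl_loss + aux_weight * aux_loss       # aux gradients also shape the shared encoder
```

The important point is that the auxiliary loss backpropagates into the shared encoder and recurrent core, so the representation is shaped by the prediction task even though actions are still chosen by the policy head.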

The researchers also probed what the agent had learned by decoding its position with a non-backpropagated position decoder, a readout trained on the agent's internal state without sending gradients back into it. In videos of the agent navigating both simple and complex mazes, its position could be decoded accurately while it made sensible navigation decisions.
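A "non-backpropagated" decoder of this kind can be implemented by training a separate readout on detached copies of the agent's hidden state, so the analysis cannot alter the representation it is probing. Below is a minimal sketch under assumed shapes (a 256-dimensional hidden state and a 2-D position); it illustrates the idea rather than the study's code.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: the agent exposes a 256-dim hidden state; position is (x, y).
decoder = nn.Linear(256, 2)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

def decoder_step(hidden_state: torch.Tensor, true_pos: torch.Tensor) -> float:
    """Train the readout on detached hidden states so no gradient reaches the agent."""
    pred = decoder(hidden_state.detach())        # stop-gradient: analysis only
    loss = nn.functional.mse_loss(pred, true_pos)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because of the detach, the decoder's accuracy measures what the agent's representation already encodes about position, rather than teaching the agent anything new.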

One interesting aspect of the study is how much depth perception matters for such agents: using depth information during training improved performance on maze navigation. The researchers also found that when the environment changed over time or had a different topology, the agent had to attend more carefully and rely on additional cues to navigate.

The study also considered the role of human insight. The researchers asked people to play the game and gathered observations about their thought processes; the humans reported that visual cues helped them recognise where they were, even when the maze layout was static. When those cues (such as paintings on the walls) were removed, the agent's performance suffered as well.

Finally, the researchers showed that the auxiliary loss led the network to build a representation of the scene's geometry. This is an interesting finding, because it suggests the agent learned to perceive its environment in a way that goes beyond recognising pixels. The study thus gives useful insight into how depth perception can develop in AI agents, and shows how reinforcement learning combined with auxiliary prediction can improve performance on tasks such as maze navigation.

The Future of Depth Perception in AI

One question raised by the study is whether it would be better simply to give the agent depth as an input rather than asking it to make predictions of its own. The researchers found that providing depth as an input worked less well than using it as a prediction target, and they noted that even leaving that input empty did not seem to affect the agent's performance.

The researchers explored this further by trying wall textures of greater and lesser complexity, and found that texture complexity had no significant impact on the agent's performance. This suggests that the benefit does not come from the depth signal itself so much as from what predicting it forces the network to learn about the structure of the scene.
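To make the contrast concrete, here is a hedged sketch of the two wirings being compared: depth supplied as an extra input channel versus depth predicted as an auxiliary target from RGB alone. All names, channel counts, and the coarse resolution of the predicted depth map are assumptions for illustration.

```python
import torch.nn as nn

def conv_encoder(in_channels: int, hidden: int = 256) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.Flatten(), nn.LazyLinear(hidden), nn.ReLU(),
    )

class DepthAsInput(nn.Module):
    """Depth concatenated with RGB as a fourth input channel."""
    def __init__(self, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = conv_encoder(in_channels=4, hidden=hidden)
        self.policy = nn.Linear(hidden, n_actions)

    def forward(self, rgbd):
        return self.policy(self.encoder(rgbd))

class DepthAsTarget(nn.Module):
    """RGB-only input; a coarse depth map is predicted as an auxiliary output."""
    def __init__(self, n_actions: int, hidden: int = 256, depth_cells: int = 64):
        super().__init__()
        self.encoder = conv_encoder(in_channels=3, hidden=hidden)
        self.policy = nn.Linear(hidden, n_actions)
        self.depth_head = nn.Linear(hidden, depth_cells)   # trained with an auxiliary loss

    def forward(self, rgb):
        z = self.encoder(rgb)
        return self.policy(z), self.depth_head(z)
```

In the first case depth only changes what the network sees; in the second, the depth loss contributes gradients that shape the shared encoder, which is the effect the improvement is attributed to above.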

Conclusion

This study offers useful insight into how depth perception can develop in AI agents. By combining reinforcement learning with auxiliary prediction tasks and repeated exposure to the maze, the researchers trained an agent that perceives its environment well enough to navigate reliably. The work highlights the value of depth-related learning signals for tasks such as maze navigation. Questions remain about when depth is better supplied directly as an input, but the approach looks promising for building more capable agents.


"WEBVTTKind: captionsLanguage: enhello everybody I don't have a mic but this room seems good acoustically I hope everybody can hear me yes okay so I'm Ryan Hansel your latest guest speaker during for this course I'm deep mind as I guess all of the guest speakers are for this course I have been at deep mind for about four years and I lead a research group in the deep learning team in particular my research focuses on aspects of continual learning lifelong learning transfer learning so I think that that's actually incredibly important for getting deep learning deep reinforcement learning to work in the real world I also work on robotics and miscellaneous other topics that come up and so I'm going to talk about sort of three topics and then if I have time that I've got a little segment at the end that shows some research that I've have been working on recently which maybe gives you just sort of a fast-forward to just a method the details of a method that's currently out there being published but I'm going to talk a fair amount about topics in computer vision so this is actually a continuation of what Karen Simonian presented I think two weeks ago so forget everything that oriole said last week and remember what Karen said and I will be continuing from that starting with talking about beyond simple image recognition or image classification so I wanted to write so quick overview talking about end-to-end learning as we go to more complex architectures and more complex tasks also doing a little case study of an end-to-end trained architecture and the spatial transformer network then learning without layer labels so how to do how to learn and embedding or manifold if you don't have supervised labels you don't want to use them and then like I said a topic on using reinforcement learning sequence learning auxilary losses together for a navigation problem a maze navigation problem first of all and to end learning it's a familiar term I just wanted to make sure we're on the same page about it somebody tell me what end-to-end learning means someone right so fundamentally we're talking about methods that we can optimize all the way from some input or all the way to an output that we want and that everything in the middle should be optimized together and usually we do this by differentiable approaches such that we can use gradient descent methods to optimize the whole thing at once end to end and I have a little slide that I use when I'm trying to convince people that aren't necessarily into deep learning why and to end learning is important so this is a you know proof via history so in 2010 the state of the art in speech recognition looked like this you started out with for speech recognition I've got an audio signal that comes in and I want to predict text from it that speech recognition and so the state of the art for doing this involved having a nice acoustic model a nice phonetic model and a nice language model all good machine learning problems in of the selves but a modular approach right each of these things were optimized separately but these definitely gave us the state of the art and speech recognition which was not bad in 2010 then in 2010 things changed the state of the art was handed off to a deep neural network that trained the whole pipeline end to end going all the way from the output that we want text back through back to audio and getting an improvement in that so sort of throwing away the understanding of the domain experts that said well first we need to get the the you know we need 
to get the phonemes we need to have the language model we have need to have these different explicit components in 2012 computer vision you know state of the art was something that maybe it was like this obviously different variations of it but it involved extracting some key points in an image computing sift features some other robust feature maybe training a deformable part model and before you get out labels so pixels to labels by this sort of modular pipeline of separately trained models and of course this was exceeded by a lot in the image net challenge using a deep neural network that simply took pixels in output labels again in 2014 machine translation text in text out and this has also the news the state of the art since 2014 has been different flavors of deep neural networks so right now state of the art and robotics looks like this you have your sensors you do some perception on that sense on those sensory streams you maybe put these into a map or a world model then you do some planning and then you send some control actions to the robot before actually producing the actions to me you know I really like robotics and I would love to see this method replaced again with end-to-end learning because I think that it's obvious that there is a potential here to take exactly that say you know to take this domaine do the same thing it's harder for robotics I'm not going to talk about it today but I just like to sort of think about this as a reason for why it's good to learn things and to end do you buy that is that is it a convincing argument all right let's talk about beyond imagenet classification and so one thing that we can do so Karen talked about how to train convolutional neural networks to solve a sort of image net type of problems I believe I hope that's what he talked about yes and so let's make the point first about pre training so training big models on big datasets takes a lot of time it can take several weeks on multiple GPUs to train for imagenet maybe not anymore used to take several weeks still takes takes a while and a fair amount of resources but the network trained on a big data set like image net should be useful for other data especially if we have similar classes or a similar domain but actually people have shown that they can take a network trained on image net and use those features and use that for a wide variety of new types of problems and in different domains and that's really I think the exciting thing about the image net work both the data set and the approach so how do we make use of a trained model so we train our model the trainer our big neural network we then plug it into another network and we train whatever layers we have that that were not replaced using these pre trained layers and then we can take that keep those pre trained weights fixed or we can slowly update them so this is a simple process so train step a train the confident on image nut which produces as the output a thousand dimension imagenet class likelihood vector we keep some number of layers there sometimes all of them whatever that model is sometimes only some of them out of that layer and we we initialize a new continent using some of those pre trained layers and then we can say well I've got a last layer maybe got a couple last layers in this case the output for detection might be a twenty one dimensional class likelihood for Pascal vo C I guess that's not it's not detection but classification and so we just retrained that last layer that can speed things up dramatically and it can 
also be provide actually a better result especially if you don't have enough data in that new new data set all right let's look at a couple of other image recognition tasks image classification just says there's a person there's a sheep there's a dog or in the case of image net its I always find image net strange because you take an image like that and the desired output label is simply dog so it just it just outputs a single single layer and you know throws away anything else in the image so image classification is is fairly blunt let's think about harder tasks so we might want to do object localization or detection so that means we actually want a bounding box around different things basically that's saying I want a classification of what the object is and also a bound around where it is and implied in that is that it means that if there are multiple sheep for instance we would want to identify all of them semantic segmentation definitely quite a bit more challenging because here we want pixel wise labels so we want to be able to have a pixel wise boundary around the different elements in the scene an instant segmentation is we actually want to know where things are different we don't want to just know sheep no sheep a B C D and E so object detection with consonants the a popular approach that was used sort of initially is just to say well detection is just a classification problem everywhere in the image so let's just sweep a sliding window across the whole image in all positions preferably at all scales as well and we'll just feed each of those individual bounding boxes into a classifier which will say yes or no for all of the different classes this is actually not quite as bad as it sounds it's bad if you do it naively it can be done sort of it can be done with a little bit more optimization so that it's not horrible but it's it's just not great you end up with the same object gets detected multiple times so you would get sort of multiple 20 different detections of the person with with jitter around it and you also so yeah you get that the same object gets detected multiple times and you also are sort of assuming that you have just a fixed number of sizes of bounding boxes and aspect ratios instead you could say well I'm just going to directly predict the bounding box so there you say where's an object let me just regress four numbers the coordinates of the box so you can just directly use a mean squared error loss and say I want to regress the pixel coordinates of the top left corner in the bottom right corner for instance and this is not as it actually works it's sort of a strange thing to ask a neural network to do at least I've always thought that but it sort of works sort of a problem though the number of boxes is unknown and it doesn't work as well as other approaches so and the the last sort of general method for doing object detection is to predict is to take some bounding box proposals which might come from a trained Network and say for each of those proposals of where there might be a bounding box let's classify if there's actually an object there or not let's look a little bit more as to what that what that looks like and then those proposals get passed through a classifier and we can decide if they're actually if there's actually something there or not so and and this provides something that looks a little bit like a tension so instead of looking at the whole network I'm gonna first use one classifier to say here's some candidate places to look now I look just those places 
more closely and I decide I refine that bounding box and I say yes or no what sort of object is in there and this is a lot faster because we're not exhaustively considering the whole image there's no reason to sweep a window across a big field if of empty space or a big blue sky we immediately sort of home in on possible possible objects and so this is I'll talk for a couple slides about faster our CN n our CN n stands for region CN n and this has gone through a couple of iterations in the last couple of years with people coming up with refinements on the basic on the basic approach so we start with convolutional layers that are pre-trained on imagenet and then there is a proposal stage where we have a one network here that looks at the feature layer the feature layers of the confident and says I'm going to produce a number of proposals and these are bounding box box estimates then you fill in those bounding boxes with the actual details and send it to a classifier and that classifier is going to refine the location like I said and decide what class of object is in there if any and also maybe do some region of interest pooling if you have sort of multiple detection x' in the same area yes yes you can so the so you I will actually go to the next slide because it offers a little bit more details here as to what's actually happening here and there might actually be and I thought there might be a little bit more details on the actual equations but let me let me talk about this maybe then it's it's a little bit more clear so what we do start with a so we have a convolutional feature map and we slide a window across there but it's sort of a big course window for each position of that window we actually look at a selection of what we call what are called anchor boxes and these offer a variety of different aspect ratios the sort of a bunch of templates that says let's look at these different course sort of shapes in the image and for each of these then we're going to we're going to predict whether or not there are these different boxes with these different anchor boxes with respect to this anchor location and we're going to predict if the proposal contains an object so let's see here so we take this sliding window and we take the anchor boxes and we're actually considering the content in there and that's what makes this differentiable is that we still have a sliding window approach we're just considering a limited number of different options and then we go through an intermediate layer that's got you know 256 hiddens there and we can produce two things here one is for each of those anchor positions then what what are the scores with regard to whether or not there's an object there and then whether and then a refinement on the coordinates and one of the most important things here this looks similar to the other approach but the important thing here is that everything is in reference to that fixed to this sliding window location it's anchored there so when we are predicting what the new coordinates are and how to refine that bounding box then it's then it is relative to this central position and that makes the neural network a lot more able to scale and makes it truly translation invariant across the entire image space which is important otherwise if you're asking the neural network to produce information about whether or not this bounding box should be moved to you know pixel 687 versus 689 then those aren't numbers that neural networks work with very well with a lot of precision and so this is 
used instead of this approach well I'm not sure exactly what this cartoon is supposed is supposed to show but I think it shows that we're producing proposals separately and here we're not we're really just considering the we're refining this this this method of scrolling across the entire image space it's a little bit more like a very structured convolutional layer because it's looking everywhere at these different aspect ratios and to further improve the performance we can always because this is differentiable then we can back prop all the way through to this to this to the convolutional stack and to those feature layers and make them a little bit better a little bit more sensitive which is important sometimes when we get at the end of training on imagenet we get these sort of we don't get crisp locations if we want to get bounding boxes we need crisp locations in the image so it can be useful to pre trade and you get a little bit different feature representation that way all right next let's talk for a moment about semantic segmentation so semantic segmentation means that we are going to label each pixel in the input image into one of the different object classes that we have available and the the usual way this is done is to classify each pixel for each class so that you end up getting the output of your system it's going to be a full resolution map so the same resolution as the input image and but with the number of channels that's equal to the number of classes you have and each each of those layers is a is a binary mask that looks for different that looks for each different class or has a likelihood in it it hasn't been thresholded and so one of the important things here when we consider doing semantic segmentation using a convolutional network is that what happens at the end of a convolutional network we have pooled and we have you know sort of lost lost resolution so that we end up with something that's very semantically Laden at the end of a say an image net continent but we don't anymore have any spatial resolution so going back to that the full resolution input size is sort of the trick to to be done here so one way to do this is to use the different resolution preserving building blocks that current Auk Taback Uppal of weeks ago so to reverse the pooling then we can do a transposed convolution which deconvolute s-- or up samples and we can also replace regular convolutions with a dilated convolution with a stride of one let's see how that works eh we can look at a like I said in the usual I guess this is a vgg net in a usual Network we would have the input resolution and then as we go through this through the layers of this network then we lose spatial resolution as we add feature layers so the output here would be 21 different layers but we don't know any longer have any or much of any spatial resolution so I've got 21 different layers representing 21 different classes and I've got a probability for each of those layers whether or not there's an object there but I've now I am far too coarsely sampled far too high of a low of a resolution so one way to do this is to simply say well I'm going to get to that point and then I am going to up sample and I'm going to use a transposed convolution and that's going to increase the spatial resolution and I'm going to get back to the scale the the resolution that I want or stop stop somewhere in the middle at some intermediate scale this does not work that well as you might guess why because you're going through a bottleneck here where 
you're losing a lot of the information about where things are so really what you're going to end up with if you train this is what we have here you get blobs they're nice blobs but they're not really what we're what we're looking for so we basically have that semantic information as to what objects are there but we've lost the positions so one way to deal with this is to say I like the information that's here that tells me what classes are wait what classes are in the image but I need to know where they are so that information should still be there in a previous layer so what I'm going to do is to combine this this representation with a skip connection from here and I'm going to bring these two together and so this would I would have to do a 32x up sampled prediction but if I have combined together a previous layer with the current confident and learned that combination right so this can be a learned connection here I learned fully connected usually a linear layer then what I can now do now I've got a space that has the semantic information has more resolution and I can just do a 16x up sample to get something more like this or obviously I can repeat this I can say let me have actually information from further back in the architecture when I had even more higher resolution information and bring that together and be able to now have a representation with features more resolution now I can just do an 8x up sampled prediction of the actual mask and get a better result at the end yes to be honest let's see here we are doing a we are taking this information and we are doing a 2x up sampling of this so just repeating information every 2 by 2 right and then we are combining it with we're adding another layer that is the that's just a copy right so now we would have two layers and these are of course many feature layers here but I've got one that I've simply copied the information to up sample at once and I've added in another layer that gives the features then I can then this up sampled this transposed convolution that I'm doing here has more to work with it's got both information at both layers the course semantic information and also that and then if I just keep on doing this that two by two then I get reasonable this notion of using of having an architecture by the way that has a bottleneck which is very useful for learning an abstract semantic disentangled however you want to call it learns a good representation of the data but combining that with a skip connection it's very powerful architecture and it's you see this theme coming up in different types of work so you guys looked at auto-encoders yes so there you see that you have a similar sort of thing even though that is a would be trained in a different way is an unsupervised approach but where you want to start at some resolution of your data you want to learn a representation that gets sort of narrower and narrower goes through a bottleneck and then you want to go back to some sort of a back to a finer resolution so you add these skipped connections right so I've got this bottleneck architecture and then these skip connections that help carry that information through the other side so it's just sort of a I find that there's a lot of different sort of applications of neural networks that end up I often don't want the small amount of knowledge I often want something a little bit bigger and using skip connections to lower layers can be very helpful and of course this is the one of the idea the the idea that grounds residual networks is 
having that residual or skip connection yes yeah I mean I also have the question is to you know why not why not go and step forward and I'm I'm not positive as to whether or not they did those experiments for this paper now the there is a nice work a nice paper by Jason use insky called do deep neural networks really learn to transfer I think it's a question Jason knows in ski and it's just a really nice sort of examination of if I train this big long network then where do I want to actually draw knowledge from when I use this for a different task and and as you would expect you do get that down here I've got you know the the the information is is very specific to the data to that problem and they're the features are very sort of low-level the features being low-level means that it will transfer very easily at a higher level it tends to become more sort of problems specific I think I said that the wrong way the earlier layers are more general the later the later you go in the network then the more sort of problem specific you get all the way to the layer where you're actually classifying a particular set of classes there aren't any sort of magic answers but it does give an interesting insight to this and some interesting experiments on it for the most part with this sort of an architecture as with most deep learning then there's a lot of empirical you know experiments to say is it more useful to draw you know from pool 3 or pool 4 or both I guess you'd have to from both alright yes so the classification is going to be present there but not explicitly here it is explicitly they're a class label a class likelihood over each possible class so here obviously the information is there but what you get is a little bit more attention to the details for instance if one of your classes is a person then at this level you're going to get clearly yes there is a person in the image but back here you're going to get I see an arm I see another arm I see a leg I see another leg and together that information gets put together so but of course you could end up putting together that I see a leg and a leg and an arm and an arm and at the end say it's actually I don't know a doll or a robot or something like that at the highest level right so you give you let the highest level make that final decision at the level of the class you're trying to you're trying to predict the lowest level can say but if there's a he an arm here then this is how I want to segment it and an arm here so make a decision and then come back down again what does I think what we do I'm going you know if you ask a kid to outline things in an image or adults as you do for all of these labelled data sets there's adults out there that are grad students I don't know he's sorry and you're saying that you don't think that the the training on this would generalize too right if that's all you had as your what you're training what you're learning from then it would be really hard to solve the problem if all you had was for instance crowds of people wherever where everything was was overlapping but luckily that's the point of data sets being big enough that they capture lots of different things and sometimes you will get humans that are you know prototypical and sometimes you're going to get more muddled scenes and more noise things mislabel etc what we sort of rely on when we train these things is that given enough data we are going to get enough of a learning signal to be able to learn to learn what we want and indeed these these approaches work 
surprisingly well what they don't do well on is you know you don't think that there's an example here you look at scenes where you have a people are interested in doing semantic segmentation I think that the number one reason why people want to do semantic segmentation in sort of an industry is for autonomous cars because everybody wants to take a street scene and understand it you've got lidar you've got other sensors but what you need to know is is that a pedestrian and are they about to cross in front of the car or on you know understanding different aspects of the scene so you want to do scene segmentation there and understand these different classes and you look at the results that people have on these sort of big street scenes with cars and buildings and street signs and people and bicyclists etc and they do really well on parts of it but then parts of it will will be really really poorly done whereas humans can take just a couple of pixels in that image given the context of the whole scene and say yeah that's a you know telephone pole that's not a that's not a human so I'm both impressed where this this sort of work has managed to get in the last 10 years or so but also there's still a lot more work to go so another way to do this just look ahead another way to do this is to instead of using a regular convolution at all you avoid the problem of having the reduced resolution by using throughout the network you can use a dilated convolution and so this means that you are removing your pooling altogether and you are replacing convolution with a dilated convolution with a step size of one if you think about how that works you're actually going to end up with you have this broad receptive field meaning you're looking at a larger part of the image to make a decision which is one of the reasons why we do pooling is so that a convolutional filter can look it can have a receptive window that's broader on the scene so that you get more high-level information but instead we can sort of say well I'm just going to look at every other pixel but I'm going to move this thing slowly across the whole image right so this allows you to have the same receptive field but have no decrease in the resolution as you go through the layers of the network and this gives you it's sort of a simpler architecture because you don't need to worry about getting back up to the full resolution and it also gives higher accuracy because now you're really a little bit more just directly training for the type of information that you want you're saying simultaneously give me precision pixel level precision and give me high level information and let the network weight sort of work out how to do this from this sort of a structure does that make sense all right and video classification with confidence I went to cvpr which is the biggest conference that looks at it does computer vision that works in the computer vision domain and I was still struck by how many papers were there that we're focusing on single images the world is not presented to us as single images why aren't we working on video but there's still a lot of work being done but we do have means of doing classification and segmentation and all of these sorts of problems in videos so here's a few different ways to ways to do this first of all starting on the Left we can just say well I'm going to process one frame at a time right I'm going to pretend that my video is just a set of images that may or may not be related and I'm just going to run a continent through every 
single one of them this is sort of like doing a sliding window approach to do detection using a classifier network this is sort of exhaustively looking for dogs for instance by considering separately every single frame this is inefficient but moreover it doesn't work well because the whole point of considering multiple frames is that you can build up your certainty over time when I see just a couple pixels of a light post and I'm trying to figure out if it's a light post or a person down the road then I want to see you know more information just getting a few more few more samples can help me make that decision so another way to do this is I'm going to run my classifier over all of the images but then I'm going to train a layer that's going to so imagine we're training the same or we're running the same convolutional net work independently over each frame in this video sequence but then at the end I'm taking a I'm taking the outputs of across all of those frames and I'm training one network one layer at the top to say there's a there's been a dog scene there's been a human scene something like that so late fusion early fusion let's instead take advantage of the fact of the let's use the neural network to reason about multiple images at the same time so instead we feed in a block of images so instead of my input being RGB now I've got n different images stacked up and my convolutions then can go in my confident can go across the image space and sort of XY but they can also go through the time direction as well so just a simple extension on your standard convolutional Network and everything is exactly the same it's just now my input is a block of images rather than a single one and we can call this an early fusion model this means that my network all along the way obviously I would need to fine-tune this but along the way I would be able to make a better decision because I'm looking at those features and motion at a lower level another approach would be slow fusion and the idea here is that I'm going to do some of both I'm going to run independent feature extractors confidence over each individual frame but then I'm going to in the middle start to put these together I will point out this is for vid video classification but we consider exactly the same gamut of different options all right yes I'll take the question first does which approach work so they do work the the thing that sort of that's not great here is that all of these approaches assume a fixed temporal window that I'm going to consider right they all assume that you know for instance that ten frames is good enough to detect everything so that means that you're not going to be able to see a glimpse of the tail of a dog and then the head of a dog you know 20 frames later and be able to say I saw a dog so and you can always construct a case where you want to have a wider temporal window or where a narrower one would be better this is exactly the motivation for using a recurrent neural network instead which is probably what Oriole talked about last week maybe or did you talk about text yeah okay yes right so you can definitely use a dilated convolution in overtime and be able to get a much better a much better field of view temporally in exactly the same way that we want to might want a broader field of view over the the image space and this is that's what's used for for instance pixel net or wave net pixel net wave net these approaches they take they process a nut pixel net because that would be pixels wave net is an approach that 
does that does speech generation or audio generation and it learns via dilated temporal convolutions exactly I the the slight tangent that I wanted to mention is that we're talking here about video about a single modality but this is the this is the same gamut of different approaches that we consider any time when you have two modalities so if I've got say audio and video which is honestly what we should be looking at not even just a video we sort of understand the world through the media of audio and video now you've got these two different modalities how do I understand those do i process them completely separately and then at the end put them together and try to solve my problem you know do do speech recognition or some problem from there or do i somehow fuse together these different sensory modalities early on we do the same thing in robotics if I've got a robotic arm then I want to be able to both process my image of that hand moving as well as the proprioception which means what is the joint position you know my knowledge of how what my hand what my joints are doing their velocity but so tactile information right I've got these different sorts of information coming in how can I combine those what is the best way and I think that this is an extremely interesting question because you can really you you can come up with arguments for any of these different approaches and with and without recurrent networks and etc there isn't a best answer but I think that there should be a little bit more principled you know research and understanding of why to use these when how to use the different architectures interestingly all right quick tangent to in the brain they used to think and I'm saying this from a colleague of mine I am NOT a neuroscientist but I was told that they that neuroscientists used to think that there was late fusion of different sensory modalities in the brain so the way in which we process audio the way we process vision whatever else then those get fused sort of at the end so there's the independent things they've and that was because you have your visual cortex you have your auditory cortex etc and the two are relatively separate just recently they've discovered actually there's all of these pathways that go in between so that maybe looks a little bit more like this or like this but with lateral connections here so there's some separation there different dedicated processing but then there are all of these pathways of connections that allow the two to communicate so that you can get feedback early on in the cortical processing between what I'm hearing and seeing and touching etc which makes sense quick example of doing of a specific means of doing a processing video the idea here is that we want to use two sources of information one being the motion and one being the visual information the idea is is that maybe if we prot if we understand sort of process these separately that we can get a better results better accuracy of doing action recognition I'm pretty sure that this was for action recognition fact I'm sure it wasn't so we trained or this is actually from Karen in andrew zisserman you trained to confidence and one of them is going to have as its input a stack of images and the other one is going to have as its input a single image and what you're going to try to do here is you're going to hit you're going to train this with a loss that tries to predict the optical flow and you're going to train this one with that is predicting I don't remember my guess is that you're 
predicting here the that it's pre trained using image net and then we've got a neural network layer fully connected layer that brings that has its input the two output layers of these two different sort of pathways and unifies them and comes up with the signal single classification of what type of action is this okay that's the end of that section maybe let's do the five minute break now and then I can jump into the next section so this is a paper from a couple years ago from Max yata Berg deep mind and just to motivate it let's think about convolutional neural networks they have pooling layers why do they have pooling layers because we want more invariance translational invariance right we want to be able to have the same activations we want to pool together activations over broader areas or rather sorry convolutional layers give us you know translational invariance some amount of it and pooling sort of accommodates different spatial translations to give a more uniform result and make the learning easier so pooling does two things pooling increases the the field of view for the next layer and says now I'm looking at information over a bigger projection onto the original image so that I can make a more a higher-level feature detection but it also acts to say whether I saw the arm here or here or here or here it's still an arm so it works in concert with the convolutional operator which is able to do to give a uniform detection across different areas and it pools that together so that it just has representation of yes there was an arm I don't care where it was but the this this nice system only works strictly you know only works for for translations and there's lots of other types of transformations that we're going to see in particular and a visual scene and it's hard coded into the architecture as well so various people have come up with cool architectures where now the weight tying is instead of just having a convolution this way then you also have weight tying across different rotations and there's different ways to do this it gets a little bit ugly though right but the usual thing is to just say well if I want to detect M nist digits you know that are turned upside down or faces that are turned upside down I'm just going to learn on a lot of data so that the basically then you need to learn to recognize fours versus 7s when the right-side up and when they're sideways and when they're upside down so you're making more work because you have anything that's in your in your architecture it's that's innate that will accommodate these sort of transformations so let's learn to transform the input instead yes exactly exactly so this is done routinely a called data augmentation where I'm going to introduce some variations to my data so that I can learn across that obviously it makes the learning harder ok now I've got a confident that yes it recognizes rotations and and and this is and this is still the the standard approach and a wise thing to do this offers a different complimentary approach that's sort of interesting because it's a way of tailoring what sort of in variances you want to your actual problem so here's the here's the the challenge I'm given data that looks like this different orientations wouldn't it be nice if I had a magical transform that recognized what sort of or what sort of transformations there are in my input space and got me to a canonical pose such that my Convenant has a little bit less work to do that's true and that you're right that you know the these low level 
the first layer of Gabor filters what look like Gabor filters are extremely general and have all of the rotations in them and so but the problem is is that in order to recognize this versus this I need different filters to do that so I would need the entire gamut of different filters which we have but that's to recognize that that's to recognize different different types of things I mean the problem is not at the first level it's somewhere in the middle when I start wanting to put these together to recognize a7 and I'd much rather be able to assume that a 7 always has a point going you know in the same orientation if I have to recognize that particular little V feature its distinctive of a7 in all orientations I have to have that at the next layer as well so just having all rotations of my Gabor filters the first level doesn't give me the rotational invariance at the next level or at the highest level alright so if we were to make this differentiable then ideally we'd be able to learn a transformation there that would be useful for exactly my data type rather than saying externally I want to build in rotation invariance what if I don't know exactly what sort of problems there are and my data or what the canonical pose that would be most useful for faces we have an idea for other types of data who knows I just know that I've got a lot of data it's got a lot of different variants and is there some way of making this a little bit more homogeneous so that when I apply my confident it doesn't have to work quite so hard so that's the idea of learning tea learning something that will transform this to make it understood we can think about this more generally this goes back to your question those first level Gabor filters are pretty useful in pretty general already may I want to just keep those here maybe I just want to have something that I can insert between two layers to say take this input take this output from one layer of my processing and transform it before you feed it into the next layer and learn how to do this then you might get that this transformation here is not very useful so you would hope to just learn an identity there this might be the useful one where I would want to get some sort of rotation for instance to a canonical pose so this is the convolutional network in a nutshell the idea is that again I'm imagining that this is planted between two other layers of processing so I have some input you previous feature layers what I want to do is to first predict theta so these are the parameters for some set of transformations that parameterize is this function tau which is my grid generator that's going to produce a sampling grid the output of this is an actual sampler which is going to take the features in you and turn it into my input into the my next layer processing V illumination is fairly I mean sure yes you could illumination you're right you could get a better normalization than you would get through the bias correction that you get sort of for free and a convolutional Network the bias correction that you get and a convent net is going to apply to the whole to the whole feature layer so you're right you might get a nice sort of normalization of of it if you could do if you could do something a little bit different often illumination does get handled fairly well by a confident already as long as it as you don't train it on dark images and then try to test them outside of the set so this relies in order to make this differentiable then we want to have a components which are 
differentiable so here we consider these three components like I said first we have something which is called the localization net which is going to be a trainable function that's really doing all of the work here this is the thing that it's where we're actually optimizing parameters that's going to map our features to our trans transformation function the grid generator is going to take those parameters that have been inferred from my current input theta and create a sampling grid and then the actual sampler is going to do the do the transformation and feed it in today and feed into the next layer so this is simply I am going to use r6 because we'll start out by just thinking about affine transformations there are different types of that that's the one thing that you do need to define is you do need to say what is my transformation function that I am going to be embedding here the rest of it the actual parameters of it what that transformation is is going to be determined by this localization that per each image that comes in which is why we're not just applying a general rotation to all the images but each one is going to be rotated or transformed separately but one could use the same approach and have many different types of functions embedded there so we have a trainable so first of all the localization Network is a trainable function that's going to act on the outputs of you and produce theta and so our forward pass just looks like normal inference with a neural network of your choice with the neural layer of your choice some sets of weights right second component the grid generator so this is parametrized by the theta which we have inferred and this generates an output map and so we're going to have a output we're going to take our parameters theta and we're going to produce something that is in the same size the same dimensions of our of V of where we are going into and then the last piece of sorry still in the the grid generator so this is the forward pass that we would use for affine transforms to be specific about it the the six estimates of theta that give that rotation translation scale and so we can think about this as being that the output of the grid generator is a sampling grid that says for each component in that for X s and YS then I'm going to index into my input map to figure out where I should sample that to get my new my new output and the sampler is going to the sampler is going to actually do the sampling to fig to apply a kernel function to say where in how am I going to get go from u to V based on the mapping given to me with XS YS and the forward passed there then looks like this general formula which uses some kernel function and then we can look at this for a particular transformation in her particular sampling sampler such as bilinear interpolation and that gets me to a specific formula for what that sampler is going to look like for the forward pass and as long as this is differentiable with respect to x and y then you're good and I think it is for all kernels right so now we need to go in the opposite direction so we've looked at our forward pass localization network the grid generator the sampler and that creates the new input and then we proceed onwards as we're coming backwards through the network we want to first back propagate through that bilinear sampler to get the derivatives of V with respect to with respect to U and x and y right so here we've got the gradients going this way the gradients going that way and so this is the derivative of V with 
Now we need to go in the opposite direction. We've looked at the forward pass: localization network, grid generator, sampler, which together create the new input, and then we proceed onwards. Coming backwards through the network, we first backpropagate through the bilinear sampler to get the derivatives of V with respect to U and with respect to x and y. So here we have gradients flowing along both paths: this is the derivative of V with respect to U going through the bilinear interpolation, and this is the derivative with respect to x_i (y_i is the same). This has discontinuities in it, so we use sub-gradients, because of the max. Next we backprop through the grid generator, the function tau, because we want the derivatives of x and y with respect to theta, the output of the localization network. Last, we backprop through the localization network in the usual way: it's just neural layers, a set of weights, a bias, maybe a non-linearity, and that gives us the gradient back at U, so we can continue to backprop through whatever else sits below this point in the network stack. Really this is just a matter of choosing components from the start that are reasonably differentiable, even if there are discontinuities, so that they produce those gradients.

Let's take a look at how this works; maybe this video is working — I don't have any control over it. All right, the video actually started earlier, so what are we seeing here? That's an affine transformation being used there. OK, I can step through it. There were a bunch of experiments done on this, almost all with MNIST, although not all. The idea is to try different types of transformations, such as a thin-plate spline versus an affine transformation, as the chosen space of functions, and see how it does. What we see first, on the left, is the input. Note that the only signal we train on is predicting what the digit is; we're just trying to predict whether it's a five, so it's up to the network to produce the spatial transform. As I said, this could just end up being the identity, and that is exactly what you get if your inputs are relatively well normalized and centred. But if you start moving them around, then what is learned is a transformation such that the output after the spatial transformer is quite stable — not completely stable, but stable enough for the rest of the convnet to do a better job — and that's the important take-home here.

This was also used for MNIST addition. Now two channels are fed in together, and there are two different spatial transformers: one learns to stabilize channel one, the other learns to transform channel two. In this case the only thing we train on is the sum of image A plus image B — three plus eight, say — which makes it a harder problem and demonstrates that you can still learn through this more complex architecture and get something reasonable. Then lots of moving things, and more moving things with noise. I can't move past this... there we go.

To the question: yes, you can; I don't remember exactly, it's a good question. Obviously six versus nine is a little bit of a problem; I'm not sure whether they constrained it not to be a full rotation for that reason. That would be a problem for you as well; I'd point out there's no magic here. And to the question of why we need a kernel at all: because otherwise, if you don't use a kernel to sample the input, what you get in the output has holes in it and is less accurate.
That's why, if you're sampling an image into some warped transform, you need to use a kernel: at the very least, bilinear interpolation gives you something smoother than nearest-neighbour, which in turn is smoother than just taking the single targeted pixel. No, it's just about retaining more of the information content. Imagine that my transformation is zooming in: then I'm sampling heavily in one area, the gaps between pixels get blown up and distorted, and the result doesn't look smooth. It's the same thing you see with different sampling options in an ordinary image-processing program: you get quite different results with different sampling kernels. And no, the only learned part is theta, in the localization net; the rest is just turning the crank, machinery put in place that we can backprop through. The sampling is what actually transforms the output of U into something normalized that we feed into V — and no, not sampling in the stochastic sense. All right, are we good to go? It's a nice paper if you want a good read on this method; I enjoy the spatial transformer paper.

All right: learning relationships. Rather than learning classification or regression — sometimes that's not the goal — sometimes we just want to know similarity and dissimilarity. For instance, sometimes we want to do visualization, in which case it's not really interesting to do classification; we want to know how a whole set of data is related, so for the purpose of visualization I might want to infer relationships between things: if I understand x and y, how do they relate to each other? I might want to do clustering, or I might want to understand something fundamental about the underlying structure of a dataset. These are all cases where it may not be helpful, or even possible, to use a standard supervised learning approach like classification or regression. Fundamentally, they all amount to taking something high-dimensional and producing something low-dimensional that still captures the properties of the data we care about. People are quite loose in their terminology here, but this is generally called learning an embedding of the data, or the manifold of the data.

One way to do this, which people often use if they've trained an ImageNet network for classification, is simply to take off the classification layer and say: there's my embedding, there's my feature space — maybe a hundred-dimensional, maybe higher — and I'm just going to call that the manifold of the data. That works in some cases and you might well do it, but you may not have any training labels, or you might want to generalize to new classes, in which case having trained that classifier doesn't really make sense. A different way is to think not in terms of supervised learning, where we associate each input with a label, but in terms of embedding losses, which associate inputs with each other. The fundamental idea: pull similar things together and push dissimilar things apart in that manifold space. These two merely look similar, and we want them to be separate; we don't want them mapped together.
If you looked at pixel-wise similarities, those two would be, maybe not nearest neighbours, but pretty close in pixel space; we want to learn a space where they are far apart. These two are both buses, and we want them to be mapped together. That's an example where we're actually using the label of the object — we could just do supervised learning on these — or we could use those labels in other ways, and there might be other reasons why we have information about which things are similar and which are different; I'll come back to that in a moment.

So how do we design losses for this, if I now just have a relationship between two inputs rather than a label I'm trying to predict? Typically all of these approaches involve distances in feature space, often an L2 distance. One example is the contrastive squared loss. Here I take two inputs x_i and x_j together with a label, but it's a similarity label: it says either these are similar or these are different. The loss function says: if y_ij = 1, meaning the pair is similar, I want them to be close in my feature space, and I pay a quadratic cost for them being far apart. If y_ij = -1, meaning they are dissimilar, then I pay a cost for having them close together, and I push them further apart up to a margin. If you don't have that margin m on the space, you end up trying to push things infinitely far apart and it explodes; it isn't well constrained. So this is the contrastive squared loss: two different penalties depending on which kind of pair I have, either pulling points closer in the space defined by my function f or pushing them further apart. I train my network with this, and it works a bit like an energy-based system where I keep rearranging my feature space until these two kinds of constraints work out. To the question: the x-axis is the distance in feature space, the Euclidean distance between f(x_i) and f(x_j), and the y-axis is the cost, the loss.

This can be trained using what has been called a Siamese network: I have an identical copy of f, I pass x_i through f, I pass x_j through f, I compute the Euclidean distance between them in feature space, and then I backprop through the network; since the two sides share weights, both get updated. We can use different losses; this one uses a cosine similarity loss, where the distance d is the cosine similarity between f(x_i) and f(x_j). There is some work that compares and contrasts these losses for similarity; I've honestly forgotten the outcome, except that method C, the next one, worked better — I don't remember the comparison between the first two I showed.
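As a concrete reference for the contrastive squared loss and the Siamese setup just described, here is a minimal sketch in PyTorch; the margin value and the mean reduction are my own illustrative choices.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_i, f_j, y, margin=1.0):
    """Contrastive squared loss on a pair of embeddings.

    y = +1 for similar pairs (pull together); y = -1 for dissimilar pairs
    (push apart, but only up to the margin).
    """
    d = F.pairwise_distance(f_i, f_j)                 # Euclidean distance in feature space
    similar = (y == 1).float()
    loss = similar * d.pow(2) + (1 - similar) * F.relu(margin - d).pow(2)
    return loss.mean()

# Siamese usage: the same network f (shared weights) embeds both inputs,
# e.g. loss = contrastive_loss(f(x_i), f(x_j), y); loss.backward() updates f once.
```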
The third formulation, which I'd say is the most common now, is the triplet loss. The idea is that I take three inputs: an anchor x_a, a similar example x_s, and a dissimilar example x_d, and all I know is the relative information that x_s is more similar to the anchor than x_d is. What you do is push and pull the system — train your weights — so that you go from the first arrangement to the second. You're never trying to pull two elements completely together, and you're never trying to push two elements infinitely apart; you're just saying, relatively, given three things, pull one closer and push one further away. This works very well: it's nicely balanced in general, training is stable and doesn't explode, and the loss function is just what you'd expect — you pay a penalty if the distance to the dissimilar example isn't sufficiently larger than the distance to the similar one, with a margin there as well.

How are these used? One interesting fact is that essentially all of the face recognition systems out there these days use this approach. Why? People used to do face recognition by saying: I'm going to take a hundred people, take a bunch of photos of each one, and classify each of those people; I recognize each person by name, by ID, and the way I tell whether two photos show different people is whether they come out with different IDs when I run the classifier on their faces, or the same ID if they're the same person. This is a problem when you have lots and lots of people: Facebook has far too many people for anything like that; using a classifier simply doesn't scale. Instead, all you want to know is: given two images, are they the same person or different people? So you train an embedding space in this way, and then all you need to look at is the distance in that really nice feature space you've made, which tells you the likelihood that two images are of the same person. I no longer need to learn explicitly from the IDs — obviously the IDs are still there, but you don't need to classify over them.

For instance, this is from FaceNet — DeepFace is, I think, the best one currently, though that may be a little out of date, and FaceNet is also very good. These are all images of the same person, and they are all taken as nearest neighbours of one image in the feature space. So if those are all nearest neighbours of one point, you've learned something that is really robust to all the different ways a person can appear. To the question: yes, similarity by degree — you get that to some extent with the triplet loss, but you could definitely take the original contrastive squared loss and, instead of a binary similar/dissimilar label, use a continuous one, and that simply changes how it behaves; a couple of papers have done exactly that.

On the other side of it, this is how well it works — and these are the false accepts. Each of these pairs was incorrectly matched by DeepFace, or rather FaceNet: it thought the two images were of the same person, and clearly they're completely different people — although, yes, we would make most of the same mistakes. These facial recognition networks are now significantly better than humans on the same problem, though that's on a benchmark dataset; I think humans still do better when you actually know the person.
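For reference, here is a minimal sketch of the triplet loss as it is commonly written (the margin value, and whether distances are squared, vary between papers; this is one standard form, again in PyTorch).

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_s, f_d, margin=0.2):
    """Triplet loss: the anchor should be closer to the similar example
    than to the dissimilar one, by at least the margin."""
    d_pos = F.pairwise_distance(f_a, f_s)   # anchor <-> similar
    d_neg = F.pairwise_distance(f_a, f_d)   # anchor <-> dissimilar
    return F.relu(d_pos - d_neg + margin).mean()

# f_a, f_s, f_d are embeddings of x_a, x_s, x_d from the same weight-shared network.
```

PyTorch also ships a built-in `nn.TripletMarginLoss` that implements essentially the same thing.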
The people we know — colleagues, family, friends — we still, luckily, beat the convnets at robustly recognizing from images; but if it's just a dataset of strangers, then we lose.

All right, I have maybe fifteen minutes left; I was going to run through something I worked on, if that sounds all right. It probably uses a bunch of things in deep RL that you haven't covered yet, but yes. And yes — that's one of the cool things about this: there are a lot of different ways of getting those relationship labels. You can take lots and lots of images and say that if two objects appear next to each other, then those two things must have some similarity, and if I never see two things together, they should be distant in feature space; then you get something that groups together office stuff versus outdoor stuff, and so on. Somebody also used this for biological data they didn't understand, from tests being run on cancer patients: they didn't know what the structure was, but they did know which readings came from which individual patient, so they simply grouped them together — if readings come from the same patient, they should have some similarity. I think it was two different tests that weren't obviously correlated, and they uncovered a previously unknown structure in different types of cancer just by saying there is a relationship between these readings because they come from the same person. You can also use temporal information: in streaming video, frames that are close in time should be more similar than frames that are further apart, and then you get what are often called slow features — a very different kind of feature, highly invariant over short timescales and transformations. So it's a very broad area; you can do a lot of different things with these approaches.

All right. I like navigation; navigation is a fun problem — we all navigate, I navigate, you navigate. I wanted to make a problem in a simulator where I could try different deep reinforcement learning approaches, and I started working on this at a time when DeepMind was just working on Atari; I really wanted to go beyond Atari in terms of interesting problems. Navigation in mazes has the property that if you can look at the maze from above, you can solve it, but if you're looking at it from inside, it becomes much more challenging because you only have partial observability of the domain: I need to remember things over time, and I need to learn to recognize structure just from visual input. So I worked with my colleagues at DeepMind: we developed a simulator to procedurally produce these mazes, and we made up a game. I start somewhere in the maze — anywhere — and I try to find the goal; if I find the goal I get plus 10 points and I immediately get teleported somewhere else and have to find my way back again, repeating that as quickly as I can for a fixed episode length. Wander around the maze, find the goal, get teleported elsewhere, find your way back again, get teleported, find your way back: that's the goal.
There are also some apples lying around; these help get the learning off the ground. We found out later they're not necessary, but they're there because we initially assumed we would need them to start the learning process. We can look at different variants. We can have a static maze and a static goal, where the only thing that changes is where the agent gets spawned — where you start and where you get teleported to. Or the maze layout is fixed but the goal moves on every episode. Or everything is random all the time. The inputs are the RGB image plus my instantaneous velocity in an agent-relative coordinate frame, and the actions involve moving sideways, forwards and backwards, rotating, and looking around. We look at a few different mazes: a large one that takes about five minutes per episode, almost 11,000 steps, so a longer time span and a bigger maze; and the little one we call the I-maze, where the goal only ever appears in the four corners and the agent always starts somewhere in the middle. There you know exactly the behaviour you want: the agent should methodically check the four corners, and once it finds the goal, go straight back there every time. You can see from these traces, which are after learning, that it finds the goal and then returns there again and again throughout the episode. So that's the problem in a very quick nutshell.

We have sparse rewards. We can train an agent on the game I presented using fairly standard deep reinforcement learning methods, but it's slow, it's hard, and it's very data-inefficient. We discovered that we could substantially accelerate learning by using auxiliary losses: instead of only trying to maximize reward by learning a value function and updating the policy, I also, at the same time, predict simple things about my environment using supervised (or unsupervised, depending on how you want to call it) learning. We decided to try depth prediction and loop closure prediction, and I'll say in a moment what loop closure means. First let's look at the architecture we used. The input is RGB images fed through a convolutional encoder with three layers; then we add a two-layer LSTM, because we need memory for this task — I have to get to the goal and then remember where it was so I can get back to it efficiently — and two layers turned out better than one; and we have a skip connection from the convnet, since skip connections are useful general tools and it helps the learning. We can add some additional inputs to the system: the instantaneous velocity, as I said, in agent-relative coordinates — how fast I'm moving laterally and rotationally — plus the previous action and the previous reward. We train this using A3C, asynchronous advantage actor-critic, which you will know about by the end of this course if you don't already: a policy-gradient method where a k-step advantage function is used to update the value and the policy. And the thing we are really interested in here is the auxiliary tasks.
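As a rough sketch of the shape of this agent, here is an illustrative PyTorch module; the layer sizes, the extra-input dimensionality, and the omission of the conv-to-LSTM skip connection are all my simplifications for brevity, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class NavAgentSketch(nn.Module):
    """Conv encoder -> stacked LSTM -> policy and value heads (A3C-style)."""
    def __init__(self, n_actions, extra_dim, hidden=256):
        super().__init__()
        # Three-layer convolutional encoder on the RGB frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden), nn.ReLU(),
        )
        # extra_dim covers velocity, previous action (one-hot) and previous reward.
        self.lstm = nn.LSTM(hidden + extra_dim, hidden, num_layers=2)
        self.policy = nn.Linear(hidden, n_actions)   # actor
        self.value = nn.Linear(hidden, 1)            # critic

    def forward(self, frame, extras, state=None):
        # frame: (1, 3, H, W); extras: (1, extra_dim); state: LSTM hidden state.
        feat = self.encoder(frame)
        x = torch.cat([feat, extras], dim=1).unsqueeze(0)   # (seq=1, batch=1, features)
        out, state = self.lstm(x, state)
        out = out.squeeze(0)
        return self.policy(out), self.value(out), state
```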
The first auxiliary task is depth prediction. The environment actually provides RGB-D — depth in image space, how far away things are — but we do not give depth as an input; instead we try to predict it, either with an MLP on the convolutional features or off the LSTM. We predict that depth channel only as a subsampled, coarse prediction of the depth of the image. We also experimented with loop closure prediction, which is just a Bernoulli loss: at each time step, predict whether I have been at this place in the maze before during this episode. That one comes off the LSTM, because it needs memory. Then we add a position decoder, which we use only as a decoder — we don't backprop gradients through it; it's just there to ask whether we can read the agent's position out of what it is thinking, like a little stethoscope on the network. I'm going to skip the rest of this plumbing; it's in the paper if you're interested. We combine all of these losses — the auxiliary losses and the RL loss — by simply adding them together and backpropagating.

To the question: there aren't that many places you can attach these things. One of the main open questions was whether this would work at all: this is clearly something about the visual system, and we knew from related work that it could accelerate learning, but we didn't know whether it would work to have the gradients pass through two LSTMs to reach the visual feature layers. It turns out that works very well. And since there aren't many different attachment points, it's relatively easy to test the effect of each.

All right, the different architectures: on the left, (a) the plain vanilla A3C, feedforward, no memory; (b) we add an LSTM and make it recurrent; (c) what we call Nav-A3C, with the additional inputs, the additional plumbing and the additional LSTM; and (d) the last one, where we add the auxiliary losses, the auxiliary tasks. So how does this look on a large maze? I won't show the video yet. These are learning curves over a hundred million steps of experience in this maze. This is what we get with the vanilla agent without memory: it can't really solve the task; it can learn to be a speedy explorer, but it can't remember where it is and get back to the goal. We ran five seeds for each agent and show the mean. If we add the LSTM — the second agent I showed — we do much better, but you can see it takes a long time to learn before we reach the inflection point where the agent figures out what's going on, which is typical of LSTMs, by the way. If we add the additional inputs and the additional LSTM it's about the same, a little more unstable. If we add loop closure prediction we get fairly unstable performance, because often there isn't a strong gradient signal — you frequently don't close a loop in these mazes — but it gives enough information that the inflection point suddenly moves to the left by, well, a day of training, which is nice to see. And if we add depth prediction: wow, we're way over there — all of a sudden we reach close to peak performance very quickly, and it's remarkable to see the difference it makes on this task.
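Going back to how the objective is put together: the auxiliary losses are simply added to the RL loss and everything is backpropagated jointly. Below is a minimal sketch of that combination; the loss weights, and the choice to treat the coarse depth as classification over depth bins, are illustrative assumptions rather than the exact settings from the paper.

```python
import torch.nn.functional as F

def total_loss(a3c_loss, depth_logits, depth_target, loop_logit, loop_target,
               beta_depth=1.0, beta_loop=1.0):
    # Coarse depth prediction, treated here as classification over depth bins
    # (a regression loss would be another reasonable choice).
    depth_loss = F.cross_entropy(depth_logits, depth_target)
    # Loop closure: a Bernoulli "have I been here before?" prediction.
    loop_loss = F.binary_cross_entropy_with_logits(loop_logit, loop_target)
    # The auxiliary terms are simply added to the A3C (policy + value) loss.
    return a3c_loss + beta_depth * depth_loss + beta_loop * loop_loss
```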
We see that it helps on all the tasks we tried — not always this dramatically, but it always improves things. D2 is placing the depth prediction on the LSTM instead: it starts off a little later, but that doesn't really matter, and it finishes at a smidge higher performance. We can put D1, D2 and L together — so with loop closure as well — and that doesn't change things much. For reference, that line is where a human expert gets to, which is much better than I do; I only get about 110 points. At DeepMind we have not one but two dedicated video game players who put in a hard 40-hour week playing the various games we throw at them. I have to say there are things they don't like very much. The mazes were not too bad, though there were some tasks that were pretty unpleasant for them because they tend to be quite easy: Atari, oh, those are fun, and some of the other things we've done are fun, but when we ask them to play a really trivial game twenty to a hundred times, they get annoyed with us. Yes, they are professionals, and I must say there's a lot of skill involved; I can't come close to their performance on various things. It's interesting: when we develop a task at DeepMind — say I want to probe how memory is used, or attention, or something like that — we have our human experts learn to play it and then we interview them. We ask: what was your process for learning this, what did it feel like, what were the key things, what were you looking at, what was hard? That can be really interesting and really informative. I'm not sure they've ever had a task where they wanted that, but we would probably let them; we try not to give them extra information. For this task, because depth was an important element, we did give them stereo goggles to look at a representation with a heightened sense of depth, to see what difference that made — which, it turned out, made no difference.

All right, I have only a minute left, so let me skip past that video; I'll show one at the end. An important question is: if depth makes such an amazing difference, why not just give it as an input? Why make it a prediction target instead? The answer is that predicting it actually works much better. We compared against an architecture like the one on the left, where we fed in RGB-D — here's the depth, full resolution, the whole thing — and we actually don't learn as well as when we have to predict it. That's because the important thing is not the depth information itself; it's the gradients. The gradients scaffold the very noisy learning that goes on with reinforcement learning: if, on every frame, I can give the network something meaningful that lets it learn about the structure of the scene and turn a set of pixels into something a little more coherent, then everything works better. We showed that the agent's memory is retaining the goal location, because it gets back to the goal faster, at least in the smaller mazes, and with position decoding we can see that it knows where it is. Here you can see it zooming around: this is the I-maze, where it checks the corners; it has just found the goal on this arm of the maze, and for the rest of the roughly ninety seconds of the episode it will just keep coming back here, because it remembered where the goal was.
It's an easy task, but we wanted to see exactly this behaviour. This is in a larger maze: we show that we can decode the agent's position using that non-backpropped position decoder, and you can see it zooming through very effectively; when it reaches the goal it gets respawned somewhere else and has to come back again. The last thing I want to show here: because in these mazes the layout is static and only the goal position changes, the agent knows exactly where it is — it doesn't even need to look where it's going, so it can travel backwards, because it has essentially memorized the maze. As soon as you train it in mazes whose topology changes over time, or changes with each episode, you see that it pays a bit more attention and doesn't do those same slick sliding moves. That's true — if you put in a cost for hitting the wall, then it does worse. Yes, exactly, to help the memory system — and we have actually shown that the agent does not use that, though humans do. One of the things we asked the human experts was whether the paintings on the wall are useful for recognizing where you are, and they said yes, absolutely; for the agent, we can take the paintings away and you lose about an epsilon of performance. It integrates over time, so if it's in an ambiguous place it can just go down the hall and use the LSTM to accumulate evidence about where it is. And this is what I wanted to show: this is the actual thing being predicted by the agent, the auxiliary output that makes all the difference; you can see it is coarsely predicting something about the geometry of the scene, which is interesting. What if you made the environment really empty? I think it would probably still do fine — I haven't tried that specifically, but we have tried more and less complex wall textures and it has not made a difference. All right, I am all done; thank you very much.

Hello everybody. I don't have a mic, but this room seems good acoustically; I hope everybody can hear me. I'm Raia Hadsell, your latest guest speaker for this course, from DeepMind, as I guess all of the guest speakers for this course are. I have been at DeepMind for about four years and I lead a research group in the deep learning team; my research focuses in particular on aspects of continual learning, lifelong learning and transfer learning, which I think are incredibly important for getting deep learning and deep reinforcement learning to work in the real world. I also work on robotics and miscellaneous other topics that come up. I'm going to talk about three topics, and then, if I have time, I've got a little segment at the end showing some research I've been working on recently, which gives you a fast-forward to the details of a method that's currently out there being published. I'm going to talk a fair amount about topics in computer vision, so this is a continuation of what Karen Simonyan presented, I think two weeks ago — so forget everything Oriol said last week and remember what Karen said — and I will be continuing from that, starting with going beyond simple image recognition, or image classification. A quick overview: I'll talk about end-to-end learning as we move to more complex architectures and more complex tasks.
I'll also do a little case study of an end-to-end trained architecture, the spatial transformer network; then learning without labels — how to learn an embedding, or a manifold, when you don't have supervised labels or don't want to use them; and then, as I said, a topic that brings together reinforcement learning, sequence learning and auxiliary losses for a navigation problem, a maze navigation problem.

First of all, end-to-end learning. It's a familiar term, but I just want to make sure we're on the same page. Can somebody tell me what end-to-end learning means? Right — fundamentally we're talking about methods that we can optimize all the way from some input to the output we want, where everything in the middle is optimized together, usually by making it differentiable so that we can use gradient-descent methods to optimize the whole thing at once, end to end. I have a little slide I use when I'm trying to convince people who aren't necessarily into deep learning why end-to-end learning matters; call it proof by history. In 2010 the state of the art in speech recognition looked like this: an audio signal comes in and I want to predict text from it. The state of the art involved a nice acoustic model, a nice phonetic model and a nice language model — all good machine learning problems in themselves, but a modular approach; each piece was optimized separately. That gave us the state of the art in speech recognition, which was not bad in 2010. Then things changed: the state of the art was handed off to a deep neural network that trained the whole pipeline end to end, going from the output we want, the text, all the way back to the audio, and it improved on the modular system — essentially throwing away the structure the domain experts had insisted on, the explicit phonemes, the separate language model, the distinct components. In 2012, the state of the art in computer vision looked something like this — with different variations, obviously — but it involved extracting key points in an image, computing SIFT features or some other robust feature, maybe training a deformable part model, and only then getting labels out: pixels to labels via a modular pipeline of separately trained models. That was exceeded by a wide margin in the ImageNet challenge by a deep neural network that simply took pixels in and output labels. Again in 2014, machine translation — text in, text out — and since 2014 the state of the art there has been different flavours of deep neural networks. Right now, the state of the art in robotics looks like this: you have your sensors, you do some perception on those sensory streams, you maybe build a map or a world model, you do some planning, and then you send control actions to the robot to produce the actual behaviour. I really like robotics, and I would love to see this pipeline replaced with end-to-end learning too, because I think the potential is obviously there to do exactly the same thing in this domain. It's harder for robotics, and I'm not going to talk about it today, but I like to use this as a reason for why it's good to learn things end to end.
Do you buy that? Is it a convincing argument? All right, let's talk about going beyond ImageNet classification. Karen talked about how to train convolutional neural networks to solve ImageNet-type problems — I believe, and hope, that's what he talked about. So let's first make a point about pre-training. Training big models on big datasets takes a lot of time; it can take several weeks on multiple GPUs to train on ImageNet — maybe not any more, but it used to take several weeks, and it still takes a while and a fair amount of resources. A network trained on a big dataset like ImageNet should be useful for other data, especially if we have similar classes or a similar domain — but in fact people have shown that you can take a network trained on ImageNet and use its features for a wide variety of new problems in quite different domains, and that's really, I think, the exciting thing about the ImageNet work, both the dataset and the approach. So how do we make use of a trained model? We train our big neural network, we then plug it into another network, and we train whatever layers were not replaced by the pre-trained layers; we can either keep the pre-trained weights fixed or update them slowly. It's a simple process: step one, train the convnet on ImageNet, which produces as output a thousand-dimensional ImageNet class-likelihood vector. We keep some number of layers — sometimes all of them, sometimes only some — and initialize a new convnet using those pre-trained layers. Then the last layer, or the last couple of layers, gets replaced: in this case the output might be a 21-dimensional class likelihood for Pascal VOC — I suppose that's classification rather than detection — and we simply retrain that last layer. That can speed things up dramatically, and it can actually give a better result, especially if you don't have enough data in the new dataset.
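Here is a minimal sketch of that recipe using torchvision's model zoo (my choice for illustration; ResNet-18 simply stands in for whatever backbone you have, and the `weights` argument is the newer torchvision API — older versions use `pretrained=True`).

```python
import torch.nn as nn
import torchvision.models as models

# Take a convnet pre-trained on ImageNet and keep its layers, but swap the final
# 1000-way classifier for a new 21-way head (e.g. Pascal VOC classes + background).
net = models.resnet18(weights="DEFAULT")

for p in net.parameters():           # optionally freeze the pre-trained weights...
    p.requires_grad = False

net.fc = nn.Linear(net.fc.in_features, 21)   # ...and retrain only the new last layer

# To fine-tune slowly instead, leave requires_grad=True and give the pre-trained
# layers a smaller learning rate than the new head.
```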
All right, let's look at a couple of other image recognition tasks. Image classification just says: there's a person, there's a sheep, there's a dog — or, in the case of ImageNet, I've always found it strange that you can take an image like that and the desired output label is simply "dog"; it outputs a single label and throws away everything else in the image. So image classification is fairly blunt. Let's think about harder tasks. We might want object localization or detection, which means we want a bounding box around things: a classification of what the object is plus a bound on where it is, and implicitly, if there are multiple sheep, we want to identify all of them. Semantic segmentation is quite a bit more challenging, because we want pixel-wise labels — a pixel-wise boundary around the different elements in the scene. And instance segmentation means we actually want to know which things are distinct: not just sheep versus no sheep, but sheep A, B, C, D and E.

Object detection with convnets. A popular approach, used initially, is to say that detection is just classification everywhere in the image: sweep a sliding window across the whole image at all positions, preferably at all scales as well, and feed each of those individual bounding boxes into a classifier that says yes or no for each class. This is actually not quite as bad as it sounds — it's bad if you do it naively, and with a bit more optimization it can be made not horrible — but it's just not great. The same object gets detected multiple times, so you get twenty jittered detections of the same person, and you're also assuming a fixed set of bounding-box sizes and aspect ratios. Instead, you could directly predict the bounding box: where there's an object, just regress four numbers, the coordinates of the box, using a mean squared error loss on, say, the pixel coordinates of the top-left and bottom-right corners. This actually works — it's a strange thing to ask a neural network to do, or at least I've always thought so, but it sort of works. The problem, though, is that the number of boxes is unknown, and it doesn't work as well as other approaches. The last general method for object detection is to take some bounding-box proposals, which might come from a trained network, and for each proposal of where a bounding box might be, classify whether there is actually an object there or not; the proposals are passed through a classifier and we decide whether there's really something there. This provides something that looks a little like attention: instead of looking at the whole image, I first use one network to say "here are some candidate places to look", then I look at just those places more closely, refine the bounding box, and say yes or no and what sort of object is in there. This is a lot faster, because we're not exhaustively considering the whole image — there's no reason to sweep a window across a big field of empty space or a big blue sky; we immediately home in on possible objects.

So I'll spend a couple of slides on Faster R-CNN — R-CNN stands for region CNN — which has gone through a couple of iterations in the last few years as people refine the basic approach. We start with convolutional layers pre-trained on ImageNet; then there is a proposal stage where one network looks at the feature layers of the convnet and produces a number of proposals, which are bounding-box estimates. You then fill in those bounding boxes with the actual content and send them to a classifier, which refines the location, as I said, and decides what class of object is in there, if any, perhaps with some region-of-interest pooling if there are multiple detections in the same area. Yes, you can — I'll actually go to the next slide, because it offers a bit more detail about what's happening here; I thought there might be a little more detail on the actual equations, but let me talk through it.
Maybe then it's a little clearer. We have a convolutional feature map and we slide a window across it, but it's a fairly coarse window. For each position of that window we look at a selection of what are called anchor boxes, which cover a variety of aspect ratios — a bunch of templates saying "consider these different coarse shapes in the image" — and for each of them, with respect to this anchor location, we predict whether the proposal contains an object. So we take the sliding window and the anchor boxes and we actually consider the content in there, and that's what makes this differentiable: we still have a sliding-window approach, we're just considering a limited number of options. We then go through an intermediate layer with, say, 256 hidden units, and produce two things: for each anchor at that position, a score for whether there is an object there, and a refinement of the coordinates. One of the most important things here — it looks similar to the other approach — is that everything is expressed relative to the fixed sliding-window location; it's anchored there. When we predict the new coordinates, how to refine that bounding box, it is relative to this central position, which makes the network much more able to scale and makes it truly translation invariant across the entire image. That matters: if you ask a neural network to say whether a bounding box should move to pixel 687 versus 689, those aren't numbers a neural network handles with much precision. So this is used instead of the earlier approach — I'm not sure exactly what this cartoon is supposed to show, but I think the point is that there we produced proposals separately, and here we don't: we're refining this method of scanning across the entire image, a bit like a very structured convolutional layer, because it looks everywhere at these different aspect ratios. And to further improve performance, because the whole thing is differentiable, we can backprop all the way through to the convolutional stack and make those feature layers a little better, a little more sensitive, which matters: at the end of training on ImageNet we often don't get crisp locations, and if we want bounding boxes we need crisp locations in the image, so it can be useful to fine-tune those pre-trained layers, and you get a somewhat different feature representation that way.
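A minimal sketch of that proposal head is below, with the commonly used k = 9 anchors per position (three scales times three aspect ratios); the channel sizes are illustrative assumptions, not the exact configuration of any particular implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class RPNHeadSketch(nn.Module):
    """Score k anchor boxes at every position of the feature map and regress box
    refinements relative to each anchor (which keeps it translation invariant)."""
    def __init__(self, in_channels=512, k=9, hidden=256):
        super().__init__()
        self.intermediate = nn.Conv2d(in_channels, hidden, 3, padding=1)  # the 256-unit layer
        self.objectness = nn.Conv2d(hidden, k * 2, 1)    # object / not-object per anchor
        self.box_deltas = nn.Conv2d(hidden, k * 4, 1)    # (dx, dy, dw, dh) per anchor

    def forward(self, features):
        h = F.relu(self.intermediate(features))
        return self.objectness(h), self.box_deltas(h)
```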
All right, next let's talk for a moment about semantic segmentation. Semantic segmentation means labelling each pixel of the input image with one of the available object classes, and the usual way this is done is to classify each pixel for each class, so the output of the system is a full-resolution map — the same resolution as the input image — but with a number of channels equal to the number of classes. Each of those channels is effectively a mask for one class, or rather a likelihood map, since it hasn't been thresholded. One of the important things to keep in mind when doing semantic segmentation with a convolutional network is what happens at the end of the network: we have pooled, we have lost resolution, and we end up with something very semantically laden at the end of, say, an ImageNet convnet, but with essentially no spatial resolution left. Getting back to the full-resolution input size is the trick to be done here. One way is to use the resolution-preserving building blocks that Karen talked about a couple of weeks ago: to reverse the pooling we can use a transposed convolution, which deconvolves, or upsamples, and we can also replace regular convolutions with dilated convolutions with a stride of one.

Let's see how that works. In a usual network — I think this one is a VGG net — we start at the input resolution, and as we go through the layers we lose spatial resolution as we add feature layers. The output here would be 21 layers, one per class, with a probability in each for whether that object is present, but by now it is far too coarsely sampled, far too low a resolution. One option is simply to get to that point and then upsample: use a transposed convolution to increase the spatial resolution and get back to the scale I want, or stop somewhere at an intermediate scale. This does not work that well, as you might guess — why? Because you're going through a bottleneck where you lose a lot of the information about where things are. What you actually get if you train this is what we have here: blobs. They're nice blobs, but they're not really what we're looking for; we have the semantic information about which objects are present, but we've lost the positions. One way to deal with this is to say: I like the information here that tells me what classes are in the image, but I need to know where they are, and that information should still be present in an earlier layer. So I combine this representation with a skip connection from an earlier layer and bring the two together. Without it I would have to do a 32x upsampled prediction, but if I combine an earlier layer with the current output, with a learned combination — this can be a learned connection, usually a linear layer — then I have a space that has the semantic information and more resolution, and I only have to do a 16x upsample to get something more like this. And obviously I can repeat the trick: bring in information from even further back in the architecture, where the resolution was higher still, combine it, and now an 8x upsampled prediction of the actual mask gives a better result at the end.
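Here is a minimal sketch of that fuse-and-upsample step in the spirit of the FCN family; the 2x transposed convolution and the 1x1 scoring layers are standard choices, but the channel counts, and the assumption that the skip layer has exactly twice the spatial resolution of the coarse one, are mine for illustration.

```python
import torch
import torch.nn as nn

class FCNSkipHead(nn.Module):
    """Upsample coarse class scores with a transposed convolution and add in
    scores computed from an earlier, higher-resolution layer."""
    def __init__(self, coarse_channels, skip_channels, n_classes=21):
        super().__init__()
        self.score_coarse = nn.Conv2d(coarse_channels, n_classes, 1)
        self.score_skip = nn.Conv2d(skip_channels, n_classes, 1)   # learned 1x1 on the earlier layer
        self.up2x = nn.ConvTranspose2d(n_classes, n_classes, 4, stride=2, padding=1)

    def forward(self, coarse_feat, skip_feat):
        # skip_feat is assumed to have exactly twice the spatial resolution of coarse_feat.
        x = self.up2x(self.score_coarse(coarse_feat))   # 2x upsample the coarse scores
        x = x + self.score_skip(skip_feat)              # fuse with the higher-resolution layer
        return x   # repeat with earlier layers, then upsample to full resolution at the end
```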
Yes — to be honest, let's see: we take this information and do a 2x upsampling of it, so just repeating information over each 2-by-2 block, and then we combine it by adding in another layer that is, in effect, a copy. So now we have two layers — of course these are really many feature layers — one where I've simply copied and upsampled the information, and one added in that supplies the features from the earlier layer. The transposed convolution I apply then has more to work with: it has the information from both layers, the coarse semantic information and the higher-resolution detail, and if I keep doing this, 2-by-2 at a time, I get something reasonable. By the way, this notion of an architecture with a bottleneck — which is very useful for learning an abstract, semantic, disentangled (however you want to call it) representation of the data — combined with skip connections is a very powerful pattern, and you see the theme come up in different kinds of work. You've looked at autoencoders, yes? There you see a similar thing, even though an autoencoder is trained differently, as an unsupervised approach: you start at some resolution of your data, learn a representation that gets narrower and narrower, pass through a bottleneck, and then go back out to a finer resolution, and you add skip connections to help carry information through to the other side. I find there are a lot of applications of neural networks where I don't just want the small distilled piece of knowledge, I want something a bit bigger, and skip connections to lower layers can be very helpful — and of course this is one of the ideas that grounds residual networks: the residual, or skip, connection.

Yes — I also wonder why not go one step further, and I'm not positive whether they ran those experiments for this paper. There is a nice paper by Jason Yosinski — I think it's called "How transferable are features in deep neural networks?" — which is a really nice examination of the question: if I train this big, long network, where do I want to draw knowledge from when I reuse it for a different task? As you would expect, the earlier layers are more general, and the later you go in the network the more problem-specific the features get (I initially said that the wrong way round), all the way to the layer that is classifying one particular set of classes. There are no magic answers, but it gives interesting insight and some interesting experiments. For the most part, with this sort of architecture, as with most deep learning, it comes down to empirical experiments: is it more useful to draw from pool3, pool4, or both — I suppose you'd have to say both. All right.
Yes — so the classification is present there, but not explicitly; here it is explicit: a class label, a class likelihood, over each possible class. The information is obviously there either way, but what you get is a bit more attention to the details. For instance, if one of your classes is "person", then at this level you'll get a clear "yes, there is a person in the image", but back here you'll get "I see an arm, I see another arm, I see a leg, I see another leg", and that information gets put together. Of course you could put together a leg, a leg, an arm and an arm and decide at the end that it's actually, I don't know, a doll or a robot; you let the highest level make that final decision at the level of the class you're trying to predict, while the lowest level can say "if there's an arm here, this is how I want to segment it", so it makes a decision and then comes back down again. I think that's what we do too — if you ask a kid to outline things in an image, or adults, as you do for all these labelled datasets (there are adults out there doing this — grad students, I don't know — sorry).

And you're saying you don't think training on this would generalize? Right — if that were all you had to learn from, it would be really hard to solve the problem, if all you had were, for instance, crowds of people where everything overlaps. But luckily that's the point of datasets being big enough to capture lots of different situations: sometimes you get prototypical humans, sometimes you get more muddled scenes, more noise, things mislabelled, and so on. What we rely on when we train these things is that, given enough data, we get enough of a learning signal to learn what we want, and indeed these approaches work surprisingly well. Where they don't do well — I don't think there's an example here — is scenes like the ones people most want semantic segmentation for. I think the number one reason people in industry want semantic segmentation is autonomous cars, because everybody wants to take a street scene and understand it: you've got lidar, you've got other sensors, but what you need to know is whether that's a pedestrian and whether they're about to cross in front of the car — understanding the different aspects of the scene. If you look at the results people get on those big street scenes, with cars and buildings and street signs and people and bicyclists, they do really well on parts of it, and then other parts are done really poorly, whereas a human can take just a couple of pixels, given the context of the whole scene, and say "that's a telephone pole, not a person". So I'm impressed by how far this work has come in the last ten years or so, but there's still a lot more to do.

Another way to do this, looking ahead: instead of using regular convolutions at all, you can avoid the loss of resolution by using dilated convolutions throughout the network. That means removing the pooling altogether and replacing the regular convolutions with dilated convolutions with a stride of one.
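A small sketch of what that buys you: with dilation (plus stride one and matching padding) the receptive field grows, but the spatial resolution is preserved. The tensor sizes here are arbitrary illustrations.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)

# A regular 3x3 convolution sees a 3x3 neighbourhood.
regular = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# A 3x3 convolution with dilation 2 looks at every other pixel, so its receptive
# field covers a 5x5 area, yet with stride 1 (and matching padding) the output
# keeps the full 128x128 resolution -- no pooling needed.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(regular(x).shape, dilated(x).shape)   # both: torch.Size([1, 64, 128, 128])
```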
If you think about how that works, you end up with a broad receptive field, meaning you're looking at a larger part of the image to make each decision — which is one of the main reasons we pool in the first place, so that a convolutional filter has a wider receptive window on the scene and picks up higher-level information. Instead, we say: I'm just going to look at every other pixel, but I'm going to slide this filter densely across the whole image. This gives you the same receptive field but no decrease in resolution as you go through the layers of the network. It's a simpler architecture, because you don't need to worry about getting back up to full resolution, and it also gives higher accuracy, because you're now training much more directly for the kind of information you want: you're saying, simultaneously, give me pixel-level precision and give me high-level information, and letting the network weights work out how to do both within this structure. Does that make sense?

All right — video classification with convnets. I went to CVPR, the biggest conference in the computer vision domain, and I was struck by how many papers still focus on single images. The world is not presented to us as single images — why aren't we working on video? There is a lot of work being done, though, and we do have ways of doing classification, segmentation and all of these problems on video. Here are a few different ways to do it. Starting on the left, we can process one frame at a time: pretend the video is just a set of images that may or may not be related, and run a convnet over every single one. This is rather like the sliding-window approach to detection — exhaustively looking for dogs by considering every single frame separately. It's inefficient, and moreover it doesn't work well, because the whole point of considering multiple frames is to build up certainty over time: when I see just a couple of pixels of a lamp post and I'm trying to decide whether it's a lamp post or a person down the road, a few more samples can help me make that decision. Another option is to run my classifier over all of the frames independently — the same convolutional network on each frame of the sequence — and then train one layer at the top, across the outputs of all those frames, to say "a dog has been seen", "a human has been seen", and so on: that is late fusion. Early fusion, instead, uses the network itself to reason about multiple images at the same time: we feed in a block of images, so instead of my input being RGB I now have n frames stacked up, and my convolutions can go across the image in x and y, but they can also go through the time dimension.
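Here is a small sketch of the early-fusion idea, under assumed toy sizes: either fold the frames into the channel dimension of a 2D convolution, or use a genuinely spatio-temporal 3D convolution.

```python
import torch
import torch.nn as nn

# Early fusion: stack n frames and let the first convolution mix them.
n_frames, batch = 10, 4
clip = torch.randn(batch, n_frames, 3, 64, 64)            # (batch, time, RGB, H, W)

# Option 1: fold time into channels -> the input has n_frames * 3 channels.
fused_2d = nn.Conv2d(n_frames * 3, 32, kernel_size=5, padding=2)
out2d = fused_2d(clip.flatten(1, 2))                       # (batch, 32, 64, 64)

# Option 2: a genuine spatio-temporal kernel sliding over x, y and time.
fused_3d = nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2))
out3d = fused_3d(clip.transpose(1, 2))                     # (batch, 32, n_frames, 64, 64)

print(out2d.shape, out3d.shape)
```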
All right — video classification. I went to CVPR, the biggest conference in the computer vision domain, and I was struck by how many papers there were still focusing on single images. The world is not presented to us as single images — why aren't we working on video? There's still a lot of work to be done, but we do have ways of doing classification, segmentation and all of these sorts of problems on video. Here are a few different ways to do it.

First, starting on the left, we can just process one frame at a time: pretend the video is a set of images that may or may not be related, and run a convnet over every single one of them. This is a bit like doing sliding-window detection with a classifier network — exhaustively looking for dogs, say, by considering every frame separately. That's inefficient, but more importantly it doesn't work well, because the whole point of considering multiple frames is that you can build up certainty over time. When I see just a couple of pixels of something down the road and I'm trying to figure out whether it's a lamp post or a person, I want more information; just a few more samples can help me make that decision.

Another way is late fusion: run the same convolutional network independently over each frame in the sequence, then take the outputs across all of those frames and train one layer at the top that says "a dog has been seen", "a human has been seen", something like that.

Or we can do early fusion: let the neural network itself reason about multiple images at the same time. Instead of the input being a single RGB image, I feed in a block of n images stacked up, and my convolutions can then go across the image space — x and y — but also through the time direction. It's just a simple extension of your standard convolutional network; everything is exactly the same except that the input is a block of images rather than a single one. We call this an early fusion model, and it means that all along the way — obviously I would need to fine-tune this — the network can make better decisions because it's looking at features and motion at a lower level. Another approach is slow fusion, where you do some of both: run independent feature extractors — convnets — over each individual frame, but then start to put them together in the middle. I'll point out that this is for video classification, but we consider exactly the same gamut of options elsewhere.

Someone asked whether these approaches work. They do work; the thing that's not great is that all of them assume a fixed temporal window. They all assume that, say, ten frames is enough to detect everything, which means you can't see a glimpse of a dog's tail and then the dog's head twenty frames later and still say "I saw a dog". You can always construct a case where you'd want a wider temporal window, or where a narrower one would be better. That is exactly the motivation for using a recurrent neural network instead, which is probably what Oriol talked about last week — or was that text? OK. And yes, you can definitely use a dilated convolution over time and get a much better temporal field of view, in exactly the same way that we might want a broader field of view over image space. That's what's used in, for instance, WaveNet — not its pixel counterpart, PixelCNN, because that works over pixels — WaveNet does speech or audio generation and it learns via dilated temporal convolutions.

The slight tangent I wanted to mention is that we're talking here about video, a single modality, but this is the same gamut of approaches you consider any time you have two modalities. If I've got, say, audio and video — which is honestly what we should be looking at, not just video; we understand the world through both — then how do I process those two modalities? Do I process them completely separately and only put them together at the end to solve my problem, say speech recognition? Or do I somehow fuse the different sensory modalities early on? We face the same thing in robotics: with a robotic arm I want to process the image of the hand moving as well as the proprioception — my knowledge of what my joints are doing, their positions and velocities — plus tactile information. I've got these different kinds of information coming in; how should I combine them? I think this is an extremely interesting question, because you can come up with arguments for any of these approaches, with and without recurrent networks and so on. There isn't a best answer, but I think there should be a little more principled research and understanding of why and when to use these different architectures.
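To make the early- versus late-fusion distinction concrete, here's a minimal sketch for a stack of frames; the layer sizes, the 3D convolution for the early-fusion branch, and the simple concatenate-then-linear late-fusion head are illustrative assumptions, not a specific published architecture:

```python
import torch
import torch.nn as nn

N_FRAMES, N_CLASSES = 10, 5

# Early fusion: the network sees the whole block of frames at once, so the
# very first convolutions can respond to motion as well as appearance.
early = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),  # (time, H, W)
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(16, N_CLASSES),
)

# Late fusion: run the same 2D convnet independently on every frame, then
# train one layer on top of the concatenated per-frame features.
frame_encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
late_head = nn.Linear(16 * N_FRAMES, N_CLASSES)

clip = torch.randn(2, 3, N_FRAMES, 64, 64)            # batch, channels, time, H, W
print(early(clip).shape)                               # torch.Size([2, 5])

per_frame = [frame_encoder(clip[:, :, t]) for t in range(N_FRAMES)]
print(late_head(torch.cat(per_frame, dim=1)).shape)    # torch.Size([2, 5])
```

The same skeleton applies to two modalities instead of frames: early fusion stacks or concatenates the raw inputs, late fusion runs separate encoders and merges only at the top.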
Interestingly — all right, a quick tangent on the brain. They used to think — and I'm relaying this from a colleague; I am not a neuroscientist — neuroscientists used to think that there was late fusion of the different sensory modalities in the brain: the way we process audio, the way we process vision and so on were independent, and those got fused at the end. That was because you have your visual cortex, your auditory cortex and so on, and the two are relatively separate. Just recently they've discovered that there are actually all of these pathways in between, so it perhaps looks a bit more like this, or like this but with lateral connections: there is some separation, some dedicated processing, but then there are pathways that allow the two to communicate, so you can get feedback early in cortical processing between what you're hearing, seeing and touching. Which makes sense.

A quick example of one specific way of processing video. The idea is to use two sources of information, one being motion and one being the visual appearance; maybe if we process these separately we can get better accuracy. This was for action recognition — in fact I'm sure it was. It's from Karen Simonyan and Andrew Zisserman: they trained two convnets, one of which takes a stack of images as its input and the other a single image. As I recall, the first pathway works on the optical flow — I don't remember the exact details — and the second is pretrained using ImageNet, and then a fully connected layer takes as its input the output layers of these two pathways, unifies them, and produces a single classification of what type of action this is.
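Here's a rough sketch of that two-stream setup. In the published model, as far as I remember, the temporal stream takes a stack of optical-flow fields as its input; the layer sizes, the ten-frame flow stack, and the concatenate-plus-linear fusion below are my own simplifications rather than the exact architecture from the paper:

```python
import torch
import torch.nn as nn

N_ACTIONS = 101  # e.g. UCF-101-style action labels; just an illustrative number

def small_convnet(in_channels):
    # Stand-in for a full classification convnet; both streams share this shape.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
    )

spatial_stream = small_convnet(in_channels=3)        # a single RGB frame
temporal_stream = small_convnet(in_channels=2 * 10)  # x/y optical flow for 10 frames
fusion = nn.Linear(64 + 64, N_ACTIONS)               # one layer on top of both outputs

rgb = torch.randn(4, 3, 224, 224)
flow = torch.randn(4, 20, 224, 224)
logits = fusion(torch.cat([spatial_stream(rgb), temporal_stream(flow)], dim=1))
print(logits.shape)  # torch.Size([4, 101])
```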
OK, that's the end of that section — maybe let's take the five-minute break now, and then I'll jump into the next part.

So, this is a paper from a couple of years ago from Max Jaderberg at DeepMind. To motivate it, think about convolutional neural networks: they have pooling layers. Why? Because we want more invariance — translational invariance. We want to pool together activations over broader areas — or rather, sorry: convolutional layers give you some amount of translational invariance, and pooling accommodates different spatial translations to give a more uniform result and make the learning easier. Pooling does two things. It increases the field of view for the next layer — now I'm looking at information over a bigger projection onto the original image, so I can build a higher-level feature detector — and it also acts to say: whether I saw the arm here, or here, or here, it's still an arm. It works in concert with the convolutional operator, which gives a uniform detection across different locations, and it pools that together so that the network just ends up with a representation saying "yes, there was an arm", without caring where it was.

But this nice system strictly only works for translations, and there are lots of other kinds of transformation we're going to see in a visual scene; and it's hard-coded into the architecture as well. Various people have come up with clever architectures where the weight tying is not only across translations but also across different rotations, and there are different ways to do that — it gets a little ugly, though. The usual thing is just to say: well, if I want to detect MNIST digits that are upside down, or faces that are upside down, I'm going to train on a lot of data, so that I learn to recognize fours versus sevens when they're right-side up, sideways and upside down. You're making more work for yourself, because there's nothing innate in the architecture that accommodates those transformations. So: let's learn to transform the input instead.

Yes, exactly — that is done routinely and it's called data augmentation: I introduce variations into my data so that I can learn across them. It obviously makes the learning harder, but then I have a convnet that does recognize rotations, and that is still the standard approach and a wise thing to do. What I'm about to show is a different, complementary approach, and it's interesting because it's a way of tailoring the invariances you want to your actual problem. Here's the challenge: I'm given data in all sorts of different orientations. Wouldn't it be nice if I had a magical transform that recognized what sorts of transformation are present in my input space and brought everything to a canonical pose, so that my convnet has a little less work to do?

On the question about first-layer filters: that's true — you're right that the first layer of filters, which look like Gabor filters, are extremely general and include all the rotations. But the problem is that to recognize this versus this I need different filters, so I'd need the entire gamut of filters — which we have, but that's for recognizing different kinds of things. The problem isn't at the first level; it's somewhere in the middle, when I start putting features together to recognize a seven. I'd much rather be able to assume that a seven always has its point in the same orientation. If I have to recognize that particular little V feature that's distinctive of a seven in all orientations, then I need all of those orientations at the next layer as well. So having every rotation of my Gabor filters at the first level doesn't give me rotational invariance at the next level, or at the highest level.

So if we make this differentiable, then ideally we can learn a transformation that's useful for exactly my kind of data, rather than declaring externally that I want to build in rotation invariance. What if I don't know exactly what kind of variation there is in my data, or what the most useful canonical pose would be? For faces we have an idea; for other types of data, who knows. I just know I've got a lot of data with a lot of variation in it, and I'd like some way of making it a bit more homogeneous, so that when I apply my convnet it doesn't have to work quite so hard.
So that's the idea: learning T — learning something that will transform the input into a form the rest of the network can make sense of. We can think about this more generally, and this goes back to your question: those first-level Gabor-like filters are pretty useful and pretty general already, so maybe I want to keep them, and maybe I just want something I can insert between two layers that says: take the output of one layer of processing and transform it before feeding it into the next layer — and learn how to do this. Then you might find that the transformation here isn't very useful, in which case you'd hope to just learn an identity there, while this one might be the useful place to apply, say, a rotation to a canonical pose.

So here is the spatial transformer network in a nutshell. Again, imagine it's planted between two other layers of processing. I have some input U — the previous feature layer. What I want to do is first predict θ, the parameters of some set of transformations; θ parameterizes the function τ, my grid generator, which produces a sampling grid; and the output of that feeds an actual sampler, which takes the features in U and turns them into V, my input to the next layer of processing.

On illumination: sure, you're right, you could use it for that — you might get a better normalization than the bias correction you get essentially for free in a convolutional network, which applies to the whole feature layer; so yes, you might get a nicer normalization if you could do something a bit different per input. But illumination usually gets handled fairly well by a convnet already, as long as you don't train it on dark images and then test outside that set.

To make this differentiable, we want components that are differentiable, so we consider these three pieces. First, the localization net: a trainable function that is really doing all the work here — it's the part where we're actually optimizing parameters — and it maps our features to the parameters of our transformation function. The grid generator takes those parameters θ, inferred from the current input, and creates a sampling grid. And the sampler actually performs the transformation and feeds the result into the next layer.

Here θ is in R^6, because we'll start by thinking about affine transformations; there are other choices. That's the one thing you do need to define: what transformation function am I embedding here? The rest — the actual parameter values, what the transformation is — is determined by the localization net for each image that comes in. That's why we're not just applying one general rotation to all the images; each one gets rotated or transformed separately. But you could use the same approach with many different kinds of function embedded there.

So, first of all, the localization network is a trainable function that acts on the output U and produces θ, and its forward pass just looks like normal inference with a neural layer of your choice — some set of weights.
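A minimal sketch of that three-part module, written with off-the-shelf grid-generation and sampling ops; the tiny localization net and the initialization to the identity transform are my own illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        # Localization net: maps the incoming feature map U to six affine parameters theta.
        self.loc_features = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.loc_fc = nn.Linear(8, 6)
        # Start at the identity transform so the module initially does nothing.
        nn.init.zeros_(self.loc_fc.weight)
        with torch.no_grad():
            self.loc_fc.bias.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, u):
        theta = self.loc_fc(self.loc_features(u)).view(-1, 2, 3)          # localization net
        grid = F.affine_grid(theta, u.size(), align_corners=False)        # grid generator
        v = F.grid_sample(u, grid, mode='bilinear', align_corners=False)  # sampler
        return v

u = torch.randn(4, 3, 28, 28)
print(SpatialTransformer(3)(u).shape)  # torch.Size([4, 3, 28, 28])
```

Initializing the last layer at the identity means the module starts out doing nothing and only learns a transformation if it helps whatever loss sits downstream.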
The second component is the grid generator. It's parameterized by the θ we've just inferred, and it generates an output map: we take the parameters θ and produce something with the same dimensions as V, the thing we're mapping into. Still within the grid generator, this is the forward pass we'd use for affine transforms specifically — the six estimates in θ give you rotation, translation and scale. You can think of the grid generator's output as a sampling grid: for each output location, the coordinates (x^s, y^s) tell me where to index into my input map to get the value for my new output. The sampler then actually does the sampling: it applies a kernel function that says how to go from U to V based on the mapping given by those x^s, y^s coordinates. The forward pass there is a general formula involving some kernel, and if we pick a particular sampler — say bilinear interpolation — we get a specific formula for the forward pass. As long as that is differentiable with respect to x and y, you're good, and I think it is for all the kernels you'd use.

Now we need to go in the opposite direction. We've covered the forward pass — localization network, grid generator, sampler — which creates the new input, and then we proceed onwards. Coming backwards through the network, we first backpropagate through the bilinear sampler to get the derivatives of V with respect to U and with respect to x and y; so we have gradients going this way and gradients going that way — the derivative of V with respect to U through the bilinear interpolation, and the derivative with respect to x_i (and y_i works the same way). This has discontinuities in it, so we use sub-gradients, because of the max. Next we backpropagate through the grid generator, the function τ, because we want the derivatives of x and y with respect to θ — with respect to the output of the localization network. And last, we backpropagate through the localization network in the usual way, because it's just neural layers — a set of weights, a bias, maybe a non-linearity — which carries the gradient back into U through that pathway as well, and then we can obviously continue to backprop through whatever else is in the network stack at that point. Really this is just a matter of choosing components at the outset that are reasonably differentiable, even if there are discontinuities, and making sure they produce those gradients.
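To be concrete, here is my reconstruction of the two forward-pass formulas on the slides, in the notation of the spatial transformer paper — an affine grid generator followed by a bilinear sampling kernel:

$$\begin{pmatrix} x_i^{s} \\ y_i^{s} \end{pmatrix} = A_\theta \begin{pmatrix} x_i^{t} \\ y_i^{t} \\ 1 \end{pmatrix}, \qquad A_\theta = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}$$

$$V_i^{c} = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^{c}\, \max(0,\, 1 - |x_i^{s} - m|)\, \max(0,\, 1 - |y_i^{s} - n|)$$

The max terms are where the discontinuities come from: the derivative with respect to $x_i^{s}$ is piecewise constant, which is why sub-gradients are used in the backward pass.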
So let's take a look at how this works — maybe this video will play; I don't have any control over it. All right, the video actually started earlier. What do we see happening here? These are two different — that's an affine function being used there. OK, I can step through it. There were a bunch of experiments done on this, almost all of them with MNIST, although not all. The idea is to try different spaces of transformations — a thin-plate spline versus an affine transformation, say — as the chosen family of functions, and see how it does. What we see first, on the left, is the input. Note that the only way we're training this is by trying to predict what the digit is — just trying to predict whether it's a five — which means it's up to the network to decide what to do with the spatial transformer. Like I said, this could just end up being the identity, and that's exactly what you get if your inputs are relatively well normalized and centred. But if you start moving them around, what you learn is a transformation whose output, after the spatial transformer, is quite stable — not completely stable, but stable enough for the rest of the convnet to do a better job. That's really the important take-home here.

This was also used for MNIST addition: now I've got two channels fed in together and two different spatial transformers; one learns to stabilize channel one, the other learns to transform channel two. In this case the only thing we train on is the output — what image A plus image B equals, three plus eight — which makes it a harder problem and just demonstrates that you can still learn through this more complex architecture and get something reasonable. Lots of moving things; more moving things with noise. (I can't move past this... there we go.)

On the question about rotations: yes, you can — I don't remember exactly; it's a good question. Obviously six versus nine is a bit of a problem, and I'm not sure whether they constrained it to avoid full rotations for that reason — that would be a problem for you as well. I'd point out there's no magic here.

And on why you need a kernel: because otherwise, if you don't use a kernel to sample the input, what you get in the output has holes in it and is less accurate. If you're sampling an image through some warping transform, you need a kernel: bilinear interpolation gives you something smoother than nearest-neighbour, which gives you something smoother than just taking the single targeted pixel. It's just about retaining more of the information content. Imagine the transformation is zooming in: you're sampling a lot in one area, the spaces between pixels get blown up and distorted, and it won't look smooth. It's the same thing you see in an image-processing program on your computer — you get quite different results with different types of kernel sampling. And no — the only learned part is θ, in the localization net; the rest is just turn-the-crank machinery put in place so that we can backprop through it. The sampling is what actually transforms the output of U into something normalized that we feed into V — not sampling in the statistical sense. All right, are we good to go? It's a nice paper if you want a good read on this method — I enjoy the spatial transformer paper.

All right: learning relationships. Rather than learning classification or regression — sometimes that's not the goal — sometimes we just want to know about similarity and dissimilarity, for instance.
Sometimes we want to do visualization, in which case it's not really classification we're interested in; we want to know how a bunch of data is all related. For the purpose of visualization I might want to infer relationships between things — how does this point really relate to x and y? I might want to do clustering, or I might want to understand something fundamental about the underlying structure of a set of data. These are all cases where it may not be helpful, or even possible, to do a standard supervised-learning approach like classification or regression. Fundamentally they all come down to taking something high-dimensional and producing something low-dimensional that still captures the properties of the data we care about. People are quite loose with their terminology here, but this is generally called an embedding of the data, or the manifold of the data.

One way to do this, which people often use: if you've trained an ImageNet network for classification, you can simply take off the classification layer and say, aha, there's my embedding — there's my feature space, maybe a hundred dimensions or more — and I'll just call that my manifold of the data. That works in some cases and you might well do it, but you may not have any training labels, or you might want to generalize to new classes, in which case having that classifier trained doesn't really make sense. A different way is to think of it not as supervised learning, where each input is associated with a label, but in terms of embedding losses, which associate inputs with each other. The fundamental idea: pull similar things together and push dissimilar things apart in that manifold space. These two merely look similar — in pixel-wise terms they'd be close, maybe not nearest neighbours but pretty close in pixel space — and we want to learn a space where they end up far apart; these two are both buses, and we want them mapped together. That's an example where we're actually using the object labels, so we could just do supervised learning on them or we could use the labels in this other way — and there can be other reasons why we have information about which things are similar and which are different; I'll come back to that in a moment.

So how do we design losses for this, if what I have is a relationship between two inputs rather than a label I'm trying to predict? Typically all of these approaches involve distances in feature space, often an L2 distance. One example is a contrastive squared loss. I take two inputs x_i and x_j and a label — but it's a similarity label y_ij, saying either that these two are similar or that they're different. The loss says: if y_ij = 1, meaning they're similar, I want them to be close in my feature space, and I pay a quadratic cost for them being far apart; if y_ij = -1, meaning they're dissimilar, I pay a cost for having them close together, and I want to push them further apart, up to a margin.
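Written out — this is my reconstruction of the standard formulation, with $D_{ij} = \lVert f(x_i) - f(x_j) \rVert_2$ the Euclidean distance in the embedding space and $m$ the margin:

$$L(x_i, x_j, y_{ij}) = \begin{cases} D_{ij}^{2} & \text{if } y_{ij} = +1 \ \text{(similar pair)} \\ \max(0,\; m - D_{ij})^{2} & \text{if } y_{ij} = -1 \ \text{(dissimilar pair)} \end{cases}$$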
If you don't have that margin m on your space, you'll be trying to push dissimilar things infinitely far apart and it explodes — it's not well constrained. So this is the contrastive squared loss: two different penalties depending on which kind of pair I have, either pulling the pair closer together in the space given by my function f, or pushing it further apart, and I train my network with this. It's a bit like an energy-based system, where I'm rearranging my feature space so that these two kinds of constraint work out. To the question: the x-axis there is the distance in feature space — the Euclidean distance between f(x_i) and f(x_j) — and the y-axis is the loss. This can be trained using what's been called a Siamese network: I have an identical copy of f, I pass x_i through f and x_j through f, I compute the Euclidean distance between them in feature space, and then I backprop through the network; since the two copies share weights, both sides get updated.

We can use different losses. This one uses a cosine similarity loss, where the distance D is the cosine similarity between f(x_i) and f(x_j). There's some work comparing and contrasting these different similarity losses; I've honestly forgotten the outcome, other than that method C — the next one — worked better; I don't remember the comparison between the first two I showed.

The third formulation, and I'd say the most common one now, is the triplet loss. The idea is that I have three inputs: an anchor x_a, a similar example x_s, and a dissimilar example x_d, and the only thing I know is the relative information that x_s is more similar to the anchor than x_d is. You push and pull the system — train the weights — so that you go from this configuration to that one. You're never trying to pull two elements completely together, and you're never trying to push two elements infinitely apart; you're just saying, relatively, for these three things, pull one closer and push one further away. This works very well — it's nicely balanced, the training behaves, nothing explodes — and the loss function is just what you'd expect: you pay a penalty if the distance to the dissimilar example isn't larger than the distance to the similar one by some margin.
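A minimal sketch of that triplet setup with one shared (Siamese-style) encoder; the toy encoder and the margin value are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64))  # toy shared encoder f

def triplet_loss(x_anchor, x_similar, x_dissimilar, margin=0.2):
    # One encoder, applied three times (weights are shared, Siamese-style).
    fa, fs, fd = embed(x_anchor), embed(x_similar), embed(x_dissimilar)
    d_sim = F.pairwise_distance(fa, fs)   # distance to the similar example
    d_dis = F.pairwise_distance(fa, fd)   # distance to the dissimilar example
    # Pay a penalty only when the dissimilar pair is not at least `margin`
    # farther apart than the similar pair: pull one in, push one out, relatively.
    return F.relu(d_sim - d_dis + margin).mean()

xa, xs, xd = (torch.randn(8, 1, 28, 28) for _ in range(3))
loss = triplet_loss(xa, xs, xd)
loss.backward()  # gradients flow into the single shared encoder
```

Since the same encoder is applied to all three inputs, one backward pass updates a single set of weights — which is all the "Siamese" part really means.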
And how are these used? One interesting fact is that essentially all the face recognition systems out there these days use this approach. Why? People used to do face recognition by saying: I'll take a hundred people, take a bunch of photos of each one, and train a classifier that recognizes each of those people by name, by ID. Then the way I tell whether two photos show different people is whether the classifier gives them different IDs, or the same ID if they're the same person. That's a problem when you have lots and lots of people — Facebook has far too many people for anything like that to scale; you can't use a classifier. Instead, all you really want to know is: given two images, are they the same person or different people? So you use this method of training an embedding space, and then all you have to look at is the distance in that really nice feature space you've made, and that tells you how likely it is that two images are of the same person. I no longer need to learn explicitly with the IDs — obviously the IDs are still there, but you don't need them as classification targets.

For instance, this is from FaceNet — DeepFace is, I think, the best one currently, though that may be a little out of date, and FaceNet is also very good. These are all images of the same person, and they're all taken as nearest neighbours of one image in the feature space. You can tell from that — if those are all nearest neighbours of a single point — that it has learned something really robust to all the different ways a person can appear. On the question about graded similarity: you get that to some extent with the triplet loss, but yes, you could take the original contrastive squared loss and use a continuous similarity value instead of a binary one, and that simply changes how it behaves — there have been a couple of papers doing that.

And on the other side, that's how well it works — these are the false accepts. Each of these pairs was incorrectly matched by FaceNet: it thought the two images were the same person, and they're clearly completely different people — although, yes, we would make most of the same mistakes. These facial recognition networks are now significantly better than humans on the same problem, at least on a dataset; I think humans still do better when we actually know the person. For the people we work with, our families, our friends, we still — luckily — beat the convnets at robustly recognizing identity, but on a dataset, we lose.

All right, I have maybe fifteen minutes left; I was going to run through something I worked on — does that sound all right? OK. It probably uses a bunch of things from deep RL that you haven't covered yet, but yes. And yes — there are a lot of different ways of getting those relationship labels, which is one of the cool things about this. You can take lots and lots of images and say: if two objects appear next to each other, then these two things have some similarity; if I never see two things together, they should be different and distant in the feature space — and then you get something that groups together office stuff versus outdoor stuff, and so on. Somebody also used this for biological data from tests being run on cancer patients: I don't understand this data, I don't know what the structure is, but I do know which readings come from individual patients. So they just grouped them: if readings come from the same patient, they should have some similarity — I think it was two different tests that weren't obviously correlated.
And out of that they uncovered a previously unknown structure across different types of cancer, purely from saying there's a relationship between these readings because they come from the same person. You can also use temporal information: in streaming video you can just say that frames close together in time should be more similar than frames further apart, and then you get what's often referred to as slow features — a quite different kind of feature that is very invariant over short spans of time and small transformations. So it's a very broad area; you can do a lot of different things with these approaches.

All right. I like navigation — it's a fun problem; we all navigate. I wanted a problem in a simulator that I could try different deep reinforcement learning approaches on, and I started working on this at a time when DeepMind was just working on Atari; I really wanted to go beyond Atari to a set of interesting problems. Mazes have the property that if you can look at the maze from above, you can solve it; if you're looking at it from inside, it becomes much more challenging, because you only have partial observability of the domain — you need to remember things over time and learn to recognize structure purely from visual input. So with my colleagues at DeepMind we built a simulator that procedurally generates these mazes, and we made up a game: I start somewhere in the maze — anywhere — and I try to find the goal. If I find the goal I get +10 points and I'm immediately teleported somewhere else, and I have to find my way back again, repeating that as quickly as I can for a fixed episode length. Wander around the maze, find the goal, get teleported elsewhere, find your way back, get teleported, find your way back — that's the game. There are also some apples lying around; they help get the learning off the ground. We found out later they're not necessary, but they're there because we assumed initially that we would need them to start the learning process.

We can look at different variants: a static maze with a static goal, where the only thing that changes is where the agent gets spawned and teleported to; a fixed maze layout with a goal that moves around on every episode; or everything random all the time. The inputs I get are the RGB image plus my instantaneous velocity, in an agent-relative coordinate frame, and the actions are moving sideways, forwards and backwards, and rotating to look around. There are a few different mazes: a large one that takes about five minutes per episode — almost 11,000 steps — so a longer span of time and a bigger maze; and the little one we call the I-maze, where the goal only appears in the four corners and the agent always starts somewhere in the middle. There you know exactly the behaviour you want: methodically check the four corners, and once you've found the goal, go straight back to it every time. And you can see from these traces — this is after learning — that it finds the goal and then goes back to it again and again throughout the episode.
So that's the problem, in a very quick nutshell. The rewards are sparse, which makes this hard. We can train an agent on the game I presented using fairly standard deep reinforcement learning methods, but it's slow, it's hard to do, and it's very data-inefficient. We discovered that we could substantially accelerate the learning by using auxiliary losses. That means that instead of only maximizing reward — learning a value function and updating the policy — I'm also, at the same time, going to predict simple things about my environment, using supervised or unsupervised learning depending on what you want to call it. We decided to try depth prediction and loop-closure prediction; I'll tell you what that means in a moment.

First, the architecture. The input is RGB images, fed through a three-layer convolutional encoder, and then we add a two-layer LSTM — we need memory for this task: I have to get to the goal and then remember where it was so I can get back to it efficiently. So we know we need memory; we use an LSTM, and we use two layers because two turned out better than one. There's also a skip connection from the convnet — skip connections are useful general tools, and it helps the learning. We can add some additional inputs to the system: the instantaneous velocity, as I said, in agent-relative coordinates — how fast I'm moving laterally and rotationally — plus the previous action and the previous reward. And we trained this with A3C, asynchronous advantage actor-critic, which you will know about by the end of this course if you don't already: a policy-gradient method that uses the k-step advantage to update the value function and the policy.

The thing we're really interested in here is the auxiliary tasks. We're going to predict depth. The environment actually provides RGB-D — depth in image space, how far away things are — but we do not give the depth as an input; instead we try to predict it, either with an MLP on the convolutional features or off the LSTM. And we only predict a subsampled, coarse version of that depth channel. We also experimented with loop-closure prediction: a Bernoulli loss where at each time step we predict whether we've been at this place in the maze before during this episode — and that one comes off the LSTM, because it needs memory. Then we add a position decoder, which we do not backpropagate through; it's just there so we can ask whether the agent's position can be decoded from what it's "thinking" — a little stethoscope on the network. All of this produces a fair amount of plumbing; I'll skip the details — they're in the paper if you're interested. And we combine all the different losses — the auxiliary losses and the RL loss — by simply adding them together and backpropagating.
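A very rough sketch of that bookkeeping. The head sizes, the coarse 4x4 depth target, the mean-squared-error depth loss, and the weighting coefficients are all illustrative assumptions on my part, not the exact choices from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative heads; sizes are made up. The depth head predicts a coarse 4x4
# depth map from the convolutional features; the loop-closure head is a single
# logit off the LSTM state ("have I been here before in this episode?").
depth_head = nn.Linear(256, 16)
loop_head = nn.Linear(128, 1)

def total_loss(policy_loss, value_loss,
               conv_features, lstm_state,
               depth_target, loop_label,
               beta_depth=1.0, beta_loop=1.0):
    depth_loss = F.mse_loss(depth_head(conv_features), depth_target)
    loop_loss = F.binary_cross_entropy_with_logits(loop_head(lstm_state), loop_label)
    # The auxiliary losses are simply added to the A3C policy and value losses
    # and everything is backpropagated together through the shared network.
    return policy_loss + value_loss + beta_depth * depth_loss + beta_loop * loop_loss

# Dummy values just to show the bookkeeping.
loss = total_loss(policy_loss=torch.tensor(0.5), value_loss=torch.tensor(0.1),
                  conv_features=torch.randn(1, 256), lstm_state=torch.randn(1, 128),
                  depth_target=torch.rand(1, 16), loop_label=torch.ones(1, 1))
print(loss)
```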
Someone asked how we chose where to attach these auxiliary predictions: there aren't that many different places you can attach them, and it's relatively easy to test the effect of each. One of the main questions was whether it would still work when the gradients have to pass through two LSTMs to reach the visual feature layers — we knew from related work that this kind of auxiliary prediction could accelerate learning, but we didn't know that part — and it turns out it works very well.

All right, the different architectures. On the left, plain vanilla A3C: feedforward, no memory. B: we add an LSTM, making it recurrent. C, which we call Nav-A3C: we've added the additional inputs, the additional plumbing, and an additional LSTM. And the last one adds the auxiliary losses — the auxiliary tasks.

So how does this look on the large maze? I won't show the video yet. These are learning curves over a hundred million steps of experience in this maze; we ran five seeds for each agent and show the mean. This is what we get with the vanilla agent, without memory: it can't really solve the task; it just learns to be a speedy explorer, but it can't remember where it is and get back to the goal. If we add the LSTM — the second agent I showed — we do much better, but it takes a long time before we reach the inflection point where the agent actually figures out what's going on; that's what we typically see with LSTMs, by the way. If we add the additional inputs and the additional LSTM, it's about the same, maybe a little more unstable. If we add loop-closure prediction, performance is fairly unstable — often there isn't a strong gradient signal, because you often don't actually close a loop in these mazes — but it provides enough information that the inflection point suddenly moves to the left by, well, a day of training time, which is nice to see. And if we add depth prediction — wow, we're way over there: all of a sudden we're getting to almost peak performance very, very quickly. It's remarkable what a difference it makes on this task, and it helped on every task we tried — not always this dramatically, but it always improved things. D2 is placing the depth prediction on the LSTM instead: a little slower to get started, but it doesn't really matter, and it finishes just a smidge higher. Putting D1, D2 and the loop closure L together doesn't change things much either. And for reference, that's where a human expert gets to — much better than I do; I only get about 110 points.

At DeepMind we have not one but two dedicated video game players who put in a hard 40-hour week playing the various games we throw at them. I have to say, there are some things they don't like very much. The mazes weren't too bad, although some were fairly unpleasant for them because they tend to be quite easy. Atari — those are fun, and some of the other things we've done are fun — but when we ask them to play a really trivial game twenty to a hundred times, they get annoyed with us. They are professionals, and there's a lot of skill involved; I can't come close to their performance on a lot of things. Interestingly, when we develop a task at DeepMind — say we want to look at how memory is used, or attention, or something like that — we have our human experts learn to play it and then we interview them.
We ask them: what was your process for learning how to do this? What did you feel, what were the key things, what were you looking at, what were you observing, what was hard? That can be really interesting and really informative. I'm not sure they've ever had a task where they wanted to do that, but we would probably let them. For this one, because depth was an important element, we actually gave them stereo goggles to view the environment with a heightened sense of depth, to see what difference that made — and it made no difference.

All right, I have about one minute left, so let me skip past that video; I'll show one at the end. An important question here is: if depth makes such an amazing difference, why not just give it as an input? Why make it a prediction target instead? The answer is that predicting it actually works much better. We compared against an architecture like the one on the left, where we fed in RGB-D — here's the depth, full resolution, the whole thing — and that doesn't learn as well as when the agent has to predict the depth. That's because the important thing isn't the depth information itself, it's the gradients. The gradients scaffold the very noisy learning you get with reinforcement learning: if on every frame I can provide something meaningful that lets the network learn about the structure of the scene and turn a set of pixels into something a bit more coherent, then the whole thing works better.

We showed that the agent's memory really is storing the goal location, because it gets back to the goal faster, at least on the smaller mazes; and with the position decoding, that it knows where it is. You can see it zooming around here — this is the I-maze, where it should check the corners; it just found the goal on this arm, and for the remaining ninety seconds or so of the episode it will just repeatedly come back here, because it remembered where the goal was. An easy task, but we wanted to see exactly this. This is a larger maze: we show that we can decode the agent's position using that non-backpropagated position decoder, and you can see it zooming through very effectively — when it reaches the goal it gets respawned somewhere else and has to come back again.

The last thing I wanted to point out: in the mazes where the layout is static and only the goal position changes, the agent knows exactly where it is — it doesn't even need to face forwards; it can travel backwards, because it has genuinely memorized the maze. As soon as you train it on mazes whose topology changes over time or with each episode, you see it pay a little more attention; it doesn't do those same slick sliding moves. And that's true — if you put in a cost for hitting the wall, it does worse. Yes, exactly — the paintings on the walls are there to help with the memory system, and we've actually shown that the agent does not use them, whereas a human does. We asked our human expert whether the paintings were useful for recognizing where you are, and they said yes, absolutely; but if we take the paintings away from the agent, we lose maybe an epsilon of performance. It integrates over time: if it's in an ambiguous place, it can just go down the hall and use the LSTM to accumulate evidence about where it is.
This is what I wanted to show: this is the actual thing being predicted by the agent — the auxiliary output that makes all the difference. You can see it's coarsely predicting something about the geometry of the scene, which is interesting. What if you made the environment really empty? I think it would probably still do fine. I haven't tried that specifically, but we have tried more and less complex wall textures, and that hasn't made a difference. All right, I am all done — thank you very much.