PyTorch at Tesla - Andrej Karpathy, Tesla

**Developing Neural Networks for Autopilot: A Challenging Task**

Training neural networks for Autopilot is a complex task that requires significant resources and expertise. Round-robin training, in which every worker in the pool trains task 1, then task 2, and so on, gets out of hand with close to 100 tasks. Instead, the team maintains a pool of heterogeneous workloads: some workers train object detection, others road layout, others depth, and so on. These workloads train different pieces of the network at the same time and can be arranged synchronously or asynchronously to squeeze the most out of the available hardware.
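One way to picture the task pool is as an allocation problem: a fixed number of workers is split across tasks so that every task trains concurrently. The following is a minimal sketch, not Tesla's scheduler. The function name `assign_workers`, the task names, and the per-task weights are all hypothetical and chosen only for illustration.

```python
def assign_workers(task_weights: dict, num_workers: int) -> dict:
    """Split a worker pool across tasks in proportion to (hypothetical) weights.

    Every task gets at least one worker, so all tasks train at the same time
    instead of taking turns as in round-robin training.
    """
    if num_workers < len(task_weights):
        raise ValueError("need at least one worker per task")
    total = sum(task_weights.values())
    # Start with one worker per task.
    allocation = {task: 1 for task in task_weights}
    # Hand out the remaining workers one at a time to the task whose share of
    # the pool is furthest below its share of the total weight.
    for _ in range(num_workers - len(task_weights)):
        neediest = max(
            task_weights,
            key=lambda t: task_weights[t] / total - allocation[t] / num_workers,
        )
        allocation[neediest] += 1
    return allocation


# Hypothetical weights: object detection is given twice the share of the others.
pool = assign_workers({"objects": 2.0, "road_layout": 1.0, "depth": 1.0}, num_workers=8)
print(pool)  # {'objects': 4, 'road_layout': 2, 'depth': 2}
```

In a real system the weights would presumably reflect how much training each task needs, and the synchronous or asynchronous coordination between workers is a separate design choice that this sketch does not model.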

The current setup involves training 48 different networks that together make 1,000 different predictions, counted as output tensors. Training this neural network stack takes 70,000 GPU hours, so a single node with 8 GPUs would need about a year. The stack is also not trained just once: researchers and engineers iterate on it continually, so the team trains at scale and automates much of the surrounding workflow.
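The "one year on a single node" figure follows directly from the numbers quoted in the talk. A short check, assuming perfect linear scaling across the 8 GPUs (an idealization that real training rarely achieves):

```python
GPU_HOURS = 70_000   # quoted in the talk
GPUS_PER_NODE = 8    # quoted in the talk


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Ideal wall-clock time in days, assuming perfect scaling across GPUs."""
    return gpu_hours / num_gpus / 24


days = wall_clock_days(GPU_HOURS, GPUS_PER_NODE)
print(f"{days:.1f} days")  # 364.6 days
```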

One of the major challenges is calibrating all the different thresholds and ensuring that none of the 1,000 predictions regresses. The team has a process for calibration and relies on in-the-loop validation, other types of validation, and evaluation. Starting from the dataset, training, calibration, and evaluation can be chained together in the style of continuous integration.
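A "no prediction may regress" rule can be sketched as a gate that compares a candidate model's per-prediction metrics against a baseline. This is a hedged illustration only: the function `find_regressions`, the metric names, and the scores are hypothetical, and it assumes a single higher-is-better score per prediction.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> list:
    """Return the names of predictions whose score dropped by more than `tolerance`.

    Assumes higher scores are better. A prediction that is missing from the
    candidate is treated as a regression.
    """
    regressed = []
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None or new_score < old_score - tolerance:
            regressed.append(name)
    return sorted(regressed)


# Hypothetical scores for three of the many predictions.
baseline = {"lane_lines": 0.91, "road_edges": 0.88, "depth": 0.80}
candidate = {"lane_lines": 0.93, "road_edges": 0.85, "depth": 0.80}

print(find_regressions(baseline, candidate))  # ['road_edges']
```

A continuous-integration job could refuse to ship the candidate whenever this list is non-empty; choosing the tolerance is part of the threshold calibration the talk mentions.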

The north star is to automate this entire process, from the dataset through training, calibration, and evaluation. The team somewhat jokingly calls this "Operation Vacation": as long as the data labeling team keeps curating and improving the datasets, everything else can in principle be automated, so the team could go on vacation while Autopilot improves by default.

**Inference Capabilities**

Autopilot also requires inference: running the trained neural networks in real time in the vehicle. For this, Tesla's hardware team developed its own backend, the FSD computer, which offers about 144 int8 TOPS. Compared to the GPUs used before this chip was introduced, that is roughly an order of magnitude improvement at lower cost.

All the latest cars coming off the production line use this chip, and all the neural networks target it. The networks are also shipping at scale: Navigate on Autopilot has accumulated 1 billion miles and confirmed 200,000 lane changes, and it is a global product available in 50 or more countries.

**Future Developments**

The hardware team is also working on a project called Dojo, a neural network training computer and chip. The aim is to do for training what the FSD computer did for inference: improve efficiency by roughly an order of magnitude at lower cost. Karpathy was not ready to share further details of the project in the talk.


**The Future of Autopilot**

Beyond the Navigate on Autopilot milestones, Smart Summon, released about two weeks before the talk, had already seen 800,000 sessions of people calling their cars to them. More broadly, the team is focused on developing full self-driving capability, built on computer vision and machine learning over the raw video streams from the eight cameras around the vehicle, without lidar or high-definition maps.

Karpathy closed by thanking the PyTorch team for being incredibly responsive and helpful, which allowed Tesla to develop these networks, train them at scale, and deploy them in the real world.

**Conclusion**

In summary, Tesla owns the full lifecycle of Autopilot's neural networks in-house, from data collection and labeling through on-premise training to deployment on custom hardware. The networks are large and complicated to train, but they ship to a fleet of almost three quarters of a million cars.

**Transcript**

Okay, hello everyone. I am Andrej, I am the director of AI at Tesla, and I'm very excited to be here to tell you a little bit about PyTorch and how we use PyTorch to train neural networks for the Autopilot. Now, I'm curious to do a bit of a show of hands: how many of you actually own a Tesla? Okay, a few. And how many of you have used or experienced the Autopilot, the product? Okay, thank you.

So for those of you who may not be familiar with Autopilot, the basic functionality of the Autopilot is that it keeps the car in the lane and it keeps the car away from the vehicle ahead of you. Some of the more advanced functionality that we've been building for the Autopilot includes Navigate on Autopilot, which allows you to set down a pin somewhere on the map, and then as long as you stick to the highway, the car will do all of the lane changes automatically and it will take all the right forks to get you there. So that's Navigate on Autopilot. With Smart Summon, which we only released about two weeks ago, you can summon the car to you in the parking lot: you hold down "Come to Me" and the car comes out of its parking spot and comes to find you in the parking lot. You get in like royalty, and it's an amazing, magical feature. More broadly, the team is very interested in developing full self-driving capability, so that's what everyone on the team is focused on.

Now, famously perhaps, we don't use lidar and we don't use high-definition maps, so everything that we build for the Autopilot is basically based on computer vision and machine learning on the raw video streams that come from the eight cameras that surround the vehicle. This is an example of what we might see in one single instant, and we process this, as you might imagine, with a lot of convolutional networks.

Now, Tesla is a fairly vertically integrated company, and that's also true when it comes to the intelligence of the Autopilot. So in
particular, of course, we build our own cars and we arrange the sensors around the vehicle, but we also collect all of our data, we label all of the data, and we train on on-premise GPU clusters. Then we take it through the entire stack: we run these networks on our own custom hardware that we develop in-house, and we are in charge of the full lifecycle of these features. We deploy them to our fleet of almost three quarters of a million cars right now, and we look at telemetry and try to improve the features over time, so we kind of close the loop on this.

So I would like to dive slightly into some of the distributed training that we employ in the team. The bread and butter for us is, of course, analyzing images. Here's an image: in order to drive in this environment, you actually have to understand a lot about it, so perhaps we have to understand the traffic lights, the lane line markings, cars, and so on. You end up in this massively multitask setting very quickly, where you just have to know a lot about the scene. So a lot of our networks take on this outline, where you have a shared backbone with a number of tasks hanging off of it. Just to give you an idea of the workloads and the kinds of networks: these are typically ResNet-50-like backbones running on roughly 1000 by 1000 images, and then they have these heads with structures that make sense. Of course, we're doing this partly because we can't afford to have a neural network for every single task, because there are many, many tasks, almost 100 tasks, and so we have to amortize some of that computation by putting them on shared backbones.

So here are some examples of what these networks, which we call HydraNets because of their shared backbone and multiple heads, might look like. Is this video playing? It's not. Okay, I'm just going to go to the next video; that one was going to show you some lane line markings and so on. This is a
video showing you some road edges that we are interested in for the purposes of Smart Summon, because we have to understand where we can be in this environment, so we want to avoid the curbs in this case. Here we are making predictions in the image, and then we are of course casting them out and stitching them up across space and time to understand the layout of the scene around us. So here's an example of this occupancy grid: we're showing just the road edges and how they get projected, and the car winds its path around this parking lot while the person is summoning it, just trying to find its way towards the goal. So here's how things get stitched up.

Now, so far I've only talked about neural networks that run on independent images, but of course very quickly you run across tasks that actually have to be a function of multiple images at the same time. For example, if you're trying to estimate the depth of any of these images, it might actually be very helpful to have access to the other views of that same scene in order to predict the depth at every individual pixel. Or if you're trying to predict the road layout, or trying to steer the wheel, or something like that, you might actually need to borrow features from multiple other HydraNets. So what this looks like is: we have all of these different HydraNets for different cameras, but then you might want to pull in some of the features from these HydraNets and go to a second round of processing, optionally recurrent, and actually produce something like a road layout prediction.

This is an example of what a road layout prediction might look like for the Autopilot. Here we are plugging three cameras simultaneously into a neural network, and the network's predictions are no longer in image space; they are in top-down space. So we're looking at the predictions of this network; in particular, here we are showing some of the predictions related to the corridors that
are available in this parking lot, where the intersections are, and what the orientations of all of these things are. So the stitching up now doesn't happen in a C++ code base; the stitching up across space and time happens inside the recurrent neural network.

More generally, what our networks start to look like for all of these different tasks, and what we're converging on, is something like this. We have eight HydraNets for the eight cameras, and they all produce all kinds of intermediate predictions. In addition to that, the features from these HydraNets go into a second round of processing that's potentially recurrent, and then we have more outputs that are in a top-down view. What's special about this is that it is a pretty large single network, and every single task sub-samples parts of this network and trains just that small piece. So for example, we can be training an object detector on all the cameras, or a depth network, or a road layout network. All these tasks subsample the graph and train only that portion.

If you've trained recurrent neural networks on videos, you'll quickly notice that these are not trivial training workloads. As an example, if I want to unroll this graph in time and backpropagate through time, maybe we have eight cameras, we unroll for 16 time steps, and we use a batch size of, say, 32. Then we are going to be holding in memory 4,096 images and all of their activations in a single forward pass. So very quickly your typical distributed data parallel will break, because you can't hold this amount of activations in the memory of a single GPU or even a single node. So a lot of our training potentially has to combine some elements of distributed data parallel but also model parallelism and so on.

It also gets tricky in terms of training these networks, because the typical simplest case might be round-robin
training of different tasks: all the workers in the pool are training task 1, then task 2, 3, etc. That gets out of hand when you have a hundred tasks. So instead, what might make sense is to actually have a pool of tasks, where some of the workers are doing objects, some are doing road layout, some might be doing depth, and so on. These are all very heterogeneous workloads, but they coexist and they're training different pieces of the network at the same time. Then you can arrange them in a synchronous or asynchronous way, or play with this to really squeeze all the juice out of it.

All in all, training all of the neural networks for the Autopilot is actually a fairly expensive task. In particular, today we would train 48 different networks that make 1,000 different predictions, if you just count the number of output tensors, and it takes 70,000 GPU hours to train, to compile the Autopilot, at least the neural network stack. So if you had a single node with 8 GPUs, you would be training for a year. It's a lot of networks and a lot of predictions, a lot of them must work, and none of them can regress, ever. And you're not just training this once; you have to iterate on it. Of course there are researchers and engineers in the team that actually have to improve on this, and so, as you can imagine, we do a lot of neural network training at scale to get this to actually work.

We are also automating a lot of the workflows. It's not just about the neural network training itself but everything surrounding it. In particular, we have to calibrate all the different thresholds, and we have a process for that. We have a lot of in-the-loop validation, other types of validation, and evaluation to make sure that none of these 1,000 different predictions that we make can regress, and so on. And so the north star for the team, though, is that all of this can actually be
automated quite well. Starting with the dataset, you can train all the neural nets, you can do all the calibration and the evaluation, and you can really start to see the continuous integration of this. So the north star for the team is something we internally, somewhat jokingly, refer to as Operation Vacation. The idea is that as long as the data labeling team is around and they're curating and improving our datasets, then everything else can in principle be automated, and so we could actually go on vacation and the Autopilot improves by default. That's something that we really try to go towards in the team.

I would also like to talk a little bit about the inference aspect of this, because I talked quite a bit about training. As I mentioned, we have our own backend that our hardware team has developed. We call this the FSD computer. It offers about 144 int8 TOPS of capability. Compared to the GPUs that we were using before we introduced this chip, this is roughly an order of magnitude improvement at lower cost. So we use this in all the latest cars that are now coming out of the production line, and we target all the neural networks to these chips.

The last thing I wanted to briefly allude to: as you'll notice here on the bottom, we have a GPU cluster. The hardware team is also working on a project we call Dojo. Dojo is a neural network training computer and chip, and so we hope to do the exact same thing for training as we did for inference: basically, improve the efficiency by roughly an order of magnitude at a lower cost. But I'm not ready to talk about more details of that project just yet.

So in summary, I talked about the full lifecycle of developing these neural networks for the Autopilot and how we own everything in-house. The neural network is actually fairly complicated and large, and we deal with a lot of problems if we actually want to train the beast, but it's giving us some very interesting results. And the nice thing about this is
not only do we get to train really awesome large networks, but we also get to ship them. For example, Navigate on Autopilot has now accumulated 1 billion miles, we've confirmed 200,000 lane changes, and this is a global product across 50 countries or more now. So that's a lot of forward passes of neural networks out there. And with Smart Summon, this is actually a bit of an outdated number: we've now had 800,000 sessions of people trying to call their car to them. So it's incredible to work on such an interesting product.

Finally, I would like to thank the PyTorch team for being incredibly responsive and helpful, and for allowing us to develop all these networks, really train them at scale, and then deploy them in the real world. It has been a really interesting collaboration. Thank you.
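The memory argument in the talk (eight cameras, 16 unrolled time steps, batch size 32) can be checked with a few lines of arithmetic. The image count comes straight from the talk; the per-image activation size below is a hypothetical placeholder, included only to show how quickly the total outgrows a single GPU.

```python
def images_in_flight(cameras: int, timesteps: int, batch_size: int) -> int:
    """Number of images held in memory for one unrolled forward pass."""
    return cameras * timesteps * batch_size


def activation_gib(num_images: int, mib_per_image: float) -> float:
    """Total activation memory in GiB for a given per-image footprint in MiB."""
    return num_images * mib_per_image / 1024


images = images_in_flight(cameras=8, timesteps=16, batch_size=32)
print(images)  # 4096

# Hypothetical: 100 MiB of activations per image (not a figure from the talk).
print(activation_gib(images, mib_per_image=100.0))  # 400.0
```

Under that assumed footprint the activations alone would need about 400 GiB, which is consistent with the talk's point that plain data parallelism is not enough and some form of model parallelism is needed.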