Self-driving from VISION ONLY - Tesla's self-driving progress by Andrej Karpathy (Talk Analysis)

**Proof by Example: The New Vision-Only Stack Works Better**

The talk showcases a new vision-only stack that has been trained at scale and shows it holding up in real-world scenarios. Compared with the legacy sensor fusion stack, which combined radar with vision, the examples suggest that a vision-only approach is not only feasible but can outperform fusion. As the reviewer in the transcript below points out, this is "proof by one example", so the comparison is best read as illustrative rather than conclusive.

One key issue highlighted in the talk is that the legacy fusion stack can lose a lead vehicle that brakes very hard: the radar tracker fails to associate new measurements with the existing track, drops it, and re-initializes it repeatedly. The speaker also notes that radar is "trigger-happy" and sees false stationary objects everywhere, so it has to wait for vision to confirm them, and the fusion stack picks up real obstacles too late. In contrast, the vision-only stack recognizes objects much earlier and provides accurate depth and velocity.
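The track-drop failure can be illustrated with a toy tracker. The sketch below is not Tesla's tracker. It assumes a one-dimensional constant-velocity prediction with a fixed association gate, and the names `Track`, `associate`, `run_tracker` and the default `gate` value are hypothetical. When a measurement falls outside the gate, the tracker abandons the old track and starts a new one, which is the drop-and-reinitialize pattern described in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Track:
    track_id: int
    position: float  # metres ahead of the ego vehicle
    velocity: float  # metres per second, relative to the ego vehicle


def associate(track: Track, measurement: float, dt: float, gate: float) -> bool:
    """Return True if the measurement lies within the gate around the
    constant-velocity prediction for this track."""
    predicted = track.position + track.velocity * dt
    return abs(measurement - predicted) <= gate


def run_tracker(measurements: List[float], dt: float = 0.1, gate: float = 0.5) -> List[int]:
    """Return the track id assigned to each measurement.

    A new id means the previous track was dropped and re-initialized.
    """
    track: Optional[Track] = None
    next_id = 0
    ids: List[int] = []
    for z in measurements:
        if track is not None and associate(track, z, dt, gate):
            # Same object: update position and re-estimate velocity.
            velocity = (z - track.position) / dt
            track = Track(track.track_id, z, velocity)
        else:
            # Association failed (or no track yet): start a fresh track.
            track = Track(next_id, z, 0.0)
            next_id += 1
        ids.append(track.track_id)
    return ids


if __name__ == "__main__":
    smooth = [50.0 - 0.1 * i for i in range(10)]
    harsh = [50.0, 50.0, 50.0, 48.0, 46.0]
    print(run_tracker(smooth))
    print(run_tracker(harsh))
```

With the smooth sequence every measurement keeps the same id. With the abrupt sequence the ids come out as `[0, 0, 0, 1, 2]`, so the same vehicle appears to vanish and reappear. Widening the gate would hide this case but admit more false associations, which is the hand-tuning trade-off the talk argues against.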

The benefits of the new stack are twofold. First, it removes the hand-tuned radar tracking and association logic that has to run in real time, which is time-consuming to maintain and prone to errors; persistent tracking is instead learned at training time from auto-labeled data. Second, the vision-only approach simplifies the overall system architecture, leaving fewer places where the different error modes of radar and vision can conflict.
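The transcript below explains where the tracking effort goes instead: labels are produced offline with the benefit of hindsight, so a track that is briefly occluded can be stitched together from what was seen before and after. A minimal sketch of that idea follows, assuming per-frame depth observations with `None` for occluded frames and simple linear interpolation across the gap. The function name `stitch_track` and the interpolation choice are illustrative assumptions, not the offline tracker described in the talk.

```python
from typing import List, Optional


def stitch_track(depths: List[Optional[float]]) -> List[Optional[float]]:
    """Fill occluded frames (None) that lie between two observed frames.

    Gaps at the very start or end are left as None, because hindsight
    only helps when the object is seen both before and after the gap.
    """
    filled = list(depths)
    last_seen: Optional[int] = None  # index of the most recent observed frame
    for i, depth in enumerate(depths):
        if depth is None:
            continue
        if last_seen is not None and i - last_seen > 1:
            start = depths[last_seen]
            step = (depth - start) / (i - last_seen)
            # Interpolate linearly across the occluded frames.
            for j in range(last_seen + 1, i):
                filled[j] = start + step * (j - last_seen)
        last_seen = i
    return filled


if __name__ == "__main__":
    # The lead car is hidden for two frames, e.g. by a dust cloud.
    print(stitch_track([20.0, None, None, 23.0]))
```

An online system at the second frame cannot do this, because the observation at the fourth frame does not exist yet. That asymmetry is why the labels can be cleaner than anything the car computes in real time.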

**The Economics of Vision-Only Systems**

One significant advantage of a vision-only stack is lower cost and complexity compared with sensor fusion. According to the speaker, society develops camera sensors with far more rigor than radar systems, so cameras today are both better and cheaper. The reviewer adds that cheap cameras can be built into many vehicles to collect more data, which in turn improves the system.

As a result, a vision-only approach becomes economically attractive, especially when compared to the cost of integrating multiple sensors into a single system. The speaker emphasizes that this is not just an argument based on current technology, but also one that holds true in the future. By focusing on vision alone, developers can create more efficient and effective systems that don't rely on expensive and complex sensor fusion algorithms.

**Validation and Deployment**

The presentation provides insight into how the new stack was validated and deployed in real-world scenarios. The speaker touches upon the importance of thorough testing and validation, ensuring that the system performs well under various conditions. Additionally, they discuss strategies for rolling out the vision-only approach, including data collection and analysis to fine-tune the performance of the system.
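The transcript below describes the data collection step in terms of hand-programmed triggers, such as radar and vision disagreeing, or brake lights being detected while the acceleration is positive. A minimal sketch of that kind of rule-based filter follows. The `Frame` record, its field names, and the 5 m mismatch tolerance are assumptions made for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Frame:
    radar_depth_m: float           # distance to lead object according to radar
    vision_depth_m: float          # distance to lead object according to vision
    brake_lights_detected: bool    # vision sees brake lights on the lead vehicle
    lead_acceleration_mps2: float  # estimated acceleration of the lead vehicle


def radar_vision_mismatch(frame: Frame, tolerance_m: float = 5.0) -> bool:
    """Fire when the two sensors disagree by more than an assumed tolerance."""
    return abs(frame.radar_depth_m - frame.vision_depth_m) > tolerance_m


def brake_lights_but_accelerating(frame: Frame) -> bool:
    """Fire when brake lights are seen but the acceleration is positive."""
    return frame.brake_lights_detected and frame.lead_acceleration_mps2 > 0.0


# Registry of trigger name -> rule; new rules can be added without
# touching the filtering code.
TRIGGERS: Dict[str, Callable[[Frame], bool]] = {
    "radar_vision_mismatch": radar_vision_mismatch,
    "brake_lights_but_accelerating": brake_lights_but_accelerating,
}


def fired_triggers(frame: Frame) -> List[str]:
    """Return the names of all triggers that fire for this frame."""
    return [name for name, rule in TRIGGERS.items() if rule(frame)]


if __name__ == "__main__":
    boring = Frame(30.0, 31.0, False, 0.0)
    odd = Frame(30.0, 45.0, True, 1.2)
    print(fired_triggers(boring))  # nothing fires, frame is not uploaded
    print(fired_triggers(odd))     # both rules fire, frame is worth labeling
```

Only frames for which at least one trigger fires would be sent on for labeling, which keeps uneventful straight-road driving from dominating the training set.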

**A Q&A Session**

The presentation concludes with a Q&A session, where the speaker addresses questions from attendees. This opportunity allows viewers to delve deeper into specific aspects of the new stack and its implications, providing further insight into the vision-only approach and its potential applications.

In conclusion, the presentation makes the case for replacing a radar-and-vision sensor fusion stack with a vision-only stack. By learning persistent tracking from auto-labeled data rather than hand-tuning a real-time tracker, developers can build simpler systems that avoid expensive and complex sensor fusion. As the reviewer notes, the talk is "a bit of an ad", but it is still a compelling argument for why vision-only stacks are worth exploring further.

All right, hello everyone. Today we're going to look at Andrej Karpathy's CVPR talk about full self-driving mode in Tesla and what Tesla's been doing to push that beyond its current state. So let's just say that autonomous driving is a hard problem: you have to control a car and pretty much anything could happen. However, we're able to teach it to pretty much any human on the planet, so the problem is definitely solvable. Now, the current stack they have for full self-driving, or that they intended to use it seems like, is what they call sensor fusion, which is where you take a bunch of different signals, like camera signals and radar signals and so on, and you try to fuse their signals together. This kind of works, it seems, but it runs into problems such as: what do you do when the different sensors disagree? And it turns out solving that problem is quite hard, and that's why Tesla apparently is transitioning to a fully vision-only stack. Everything is going to be vision-based in Tesla full self-driving.

Now, today we're going to look at the best and important bits of the talk right here. I absolutely invite you to go watch the entire talk if you're interested; it is enjoyable in full length and it is on YouTube. Andrej gives a lot of good examples here, and the amount of effort that went into engineering this, into collecting the data, how this is deployed, is astounding. Now keep in mind this is the lead AI scientist for Tesla, so it is going to be a bit of an ad. However, it is pretty cool to see that we are actually making a real push towards full self-driving. A lot of people have been super salty, saying that Elon Musk has promised this like one or two years ago already, but come on, I mean, do you see anyone else doing fully self-driving at this level? No. So shut up.

So the first thing right here is a couple of scenarios of what Tesla is already doing, which is sort of a driver assistance. So if the person is driving but the system is relatively sure that the
person's making a mistake, the system kicks in, mostly to do automatic braking for the user. So I just want to show you this one example right here.

You start slow and probably, you know, does not actually enter the intersection. These are examples from pedal misapplication mitigation, PMM. Here a person is unparking from their driving spot and they are trying to turn, and then they mess up and they accidentally floor it. So they floor it right there.

So you see, like, the person wanted to brake but stepped on the gas. There are people right in front of the car. So be salty all you want, this right here is already worth it. So as a human there is a lot of resistance against fully self-driving, feeling that you're no longer in control anymore, but the matter of the fact is that these systems already are, and in the near future will be even much better than humans at driving. It's going to be much cleaner, much safer, much faster, less traffic jams and so on, to let the machines take over the driving, pretty much in the same way as it's much safer to let the machines take over the braking in these scenarios. The only times you're actually going to drive by hand is when you do it for fun. Now, I drive a motorbike, it's a lot of fun to drive, but in a car, especially with other people, or if I do it for work, if I may be a little bit tired: machines all the way.

So the full self-driving beta is rolled out to a small handful of customers right now, and they do upload YouTube videos every now and then of what they're doing, and it seems to work fairly, fairly well. Apparently they have had no crashes so far while driving about 1.7 million miles in full self-driving. You can see on the screen in the middle right here that the predictions that the system gives are pretty good, though we've also seen some other predictions that are not so good throughout YouTube. Like, there's this one video where the truck in front of the car has street lights on its back and the car just keeps thinking it's kind of red
lights. However, we don't know if this is the legacy stack or not, and if the car would actually brake, since the lights are not on red, but it's been a scare going around YouTube for a little bit.

So here Andrej shows a video of Waymo already doing this much earlier than Tesla, having sort of an automatic car drive around an intersection and so on. This works if you're in a really defined zone, let's say a city that you know, that you have accurate maps for. This does not work if you want to do this anywhere in the world. To do this anywhere in the world you need to rely on the car itself. That means you need a lot of data. So the data that this new system gets is just vision: it's eight cameras around the car and that's it. And Andrej makes a good case here that that is actually all you need: humans are able to navigate from this, and cars should be able to do the same.

So an absolutely necessary ingredient to train such a system is a good, clean, labeled dataset. If you just wanted to use humans to annotate every single frame of cars driving around, that would probably be prohibitively expensive even for Tesla. So they came up with what I think is a pretty cool method called auto labeling. Now, I'm sure they're not the inventors of this system, but to use it on this scale is very smart and it works out pretty nicely.

Of course we need to collect training data. A typical approach might be to use humans to annotate cars around us in three dimensions. What we found actually works really well is an auto labeling approach. So it's not just humans annotating cars; it's an offline tracker, as we call it, and it's an auto labeling process for collecting data at the scale that is necessary. So we need to get millions of hard examples. So this is where the scale comes from: it's not labeled purely by humans, although humans are involved, it's labeled automatically. So here's an example of some automatic labels we were able to derive for cars on the highway, and the way you do this is: because you are
offline and you are trying to just annotate a clip, you have a large number of benefits that you don't typically have if you're at test time under strict latency requirements in the car. So you can take your time to fully figure out exactly all the objects and what they are. You can use neural networks that are extremely heavy; they are not deployable for various reasons. You can use the benefit of hindsight, because you know the future, not just the past. You can use all kinds of expensive offline optimization and tracking techniques. You can use extra sensors; in this case, for example, radar was actually one of the sensors that we used for the auto labeling. But there's actually a massive difference between using radar at test time and using it in the offline tracker.

The point here is that if you record data and you're trying to figure out at inference time, like while you're driving, what's happening, it's a lot harder than if you have the same data but kind of at home in the lab. So what you want to do is you want to drive around and just record, not predict or anything, just record data, record from all your sensors. You can even stick expensive sensors on the cars where you collect the data. And then you take all that data and you use the biggest, heaviest processors you have to figure out what actually happened during that time. What he mentions here is the benefit of hindsight, which means that if you're in a car and you're driving and all of a sudden something obscures your vision, you will be sort of lost, because all you have... okay, you can maybe guess that a car in front of you is still there, but who knows, they might turn or something. Now, if you record the whole video sequence, you're able to see what happens beyond the obstruction of vision, and if you see the car is still there, you can make a good inference that the car was actually there the whole time, and therefore you can annotate that data with a label saying: hey, that car was there the whole time. You can also do active
learning and shell out to actual human annotators what you're not sure about. So this benefit of hindsight is really important here. When you're under the time constraint of not being able to see into the future, as well as the latency constraint, you have to have, like, an efficient neural network; in the lab you don't have any of this. The method here, if you're developing something real time, I mean, this might seem obvious to you, I found it to be pretty cool: yes, record, then figure out what happened, then use that as a labeled dataset.

So here's an example of how such a persistent track would look like after the neural network has been trained on data like this.

Here's some examples of really tricky scenarios. I don't actually know exactly what this is, but basically this car drops a bunch of debris on us and we maintain a consistent track for the label. And of course, if you have millions of labels like this, the neural net, if it's a powerful enough neural net, will actually end up learning to persist these tracks in these kinds of scenarios. Here's another example. There's a car in front of us. I actually am not 100% sure what happens in this case, but as you'll see, there's some kind of a dust cloud that develops here and briefly occludes the car. But in the auto labeling tool we are able to persist this track, because we saw it before and we saw it after, so we can actually stitch it up and use it as a training set for the neural net.

So that's how they get clean labels in an automatic or semi-automatic way, but they still need to get a lot of data from kind of edge cases, because most of driving is quite uneventful straight driving and was done 40 years ago or something like this. I think Schmidhuber in his GTC 21 talk talked about autonomous cars on highways, on controlled stretches of highways, super duper early already. So what we really need to collect is edge cases, and for collecting these edge cases Tesla has developed what they call triggers. So these are kind of hand-programmed
rules of what data should go into the annotation pipeline. So imagine all these cars driving around, not only the people with full self-driving: the detection, the actual recording of data, is activated in all the Tesla cars driving around. They all send that data back to the server. Of course that's way too much data, and also it's very unbalanced in terms of how many critical situations are in there. Again, most of it will be sort of straight road, empty, just drive straight. So what they do is they filter this data for these trigger events. Now these trigger events can be as simple as whenever the radar and the vision mismatch, so whenever they disagree on something, that's an interesting example. But, you know, it goes into very detailed ones, such as: we detect brake lights but the acceleration is positive. So with these triggers they're able to source a diverse set of training samples and edge cases where the neural network can learn the tricky situations, rather than just the long stretches of road. So I think it's safe to say that a good mark of quality on these systems is going to be how well these triggers are maintained, like how well do they represent the full driving experience of the end users of the cars. But so far, from the results we got, it seems like they cover the road situations fairly well.

And all of them are iteration, and you're looking at what's coming back, you're tuning your trigger, and you're sourcing data from all these scenarios. Basically, over the last four months we've done quite extensive data engine. We've ended up doing seven shadow modes and seven loops around this data engine here, where on the top right is where you begin: you have some seed dataset, you train your neural network on your dataset, and you deploy the neural network in the customer cars in shadow mode, and the network is silently making predictions.

By the way, if you, like, squint really hard, I don't know if this is just a depiction of a neural network or if this is the actual architecture
they're using. I don't think so, but there is, like, a stride of six in there and max pooling. You know, just noting that for no particular reason.

And then you have to have some mechanisms for sourcing inaccuracies of the neural net. You're just looking at its predictions, and then you're using one of these triggers, you're getting these scenarios where the network is probably misbehaving. Some of those clips end up going to unit tests, to make sure that even if we're failing right now, we make sure we pass later. And in addition, those examples are being auto labeled and incorporated into a training set. And then, as an asynchronous process, we're also always data cleaning the current training set. So we spin this loop over and over again until the network basically becomes incredibly good. So in total we've done seven rounds of shadow mode for this release.

So shadow mode is what they call it when they let the predictions run but they don't hook them up to the control. So you're driving yourself, but the system predicts all the time, and whenever one of these triggers happens, that's an interesting data point that it's going to send back to the server. Actually, let's be honest, it's probably going to send everything back to the server. So the dataset they come up with is 1.5 petabytes. Crazy. So next he's going to go into the architecture of the neural net, and this is also fairly interesting and not entirely standard.

...the layout of the synthetic visual cortex, in order to efficiently process this information. Our architecture roughly looks like this: we have these images coming from multiple cameras on the top. All of them are processed by an image extractor, like a backbone, think ResNet kind of style. Then there's a multi-cam fusion that uses the information from all the eight views, and this is kind of a transformer that we use to fuse this information. And then we fuse information first across all the cameras and then across all of
time, and that is also done either by a transformer, by a recurrent neural network, or just by three-dimensional convolutions. We've experimented with a lot of kinds of fusion strategies here to get this to work really well. And then what we have afterwards, after the fusion is done, is we have this branching structure that doesn't just consist of heads, but actually we've expanded this over the last year or so, where you now have heads that branch into trunks that branch into terminals. So there's a lot of branching structure, and the reason you want this branching structure is because there's a huge amount of outputs that you're interested in, and you can't afford to have a single neural network for every one of the individual outputs. You have to, of course, amortize the forward pass.

So this is pretty interesting. The top part here, what they call the backbone, is pretty standard. If you have a video, especially with multiple cameras, you want to extract information from each frame of each camera sort of individually, then you want to fuse that information across all the cameras for a single time step, and then you want to fuse that information with the information of all the other time steps. So far so good. That sort of gives you a representation of what happens in these frames, in these cameras, during that stretch of time. However, after that, usually, even if you have multiple predictions, what you would do is you would sort of have, like, one prediction head on top of that backbone. However, since they are in a car and have to decide real fast, it's not really feasible to have sort of these different columns for each of the prediction tasks, because, as he says, they're interested in a lot of different signals. Think depth prediction, which means that for every pixel you have to provide a depth estimation. Think tracks of other cars, think pedestrians, think street lights, think, okay, where are the lanes at, or navigation in general. So all these signals are things to predict, and it's not good
enough to have, like, a separate head for each of the predictions. So what they do is they have, as he calls them, these branching structures, where there are multiple heads, yes, and within these multiple heads there are what they call trunks, and within the trunks there are the individual little what they call terminals. So essentially it's a hierarchical prediction. I'm gonna guess that the tasks that go together sort of are grouped together. So maybe one head is for all the pixel prediction tasks and another head is more for the classification tasks, and then within one head you have a trunk that deals more with, like, object classification and another trunk that deals more with, like, navigation classification, and the individual terminals then do the actual tasks. This is a pretty cool way of getting a highly performant many-output network all together, such that its size and computational speed are still maintained.

The other nice benefit of the branching structure is that it decouples, at the terminals, all these signals. So if I'm someone working on velocity for a particular object type or something like that, I have a small piece of neural network that I can actually fine-tune without touching any of the other signals, and so I can work in isolation to some extent and actually get something to work pretty well. And then once in a while... so basically the iteration scheme is that a lot of people are fine-tuning, and once...

You know, you just gotta imagine the MLOps behind this. It's like: hey, where do you deploy your models? I do it on Kubernetes, I have MLflow. Oh no, I use TensorFlow Extended. Yeah, it's pretty cool. What do you do? Car. I deploy on car.

So next he is going into this in-house supercomputer that they built or are building, and this is a massive thing, absolutely massive. He says that in terms of flops it's something like the fifth biggest computer in the world. Its storage speed is incredible, so I'm pretty sure you could even actually render Far Cry 2 on this
thing, maybe. But in total it has 5,760 GPUs, and not just any GPUs: the most expensive A100 80-gigabyte GPUs. It would be interesting to see what kind of algorithms they use on top of this to actually do the distributed training, or whether it's all just kind of simple data parallelism, aggregating gradients and so on. Of course they have super fast interconnect, super fast storage, super fast everything, and it looks sweet. Like, is this a stock photo of a server room or is this the actual server room?

This effort basically is incredibly vertically integrated in the AI team. So as I showed you, we own the vehicle and the sensing, and we source our own data, and we annotate our own data, and we train on our on-prem cluster, and then we deploy all of the neural networks that we train on our in-house developed chip. So we have the FSD computer here that has two SoCs, has the chips here, and they have our own custom NPU, neural processing unit, here, at roughly 36 TOPS each. So these chips are specifically designed for the neural networks that we want to run for...

Yeah, I mean, this is the dream, right? If you're an AI professional, owning the whole pipeline is going to boost your productivity by so much. You're not bound by the constraint of anything other than the limits on the final system, which is a car, so fairly difficult. But in between of that you have control over everything. You have control over how the data is collected and annotated, you have control over where it is deployed to, on what architecture of chip, because you make the chip. So I guess the lesson is: if you're looking to change the world, you better own a good chunk of it.

So now he's going to show some examples of what this new vision-only stack could do. Remember, they used to do fusion of sensors, which means they essentially have radar, they have vision, maybe some other sensors, and they try to integrate this information from all of the sensors. They compare this to the new vision-based system. Now check out what happens in terms of the depth and velocity
predictions that we're able to achieve by putting all these pieces together and training these networks at scale. So the first example here: I have a video where this is on track testing, so this is an engineering car, and we asked it to slam on the brakes as hard as it possibly can. So this is a very harsh braking here in front of us. Even though it doesn't look like that in the videos, it's very harsh braking. So what you can see on the right here is you can see the outputs from the legacy stack, which had radar-vision fusion, and from the new stack, which is vision alone, in blue. So in the orange legacy stack you can actually see these track drops here when the car was braking really harshly. And basically the issue is that the braking was so harsh that the radar stack that we have actually ended up not associating the car and dropping the track and then reinitializing it all the time. And so it's as if the vehicle disappeared and reappeared, like, six times during the period of this braking, and so this created a bunch of artifacts here. But we see that the new stack in blue is actually not subject to this behavior at all; it just gives a clean signal. In fact, here there's no smoothing, I believe, on the blue signal here. This is the raw depth and velocity that comes out from the neural net, the final neural net that we released with about three weeks ago, and you can see that it's fairly smooth here. And of course you could go into the radar stack and you could, you know, adjust the hyperparameters of the tracker, like, why is it dropping tracks and so on. But then you are spending engineering efforts and focus on a stack that is, like, not really barking up the right tree. And so it's better to again just focus on the vision and make it work really well, and we see that it is much more robust when you train it at scale.

So there you have it: proof by one example that the new thing works better. Isn't that every CVPR paper ever? But no, in any case, I can totally believe that the new stack, even
though it drops a bunch of the sensors, is better, because ultimately if your one sensor, if vision, is so performant that in every single disagreement you go with the vision thing, then why do you have the other sensors at all? The thing in front of it is just kind of braking too fast, so the radar kind of loses it and then regains it and loses it and regains it. Now, I have no idea how radar works, so I'm speaking from complete ignorance right here, but what I'm going to guess, as far as I understand it, is that radar just kind of gives you the velocities of stuff in front of you, and then there is a tracking algorithm on top of radar that tries to figure out which stuff is the same stuff. And this is very much what they do in this auto labeling, where they have sort of a track on something, right, and then they use hindsight, and then they have a tracking algorithm that decides which things are the same, even though we don't see them all the time. And here you can clearly see the benefit of shifting this from inference time, which is what you have to do with radar, to the training time, which is what you can do with vision. So you can teach the vision system to sort of do this persistent tracking, whereas the radar system you have to hand-tune to do this in real time. Now he makes the point that of course you could go into the radar system and change the hyperparameters, but then he says: why bark up the wrong tree, why waste time on a stack that isn't functioning well? It's a bit of a chicken-and-egg problem, right? If you were to put as much effort into the radar stack as you were into the vision system, I'm going to guess that these results would go away and that it's able to keep up, maybe. But the argument for going vision-only is a strong one, and I don't doubt that it is probably a good way forward.

And basically what's happening here is that the radar is very trigger-happy and it sees all these false stationary objects everywhere. Like, everything that sticks out is a stationary target,
and radar by itself doesn't know what actually is a stationary car and what isn't, so it's waiting for vision to associate with it. And vision, if it's not held up to a high enough bar, is noisy and contributes sort of error, and the sensor fusion stack just kind of, like, picks it up too late. So again, you could fix all that, even though it's a very gross system with a lot of if statements and so on, because the sensor fusion is complicated, because the error modes for vision and radar are slightly, are quite different. But here, when we just work with vision alone and we take out the radar, vision recognizes this object very early, gives the correct depth and velocity, and there's no issues. So we actually get an initial slowdown much earlier, and we've really, like, simplified the stack a lot.

Yeah, so here you can see the same failure mode in vision, that it kind of gets a track but doesn't, gets a track but doesn't. The important part is that once you get closer to the object it is fairly consistent, right? As you can see right here, the vision stack recognizes this truck on the side much earlier than the radar stack did. Now again, this might just be a function of the hyperparameters used. I'm sure you could just lower the threshold for the radar, but you'd run into different problems.

Now, during the Q&A he makes a good point in that, yes, other sensors would be nice to have, but just the pure economics speak in favor of vision too.
like we develop cameras with much more rigor as a society than we do radar systems and therefore the camera sensors are just so much better nowadays and cheaper so you can afford to build many of them into all kinds of things and collect data and make your systems better through that than to put kind of a lidar on top of a car and having to sort of fuse those signals with the visual signals especially when they're in conflict with one another so if you ask me i'm a fan i like what i see here even though i know it's kind of an ad i don't own a tesla but i think it's still pretty cool so in the end they talks a bit about what they do to validate this data and how they roll it out and gives a bunch of more examples of tracking and there's a q a at the end so if you are interested in that i absolutely welcome you to go watch the entire talk it is on youtube and that was it for me i hope you enjoyed this and i'll see you next time ciaoall right hello everyone today we're going to look at andre karpathy's cvpr talk about full self-driving mode in tesla and what tesla's been doing to push that beyond its current state so let's just say that autonomous driving is a hard problem you have to control a car and pretty much anything could happen however we're able to teach it to pretty much any human on the planet so the problem is definitely solvable now the current stack they have for full self-driving or that they intended to use it seems like is what they call sensor fusion which is where you take a bunch of different signals like camera signals and radar signals and so on and you try to fuse their signals together this kind of works it seems but it runs into problems such as what do you do when the different sensors disagree and it turns out solving that problem is quite hard and that's why tesla apparently is transitioning to a fully only vision stack everything is going to be vision based in tesla full self driving now today we're going to look at the best and important 
bits of the talk right here now absolutely invite you to go watch the entire talk if you're interested it is enjoyable in full length and it is on youtube andre gives a lot of good examples here and the amount of effort that went into engineering this into collecting the data how this is deployed is astounding now keep in mind this is the lead ai scientist for tesla so it is going to be a bit of an ad however it is pretty cool to see that we are actually making a real push towards full self driving a lot of people have been super salty saying that elon musk has promised this like one or two years ago already but come on i mean do you see anyone else doing fully self-driving at this level no so shut up so the first thing right here is a couple of scenarios of what tesla is already doing which is sort of a driver assistance so if the person is driving but the system is relatively sure that the person's making a mistake the system kicks in mostly to do automatic braking for the user so i just i want to show you this one example right here you start slow and probably you know does not actually enter the intersection uh these are examples from pedal misapplication mitigation pmm here a person is unparking from their driving spot and they are trying to turn and then they mess up and they accidentally floor it so they floor it right there so you see like the person wanted to break but stepped on the gas there are people right in front of the car so be salty all you want this right here is already worth it so as a human there is a lot of resistance against fully self-driving feeling that you're no longer in control anymore but the matter of the fact is that these systems already are and in the near future will be even much more better than humans at driving it's going to be much cleaner much safer much faster less traffic jams and so on to let the machines take over the driving pretty much in the same way as it's much safer to let the machines take over the braking in 
these scenarios. The only times you're actually going to drive by hand is when you do it for fun. Now, I drive a motorbike, and it's a lot of fun to drive, but in a car, especially with other people, or if I do it for work, or if I may be a little bit tired: machines all the way.

So the Full Self-Driving beta is rolled out to a small handful of customers right now, and they do upload YouTube videos every now and then of what they're doing, and it seems to work fairly well. Apparently they have had no crashes so far while driving about 1.7 million miles in full self-driving. You can see on the screen in the middle right here that the predictions the system gives are pretty good, though we've also seen some other predictions that are not so good throughout YouTube. There's this one video where the truck in front of the car has street lights on its back and the car just keeps thinking they're red lights. However, we don't know if this is the legacy stack or not, and whether the car would actually brake, since the lights are not on red, but it's been a scare going around YouTube for a little bit.

So here Andrej shows a video of Waymo already doing this much earlier than Tesla, having an automatic car drive around an intersection and so on. This works if you're in a really well-defined zone, let's say a city that you have accurate maps for. This does not work if you want to do this anywhere in the world. To do this anywhere in the world, you need to rely on the car itself, and that means you need a lot of data. The data that this new system gets is just vision: it's eight cameras around the car and that's it. Andrej makes a good case here that that is actually all you need; humans are able to navigate from this, and cars should be able to do the same.

So an absolutely necessary ingredient to train such a system is a good, clean, labeled data set. If you just wanted to use humans to annotate every single frame of cars driving around, that would probably be prohibitively expensive
even for Tesla. So they came up with what I think is a pretty cool method called auto labeling. Now, I'm sure they're not the inventors of this system, but to use it on this scale is very smart, and it works out pretty nicely.

Of course we need to collect training data. A typical approach might be to use humans to annotate cars around us in three dimensions. What we found actually works really well is an auto labeling approach. So it's not pure humans just annotating cars; it's an offline tracker, as we call it, and it's an auto labeling process for collecting data at the scale that is necessary. We need to get millions of hard examples, so this is where the scale comes from: it's not labeled purely by humans, although humans are involved; it's labeled automatically. Here's an example of some automatic labels we were able to derive for cars on the highway. The way you do this is, because you are offline and you are trying to just annotate a clip, you have a large number of benefits that you don't typically have if you're at test time under strict latency requirements in the car. You can take your time to fully figure out exactly all the objects and what they are. You can use neural networks that are extremely heavy, that are not deployable for various reasons. You can use the benefit of hindsight, because you know the future, not just the past. You can use all kinds of expensive offline optimization and tracking techniques. You can use extra sensors; in this case, for example, radar was actually one of the sensors that we used for the auto labeling. But there's actually a massive difference between using radar at test time and using it in the offline tracker.

The point here is that if you record data and you're trying to figure out at inference time, while you're driving, what's happening, it's a lot harder than if you have the same data but at home in the lab. So what you want to do is drive around and just record, not predict or anything, just record data,
record from all your sensors. You can even stick expensive sensors on the cars where you collect the data. Then you take all that data and you use the biggest, heaviest processors you have to figure out what actually happened during that time. What he mentions here is the benefit of hindsight. If you're in a car and you're driving and all of a sudden something obscures your vision, you will be sort of lost, because all you have is, okay, you can maybe guess that a car in front of you is still there, but who knows, they might turn or something. Now, if you record the whole video sequence, you're able to see what happens beyond the obstruction of vision, and if you see the car is still there, you can make a good inference that the car was actually there the whole time. Therefore you can annotate that data with a label saying, hey, that car was there the whole time. You can also do active learning and shell out to actual human annotators what you're not sure about. So this benefit of hindsight is really important here. When you're under the time constraint of not being able to see into the future, as well as the latency constraint, you have to have an efficient neural network. In the lab you don't have any of this. The method here, if you're developing something real time, I mean, this might seem obvious to you; I found it to be pretty cool. Yes: record, then figure out what happened, then use that as a labeled data set.

So here's an example of how such a persistent track would look after the neural network has been trained on data like this.

Here are some examples of really tricky scenarios. I don't actually know exactly what this is, but basically this car drops a bunch of debris on us, and we maintain a consistent track for the label. And of course, if you have millions of labels like this, the neural net, if it's a powerful enough neural net, will actually end up learning to persist these tracks in these kinds of scenarios. Here's another example: there's a car in front of us.
I actually am not 100% sure what happens in this case, but as you'll see, there's some kind of a dust cloud that develops here and briefly occludes the car. But in the auto labeling tool we are able to persist this track, because we saw it before and we saw it after, so we can actually stitch it up and use it as a training set for the neural net.

So that's how they get clean labels in an automatic or semi-automatic way. But they still need to get a lot of data from edge cases, because most of driving is quite uneventful straight driving, and that was done 40 years ago or something like this. I think Schmidhuber, in his GTC 21 talk, talked about autonomous cars on controlled stretches of highway super duper early already. So what we really need to collect is edge cases, and for collecting these edge cases Tesla has developed what they call triggers. These are hand-programmed rules for what data should go into the annotation pipeline. Imagine all these cars driving around: not only for the people with Full Self-Driving, but the actual recording of data is activated in all the Tesla cars driving around, and they all send that data back to the server. Of course, that's way too much data, and it's also very unbalanced in terms of how many critical situations are in there. Again, most of it will be straight road, empty, just drive straight. So what they do is filter this data for these trigger events. These trigger events can be as simple as whenever the radar and the vision mismatch, so whenever they disagree on something, that's an interesting example. But it goes into very detailed ones, such as: we detect brake lights but the acceleration is positive. With these triggers they're able to source a diverse set of training samples and edge cases where the neural network can learn the tricky situations rather than just the long stretches of road. So I think it's safe to say that a good mark of quality on these systems is going to be how well
these triggers are maintained, like how well they represent the full driving experience of the end users of the cars. But so far, from the results we got, it seems like they cover the road situations fairly well.

And all of them are iteration: you're looking at what's coming back, you're tuning your trigger, and you're sourcing data from all these scenarios. Basically, over the last four months we've done quite extensive data engine. We've ended up doing seven shadow modes and seven loops around this data engine here, where on the top right is where you begin. You have some seed data set, you train your neural network on your data set, and you deploy the neural network in the customer cars in shadow mode, and the network is silently making predictions.

By the way, if you squint really hard, I don't know if this is just a depiction of a neural network or if this is the actual architecture they're using. I don't think so, but there is like a stride of six in there, and max pooling. Just noting that for no particular reason.

And then you have to have some mechanisms for sourcing inaccuracies of the neural net. You're just looking at its predictions, and then you're using one of these triggers, and you're getting these scenarios where the network is probably misbehaving. Some of those clips end up going to unit tests, to make sure that even if we're failing right now, we pass later. In addition, those examples are being auto labeled and incorporated into a training set. And then, as an asynchronous process, we're also always data cleaning the current training set. So we spin this loop over and over again until the network basically becomes incredibly good. In total we've done seven rounds of shadow mode for this release.

So shadow mode is what they call it when they let the predictions run but don't hook them up to the controls. You're driving yourself, but the system predicts all the time, and whenever one of these triggers happens, that's an interesting data
point that it's going to send back to the server. Actually, let's be honest, it's probably going to send everything back to the server. So the data set they come up with is 1.5 petabytes. Crazy.

So next he's going to go into the architecture of the neural net, and this is also fairly interesting and not entirely standard.

... the layout of the synthetic visual cortex, in order to efficiently process this information. Our architecture roughly looks like this. We have these images coming from multiple cameras on the top. All of them are processed by an image extractor, like a backbone, think ResNet kind of style. Then there's a multi-cam fusion that uses the information from all the eight views, and this is kind of a transformer that we use to fuse this information. We fuse information first across all the cameras and then across all of time, and that is also done either by a transformer, by a recurrent neural network, or just by three-dimensional convolutions. We've experimented with a lot of fusion strategies here to get this to work really well. And then what we have afterwards, after the fusion is done, is this branching structure that doesn't just consist of heads. We've actually expanded this over the last year or so, where you now have heads that branch into trunks that branch into terminals. So there's a lot of branching structure, and the reason you want this branching structure is because there's a huge amount of outputs that you're interested in, and you can't afford to have a single neural network for every one of the individual outputs. You have to, of course, amortize the forward pass.

So this is pretty interesting. The top part here, what they call the backbone, is pretty standard. If you have a video, especially with multiple cameras, you want to extract information from each frame of each camera individually, then you want to fuse that information across all the cameras for a single
time step, and then you want to fuse that information with the information of all the other time steps. So far so good. That gives you a representation of what happens in these frames, in these cameras, during that stretch of time. However, after that, usually, even if you have multiple predictions, what you would do is sort of have one prediction head on top of that backbone. However, since they are in a car and have to decide really fast, it's not really feasible to have these different columns for each of the prediction tasks, because, as he says, they're interested in a lot of different signals. Think depth prediction, which means that for every pixel you have to provide a depth estimate. Think tracks of other cars, think pedestrians, think street lights, think, okay, where are the lanes at, or navigation in general. All these signals are things to predict, and it's not good enough to have a separate head for each of the predictions. So what they do is they have, as he calls them, these branching structures, where there are multiple heads, yes, and within these multiple heads there are what they call trunks, and within the trunks there are the individual little things they call terminals. So essentially it's a hierarchical prediction. I'm going to guess that the tasks that go together are grouped together, so maybe one head is for all the pixel prediction tasks and another head is more for the classification tasks, and then within one head you have a trunk that deals more with object classification and another trunk that deals more with navigation classification, and the individual terminals then do the actual tasks. This is a pretty cool way of getting a highly performant, many-output network all together, such that its size and computational speed are still maintained.

The other nice benefit of the branching structure is that it decouples, at the terminals, all these signals. So if I'm someone working on velocity for a particular
object type or something like that, I have a small piece of neural network that I can actually fine-tune without touching any of the other signals. So I can work in isolation to some extent and actually get something to work pretty well. And then once in a while, so basically the iteration scheme is that a lot of people are fine-tuning, and once, you know...

You just gotta imagine the MLOps behind this. It's like: hey, where do you deploy your models? I do it on Kubernetes, I have MLflow. Oh no, I use TensorFlow Extended. Yeah, it's pretty cool. What do you do? Car. I deploy on car.

So next he is going into this in-house supercomputer that they built, or are building, and this is a massive thing, absolutely massive. He says that in terms of FLOPS it's something like the fifth biggest computer in the world. Its storage speed is incredible, so I'm pretty sure you could even actually render Far Cry 2 on this thing, maybe. In total it has 5,760 GPUs, and not just any GPUs: the most expensive A100 80-gigabyte GPUs. It would be interesting to see what kind of algorithms they use on top of this to actually do the distributed training, or whether it's all just simple data parallelism, aggregating gradients and so on. Of course they have super fast interconnect, super fast storage, super fast everything, and it looks sweet. Like, is this a stock photo of a server room, or is this the actual server room?

This effort basically is incredibly vertically integrated in the AI team. As I showed you, we own the vehicle and the sensing, and we source our own data, and we annotate our own data, and we train on our on-prem cluster, and then we deploy all of the neural networks that we train on our in-house developed chip. So we have the FSD computer here that has two SoCs, has the chips here, and they have our own custom NPU, neural processing unit, here, at roughly 36 TOPS each. So these chips are specifically designed for the neural networks that we want to run for...

Yeah, I mean, this is the dream, right? If you're an AI
professional, owning the whole pipeline is going to boost your productivity by so much. You're not bound by the constraints of anything other than the limits on the final system, which is a car, so fairly difficult. But in between, you have control over everything: you have control over how the data is collected and annotated, and you have control over where it is deployed to and on what architecture of chip, because you make the chip. So I guess the lesson is: if you're looking to change the world, you better own a good chunk of it.

So now he's going to show some examples of what this new vision-only stack can do. Remember, they used to do fusion of sensors, which means they essentially have radar, they have vision, maybe some other sensors, and they try to integrate the information from all of the sensors. They compare this to the new vision-based system.

Now check out what happens in terms of the depth and velocity predictions that we're able to achieve by putting all these pieces together and training these networks at scale. The first example here, I have a video where this is on track testing. This is an engineering car, and we asked it to slam on the brakes as hard as it possibly can. So this is very harsh braking here in front of us; even though it doesn't look like that in the video, it's very harsh braking. What you can see on the right here is the outputs from the legacy stack, which had radar-vision fusion, in orange, and from the new stack, which is vision alone, in blue. In the orange legacy stack you can actually see these track drops here when the car was braking really harshly. Basically the issue is that the braking was so harsh that the radar stack that we have actually ended up not associating the car and dropping the track and then reinitializing it all the time. So it's as if the vehicle disappeared and reappeared like six times during the period of this braking, and this created a bunch of artifacts here. But we see that the new stack in blue is
actually not subject to this behavior at all; it just gives a clean signal. In fact, here there's no smoothing, I believe, on the blue signal; this is the raw depth and velocity that comes out from the neural net, the final neural net that we released about three weeks ago, and you can see that it's fairly smooth here. Of course, you could go into the radar stack and you could, you know, adjust the hyperparameters of the tracker, like why is it dropping tracks and so on. But then you are spending engineering effort and focus on a stack that is not really barking up the right tree. So it's better to again just focus on the vision and make it work really well, and we see that it is much more robust when you train it at scale.

So there you have it: proof by one example that the new thing works better. Isn't that every CVPR paper ever? But no, in any case, I can totally believe that the new stack, even though it drops a bunch of the sensors, is better, because ultimately, if your one sensor, if vision, is so performant that in every single disagreement you go with the vision thing, then why do you have the other sensors at all? The thing in front of it is just braking too fast, so the radar kind of loses it and then regains it and loses it and regains it. Now, I have no idea how radar works, so I'm speaking from complete ignorance right here, but what I'm going to guess, as far as I understand it, is that radar just gives you the velocities of stuff in front of you, and then there is a tracking algorithm on top of radar that tries to figure out which stuff is the same stuff. This is very much what they do in the auto labeling, where they have a track on something, right, and then they use hindsight, and then they have a tracking algorithm that decides which things are the same even though we don't see them all the time. Here you can clearly see the benefit of shifting this from inference time, which is what you have to do with radar, to training time,
which is what you can do with vision. So you can teach the vision system to do this persistent tracking, whereas the radar system you have to hand-tune to do this in real time. Now, he makes the point that of course you could go into the radar system and change the hyperparameters, but then he says, why bark up the wrong tree? Why waste time on a stack that isn't functioning well? It's a bit of a chicken-and-egg problem, right? If you were to put as much effort into the radar stack as you put into the vision system, I'm going to guess that these results would go away and that it would be able to keep up, maybe. But the argument for going vision-only is a strong one, and I don't doubt that it is probably a good way forward.

And basically what's happening here is that the radar is very trigger-happy and it sees all these false stationary objects everywhere. Like, everything that sticks out is a stationary target, and radar by itself doesn't know what actually is a stationary car and what isn't. So it's waiting for vision to associate with it, and vision, if it's not held up to a high enough bar, is noisy and contributes error, and the sensor fusion stack just kind of picks it up too late. So again, you could fix all that, even though it's a very gross system with a lot of if statements and so on, because the sensor fusion is complicated, because the error modes for vision and radar are quite different. But here, when we just work with vision alone and we take out the radar, vision recognizes this object very early, gives the correct depth and velocity, and there are no issues. So we actually get an initial slowdown much earlier, and we've really simplified the stack a lot.

Yeah, so here you can see the same failure mode in vision, that it kind of gets a track but doesn't, gets a track but doesn't. The important part is that once you get closer to the object, it is fairly consistent, right? As you can see right here, the vision stack recognizes this truck on the side much
earlier than the radar stack did. Now again, this might just be a function of the hyperparameters used. I'm sure you could just lower the threshold for the radar, but you'd run into different problems.

Now, during the Q&A he makes a good point in that yes, other sensors would be nice to have, but just the pure economics speak in favor of vision too. Like, we develop cameras with much more rigor as a society than we do radar systems, and therefore the camera sensors are just so much better nowadays, and cheaper. So you can afford to build many of them into all kinds of things and collect data and make your systems better through that, rather than putting a lidar on top of a car and having to fuse those signals with the visual signals, especially when they're in conflict with one another.

So if you ask me, I'm a fan. I like what I see here, even though I know it's kind of an ad. I don't own a Tesla, but I think it's still pretty cool. In the end he talks a bit about what they do to validate this data and how they roll it out, and gives a bunch more examples of tracking, and there's a Q&A at the end. So if you are interested in that, I absolutely welcome you to go watch the entire talk; it is on YouTube. That was it for me. I hope you enjoyed this, and I'll see you next time. Ciao.
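
As a footnote to the auto labeling part of the talk: the "benefit of hindsight" idea, where a track is persisted through an occlusion because the object was seen before and after it, can be illustrated with a very small sketch. This is not Tesla's offline tracker, whose details the talk does not give. It is a toy that assumes a single object, a per-frame depth in meters, `None` for frames where the object was occluded, and plain linear interpolation across the gap. The function name `stitch_track` and the sample values are hypothetical.

```python
from typing import List, Optional


def stitch_track(depths: List[Optional[float]]) -> List[Optional[float]]:
    """Fill occluded frames (None) using hindsight.

    A gap is filled by linear interpolation between the last frame where
    the object was seen before the gap and the first frame where it is
    seen again after it. Gaps at the very start or end of the clip stay
    None, because there is no observation on both sides to stitch between.
    """
    filled = list(depths)
    last_seen = None  # index of the most recent frame with an observation

    for i, depth in enumerate(depths):
        if depth is None:
            continue  # occluded frame: wait until the object reappears
        if last_seen is not None and i - last_seen > 1:
            # The object was seen at last_seen and again at i: stitch the gap.
            start, end = depths[last_seen], depth
            span = i - last_seen
            for j in range(last_seen + 1, i):
                t = (j - last_seen) / span
                filled[j] = start + t * (end - start)
        last_seen = i

    return filled


if __name__ == "__main__":
    # Hypothetical clip: a lead car briefly hidden by a dust cloud.
    observed = [30.0, 28.0, None, None, 22.0, 20.0]
    print(stitch_track(observed))
```

The point of the sketch is only that this fill-in step needs the future frames, which is why it can run offline to produce labels but not in the car at inference time. A real pipeline would presumably use far heavier models and optimization than linear interpolation.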