C4W3L01 Object Localization

**Object Detection and Localization with Neural Networks**

In this article, we explore object detection and localization with neural networks. Object detection is a computer-vision task in which an image is scanned to find one or more objects; the goal is not only to identify each object's class but also to locate its position and size. This article focuses on the simpler classification-with-localization setting, where the image is assumed to contain at most one object of interest; detection, covered later in the course, generalizes this to multiple objects.

**The Problem Statement**

Given an input image, we want to detect whether an object is present and, if so, identify its class (e.g., pedestrian, car, motorcycle) and provide a bounding box that encloses it. The problem can be formulated as follows: let x be the input image and y the target output of the network. The first component of y, p_c, is the probability that an object is present; b_x, b_y, b_h, b_w parameterize the bounding box (the midpoint coordinates plus the box height and width, all relative to the image, with (0, 0) at the upper-left corner and (1, 1) at the lower-right); and c_1, c_2, c_3 are binary class indicators (pedestrian, car, or motorcycle), of which at most one equals 1. The goal is to minimize a loss function between the network's predictions and these labels over all training examples.
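To make the layout of the target vector concrete, here is a minimal NumPy sketch. The variable names are ours, and the numeric values are the rough car example from the lecture (box midpoint about half way across and 70% of the way down the image, box height about 0.3 and width about 0.4 of the image).

```python
import numpy as np

# Target label layout: y = [p_c, b_x, b_y, b_h, b_w, c1, c2, c3]
# Coordinates are relative to the image: (0, 0) is the upper-left corner,
# (1, 1) the lower-right. (b_x, b_y) is the box midpoint; b_h and b_w are
# its height and width as fractions of the image.

# Example: an image containing a car roughly in the lower-middle of the frame.
y_car = np.array([
    1.0,           # p_c: an object is present
    0.5, 0.7,      # b_x, b_y: box midpoint
    0.3, 0.4,      # b_h, b_w: box height and width
    0.0, 1.0, 0.0  # c1 (pedestrian), c2 (car), c3 (motorcycle)
])
```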

**Training Set Construction**

To train a neural network for object detection and localization, we need to construct a training set of input images x and corresponding output labels y. For each image in the training set, we follow these steps (a short label-construction sketch follows the list):

* If there is an object present in the image (p_c = 1), then:

+ Set b_x, b_y, b_h, b_w to the bounding box parameters of the object.

+ Assign the class label so that exactly one of c_1, c_2, c_3 equals 1 (pedestrian, car, or motorcycle).

* If there is no object present in the image (p_c = 0), then:

+ The bounding box components are "don't cares": they are ignored by the loss, so any placeholder values (e.g., [0, 0, 0, 0]) can be used.

+ The class components c_1, c_2, c_3 are likewise "don't cares" and can be left as placeholders (e.g., all 0).
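A small helper for constructing such labels might look like the following; the function name and argument layout are illustrative choices, not part of the lecture or any standard API.

```python
import numpy as np

NUM_CLASSES = 3  # pedestrian, car, motorcycle

def build_label(box=None, class_id=None):
    """Build an 8-dimensional target vector y = [p_c, b_x, b_y, b_h, b_w, c1, c2, c3].

    box:      (b_x, b_y, b_h, b_w) in relative image coordinates, or None
              if the image contains no object.
    class_id: 0 (pedestrian), 1 (car), or 2 (motorcycle) when box is given.
    """
    y = np.zeros(5 + NUM_CLASSES)
    if box is not None:
        y[0] = 1.0             # p_c = 1: an object is present
        y[1:5] = box           # bounding box parameters
        y[5 + class_id] = 1.0  # one-hot class indicator
    # If box is None, p_c stays 0 and the remaining components are
    # "don't cares"; zeros are a convenient placeholder.
    return y

# Usage: a car example and a background example.
y_car = build_label(box=(0.5, 0.7, 0.3, 0.4), class_id=1)
y_background = build_label()
```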

**Loss Function**

The loss function used to train the neural network is defined in two cases, depending on whether an object is present:

* For images that contain an object (y_1 = p_c = 1), the loss is the squared error summed over all eight output components: L(ŷ, y) = (ŷ_1 − y_1)^2 + (ŷ_2 − y_2)^2 + ... + (ŷ_8 − y_8)^2.

* For images that contain no object (y_1 = p_c = 0), the loss is just (ŷ_1 − y_1)^2; the remaining seven components are "don't cares" and do not contribute.

Using squared error for every component keeps the description simple and works reasonably well in practice. A common refinement is to use a log-likelihood (softmax cross-entropy) loss for the class outputs c_1, c_2, c_3, squared error for the bounding box coordinates, and a logistic regression loss for p_c.
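A minimal sketch of the squared-error version of this loss, written in NumPy (the function name is ours):

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared-error loss for classification with localization.

    y and y_hat are 8-dimensional vectors [p_c, b_x, b_y, b_h, b_w, c1, c2, c3].
    If p_c = 1, penalize squared error over all eight components;
    if p_c = 0, penalize only the squared error on p_c itself.
    """
    if y[0] == 1:
        return np.sum((y_hat - y) ** 2)
    return (y_hat[0] - y[0]) ** 2
```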

**Neural Network Architecture**

The neural network architecture used for object detection and localization typically consists of several convolutional layers followed by fully connected layers. The input image is passed through the convolutional layers to extract features, which the fully connected layers then map to the final output: a small set of real numbers, here eight, encoding p_c, the four bounding box parameters, and the three class scores.
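As a rough illustration of this kind of network, here is a small PyTorch sketch. The specific layer sizes and the assumed 64x64 input resolution are hypothetical choices for the example, not the configuration used in the lecture.

```python
import torch
import torch.nn as nn

class LocalizationNet(nn.Module):
    """ConvNet that outputs 8 numbers: [p_c, b_x, b_y, b_h, b_w, c1, c2, c3]."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(  # convolutional feature extractor
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(  # fully connected output head
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, 5 + num_classes),  # p_c, 4 box params, class scores
        )

    def forward(self, x):
        return self.head(self.features(x))

# Usage: a batch of two 64x64 RGB images -> a (2, 8) output tensor.
model = LocalizationNet()
out = model(torch.randn(2, 3, 64, 64))
```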

**Future Developments**

Object detection and localization with neural networks have many applications in computer vision and beyond. Future research directions include exploring new architectures, improving performance, and applying these techniques to more complex problems. In the next video, we will see other places in computer vision where having a neural network output a set of real numbers, almost as a regression task, can be very powerful.

"WEBVTTKind: captionsLanguage: enhello and welcome back this week you learn about object detection this is one of the areas of computer vision that's just exploding and it's working so much better than just a couple years ago in order to build up to object detection you first learn about object localization let's start by defining what that means you're already familiar with the image classification task where an algorithm looks at this picture and might be responsible for saying this is a car so that was classification the problem you learn to build a neural network to address later in this video is classification with localization which means not only do you have to label this as say a car but the algorithm also is responsible for putting a bounding box or drawing a red rectangle around the position of the car in the image so that's called the classification with localization problem where the term localization refers to figuring out where in the picture is the car you've detected later this week you then learn about the detection problem where now there might be multiple objects in the picture and you have to detect them all and localize them all and if you're doing this for an autonomous driving application then you might need to detect not just other cause but maybe other pedestrians and motorcycles and maybe even other objects so you see that later this week so in the terminology we'll use this week the classification and the classification of localization problems usually have one object usually one big object in the middle of the image that you're trying to recognize or recognize and localize in contrast in the detection problem there can be multiple objects and in fact maybe even multiple objects of different categories within a single image so the idea is you've learned about for image classification will be useful for classification with localization and then the idea is you learn for localization will then turn out to be useful for detection so let's start by talking about with localization you're already familiar with the image classification problem in which you might input a picture into a confident with multiple layers so there's a confident and this results in and this results in a vector features that is fed to maybe a soft max unit that outputs the predicted class so if you are building a self-driving car maybe your object categories are they're following where you might have a pedestrian or a car or motorcycle or background this means none of the above some of this no pedestrian your common no motorcycle then you may have an output background so these are your classes they have a soft max with four possible outputs so this is the standard classification pipeline how about if you want to localize the car in the image as well to do that you can change your neural network to have a few more output units that output a bounding box so in particular you can have the neural network output for more numbers and I'm going to call them B X B Y B H and B W and these phone numbers parameterize the bounding box of the detected object so in these videos I'm going to use the notational convention that the upper left of the image I'm going to denote as the coordinate 0 0 and the lower right is 1 1 so specifying the bounding box the red rectangle requires specifying the midpoint so that's the point B X comma B Y as well as the height that would be B H as well as the width B W of this bounding box so now if your training set contains not just the object class label which union that took 
us trying to predict up here but it also contains four additional numbers giving the bounding box then you can use supervised learning to make your algorithm outputs not just a class label but also the four parameters to tell you where is the bounding box of the object you detect it so in this example the idea px might be about 0.5 because this is about half way to the right to the image B Y might be about 0.7 since that's about you know maybe 70% of the way down to the image B H might be about 0.3 because the height of this red square is about 30% of the overall height of the image and BW might be about 0.4 let's say because the width of the red box is about 0.4 of the overall width of the entire image so let's formalize this a bit more in terms of how we define the target label Y for this as a supervised learning task so just as a reminder these are our four classes and the new network now outputs those four numbers as well as a class label or maybe probabilities of the class labels so let's define the target label Y as follows it's going to be a vector where the first component PC is going to be is there an object so if the object is classes 1 2 or 3 PC will be equal to 1 and if is the background class so if it's none of the objects you trying to detect then PC will be 0 and PC you can think of that as standing for the probability that there's a object probability that one of the classes you're trying to detect is there so something other than the background class next if there is an object then you wanted to output B X B Y B H and BW the bounding box for the object you detect it and finally you if there is an object so if PC is equal to one you wanted to also output c1 c2 and c3 which tells it is it the class 1 class 2 or class 3 so is it a pedestrian a car or a motorcycle and remember in the problem we're addressing we assume that your image has only one object so and most one of these objects appears in the picture in this classification with localization problem so let's go through a couple examples if this is a training set image so that is X then Y will be the first component PC will be equal to 1 because there is an object then b XB y bh + BW will specify the bounding box so your label training set we'll need bounding boxes in the labels and then finally this is a car so it's close to so C 1 will be 0 because it's not a pedestrian C 2 B 1 because it is a car C 3 will be 0 since it's not a motorcycle so among C 1 C 2 C 3 at most one of them should be equal to 1 so that's if there is an object in the image 1 of this no object in the image whether we have a training example where X is equal to that in this case PC will be equal to 0 and the rest of the elements of this will be don't cares so I'm going to write question marks in all of them so this is a don't care because if there is no object in this image then you don't care what bounding box the neural network outputs as well as which of the three objects you want C 2 C 3 it thinks of this so given a set of labeled training examples this is how you it construct X the input image as well as Y the class label both images where there is an object and for images where there is no object and these will then define your training set finally next let's describe the loss function you use to train the neural network so the ground truth label was y and in your network outputs some Y hat what should be the law speed well if you're using squared error then the loss can be y1 hat minus y1 squared plus y2 hats minus y2 squared plus dot dot plus 
y8 hat minus y 8 squared notice that Y here has 8 components so that goes from sum of the squares of the difference of the elements and that's the loss if y1 is equal to 1 so that's the case where there is an object so y1 is equal to PC right so PC is equal to 1 that is if there is an object in the image then the loss can be the sum of squares over all the different elements the other case is if y1 is equal to 0 so that's if this PC is equal to 0 in that case the loss can be just y1 hat minus y1 squared because in that second case all the rest of the components are don't care us and so all you care about is how accurately is the neural network outputting PC in that case so just a recap if y1 is equal to 1 that's this case then you can use the squared error to penalise squared deviation from the predictor than the actual outputs for all eight components whereas if y1 is equal to 0 then you know the second to the eighth components that don't care so all you care about is how accurately is your neural network estimating y1 which is equal to PC and just as a side comment for those of you that want to know all the details have used the squared just to simplify the description here in practice you could use you can probably use a log likelihood loss for the C 1 C 2 C 3 e to the softmax output one of those elements usually you can use squared error or something like squared error for the bounding box coordinates and then for PC you could use something like the logistic regression loss although even if you use squared error or very work okay so that's how you get a neural network to not just classify an object but also to localize it the idea of having a neural network output a bunch of real numbers to tell you where things are in a picture turns out to be a very powerful idea in the next video I want to share of you some other places where this idea of having a neural network I'll put a set of real numbers almost as a regression task can be very powerful to use elsewhere in computer vision as well so let's go on to the next videohello and welcome back this week you learn about object detection this is one of the areas of computer vision that's just exploding and it's working so much better than just a couple years ago in order to build up to object detection you first learn about object localization let's start by defining what that means you're already familiar with the image classification task where an algorithm looks at this picture and might be responsible for saying this is a car so that was classification the problem you learn to build a neural network to address later in this video is classification with localization which means not only do you have to label this as say a car but the algorithm also is responsible for putting a bounding box or drawing a red rectangle around the position of the car in the image so that's called the classification with localization problem where the term localization refers to figuring out where in the picture is the car you've detected later this week you then learn about the detection problem where now there might be multiple objects in the picture and you have to detect them all and localize them all and if you're doing this for an autonomous driving application then you might need to detect not just other cause but maybe other pedestrians and motorcycles and maybe even other objects so you see that later this week so in the terminology we'll use this week the classification and the classification of localization problems usually have one object usually 
one big object in the middle of the image that you're trying to recognize or recognize and localize in contrast in the detection problem there can be multiple objects and in fact maybe even multiple objects of different categories within a single image so the idea is you've learned about for image classification will be useful for classification with localization and then the idea is you learn for localization will then turn out to be useful for detection so let's start by talking about with localization you're already familiar with the image classification problem in which you might input a picture into a confident with multiple layers so there's a confident and this results in and this results in a vector features that is fed to maybe a soft max unit that outputs the predicted class so if you are building a self-driving car maybe your object categories are they're following where you might have a pedestrian or a car or motorcycle or background this means none of the above some of this no pedestrian your common no motorcycle then you may have an output background so these are your classes they have a soft max with four possible outputs so this is the standard classification pipeline how about if you want to localize the car in the image as well to do that you can change your neural network to have a few more output units that output a bounding box so in particular you can have the neural network output for more numbers and I'm going to call them B X B Y B H and B W and these phone numbers parameterize the bounding box of the detected object so in these videos I'm going to use the notational convention that the upper left of the image I'm going to denote as the coordinate 0 0 and the lower right is 1 1 so specifying the bounding box the red rectangle requires specifying the midpoint so that's the point B X comma B Y as well as the height that would be B H as well as the width B W of this bounding box so now if your training set contains not just the object class label which union that took us trying to predict up here but it also contains four additional numbers giving the bounding box then you can use supervised learning to make your algorithm outputs not just a class label but also the four parameters to tell you where is the bounding box of the object you detect it so in this example the idea px might be about 0.5 because this is about half way to the right to the image B Y might be about 0.7 since that's about you know maybe 70% of the way down to the image B H might be about 0.3 because the height of this red square is about 30% of the overall height of the image and BW might be about 0.4 let's say because the width of the red box is about 0.4 of the overall width of the entire image so let's formalize this a bit more in terms of how we define the target label Y for this as a supervised learning task so just as a reminder these are our four classes and the new network now outputs those four numbers as well as a class label or maybe probabilities of the class labels so let's define the target label Y as follows it's going to be a vector where the first component PC is going to be is there an object so if the object is classes 1 2 or 3 PC will be equal to 1 and if is the background class so if it's none of the objects you trying to detect then PC will be 0 and PC you can think of that as standing for the probability that there's a object probability that one of the classes you're trying to detect is there so something other than the background class next if there is an object then you 
wanted to output B X B Y B H and BW the bounding box for the object you detect it and finally you if there is an object so if PC is equal to one you wanted to also output c1 c2 and c3 which tells it is it the class 1 class 2 or class 3 so is it a pedestrian a car or a motorcycle and remember in the problem we're addressing we assume that your image has only one object so and most one of these objects appears in the picture in this classification with localization problem so let's go through a couple examples if this is a training set image so that is X then Y will be the first component PC will be equal to 1 because there is an object then b XB y bh + BW will specify the bounding box so your label training set we'll need bounding boxes in the labels and then finally this is a car so it's close to so C 1 will be 0 because it's not a pedestrian C 2 B 1 because it is a car C 3 will be 0 since it's not a motorcycle so among C 1 C 2 C 3 at most one of them should be equal to 1 so that's if there is an object in the image 1 of this no object in the image whether we have a training example where X is equal to that in this case PC will be equal to 0 and the rest of the elements of this will be don't cares so I'm going to write question marks in all of them so this is a don't care because if there is no object in this image then you don't care what bounding box the neural network outputs as well as which of the three objects you want C 2 C 3 it thinks of this so given a set of labeled training examples this is how you it construct X the input image as well as Y the class label both images where there is an object and for images where there is no object and these will then define your training set finally next let's describe the loss function you use to train the neural network so the ground truth label was y and in your network outputs some Y hat what should be the law speed well if you're using squared error then the loss can be y1 hat minus y1 squared plus y2 hats minus y2 squared plus dot dot plus y8 hat minus y 8 squared notice that Y here has 8 components so that goes from sum of the squares of the difference of the elements and that's the loss if y1 is equal to 1 so that's the case where there is an object so y1 is equal to PC right so PC is equal to 1 that is if there is an object in the image then the loss can be the sum of squares over all the different elements the other case is if y1 is equal to 0 so that's if this PC is equal to 0 in that case the loss can be just y1 hat minus y1 squared because in that second case all the rest of the components are don't care us and so all you care about is how accurately is the neural network outputting PC in that case so just a recap if y1 is equal to 1 that's this case then you can use the squared error to penalise squared deviation from the predictor than the actual outputs for all eight components whereas if y1 is equal to 0 then you know the second to the eighth components that don't care so all you care about is how accurately is your neural network estimating y1 which is equal to PC and just as a side comment for those of you that want to know all the details have used the squared just to simplify the description here in practice you could use you can probably use a log likelihood loss for the C 1 C 2 C 3 e to the softmax output one of those elements usually you can use squared error or something like squared error for the bounding box coordinates and then for PC you could use something like the logistic regression loss although even if you use 
squared error or very work okay so that's how you get a neural network to not just classify an object but also to localize it the idea of having a neural network output a bunch of real numbers to tell you where things are in a picture turns out to be a very powerful idea in the next video I want to share of you some other places where this idea of having a neural network I'll put a set of real numbers almost as a regression task can be very powerful to use elsewhere in computer vision as well so let's go on to the next video\n"