Stereo 3D Vision (How to avoid being dinner for Wolves) - Computerphile

**Understanding Pinhole Cameras and the Depth Problem**

For simplicity's sake, we'll model our camera as a pinhole camera. The optical center of the camera sits somewhere behind the image plane. Any light ray coming from an object in the scene travels down towards the camera, intersects the image plane, and passes into the optical center. This happens for every point in the scene that the camera can see.
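In the standard pinhole model (using the usual notation rather than anything from the video), a world point $(X, Y, Z)$ in camera coordinates, with focal length $f$, projects onto the image plane at:

$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}$$

Note that $Z$ appears only as a divisor, so scaling a point's distance and size together leaves its projection unchanged.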

Now suppose we've got a point on the image plane: where did it come from? The crucial problem is that it could have come from anywhere along the ray through that point - near the camera, far away, or anywhere in between - and from a single image we have no way of knowing. That's the depth problem.
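Equivalently, every candidate source lies on the back-projected ray

$$\mathbf{X}(t) = \mathbf{C} + t\,\mathbf{d}, \qquad t > 0,$$

where $\mathbf{C}$ is the optical center and $\mathbf{d}$ is the direction through the pixel. One image pins down the ray but not the value of $t$, and $t$ is exactly the unknown depth.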

The second camera also has an optical center somewhere behind its image plane, with rays passing through points on that plane in the same way. If we knew which point in the second image corresponds to our point in the first, we could project the two rays, find where they intersect, and use simple triangulation to work out how far away that position is. The trouble is that we don't know which point that is: the object may look different from the new viewpoint, or may not be visible in the second image at all. Reliably finding the exact same point in a different image, when it may have rotated and changed slightly, is a lot of work in two dimensions.
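As a minimal sketch of that triangulation step, assuming we already know each camera's optical center and the unit direction of the ray through the matched pixel (the function name and setup are illustrative, not from the video). In practice the two rays never intersect exactly because of noise, so a common choice is the midpoint of their closest approach:

```python
import numpy as np

def triangulate_rays(c1, d1, c2, d2):
    """Midpoint of closest approach between two back-projected rays.

    c1, c2: optical centers, shape (3,); d1, d2: unit ray directions, shape (3,).
    Minimises ||(c1 + t1*d1) - (c2 + t2*d2)||^2 over t1, t2 via least squares.
    """
    A = np.column_stack((d1, -d2))   # 3x2 system matrix
    b = c2 - c1
    (t1, t2), *_ = np.linalg.lstsq(A, b, rcond=None)
    p1 = c1 + t1 * d1                # closest point on ray 1
    p2 = c2 + t2 * d2                # closest point on ray 2
    return (p1 + p2) / 2.0           # estimated 3D point
```

The distance from a camera center to the returned point is then the depth we were after.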

And you'd have to do that for every single pixel in the first image, searching the whole of the second image for a possible match. That's far too much work, so we don't tend to do it. Instead we use an observation called epipolar geometry to make the search easier. The ray through a point in one image, together with the two optical centers, forms a triangle; where the plane of that triangle cuts the other image, it traces out a line called the epipolar line.

This is what simplifies the search: we're no longer trying to find a single point somewhere in the whole image, but searching along a line. The epipolar line contains all the possible projections of the first camera's ray into the second image. Because we know where our cameras are, we can say that the match for our point X1 must lie somewhere along this line.
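In the standard formulation of this idea (not spelled out in the video), the epipolar constraint is packaged into a 3x3 fundamental matrix $F$: for corresponding homogeneous pixel coordinates $\mathbf{x}$ and $\mathbf{x}'$ in the two images,

$$\mathbf{x}'^{\top} F\,\mathbf{x} = 0, \qquad \mathbf{l}' = F\,\mathbf{x},$$

so multiplying a pixel in the first image by $F$ directly yields its epipolar line $\mathbf{l}'$ in the second.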

We know it's going to be on this line, so we've got a limited set of pixels to look through. All we need to do is go through each pixel on the line, ask which one looks most like our original point, and pick the best match. Once we've found it, we project the rays from the two matched image points through their optical centers and triangulate to work out how far away the object is.
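A minimal sketch of that search, assuming the pair has been rectified so that epipolar lines coincide with image rows (the function name, window size, and disparity range are all illustrative):

```python
import numpy as np

def match_along_row(left, right, y, x, half=4, max_disp=64):
    """Find the best match for left[y, x] along row y of the rectified
    right image, using sum-of-absolute-differences (SAD) on a patch.

    left, right: grayscale float arrays; assumes the (2*half+1)^2 window
    fits inside both images at the locations compared.
    """
    patch = left[y - half:y + half + 1, x - half:x + half + 1]
    best_disp, best_cost = 0, np.inf
    for d in range(max_disp):
        xr = x - d                  # candidate column in the right image
        if xr - half < 0:
            break                   # ran off the left edge of the image
        cand = right[y - half:y + half + 1, xr - half:xr + half + 1]
        cost = np.abs(patch - cand).sum()
        if cost < best_cost:
            best_cost, best_disp = cost, d
    return best_disp                # disparity in pixels
```

Comparing a small window rather than a single pixel is what makes the "looks most like" test meaningful: one pixel value alone is far too ambiguous.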

**Epipolar Geometry**

One edge of our triangle is the baseline between the optical centers of the two cameras; another passes through the image point and out into the world; the third is the one we don't know, but it must lie along the epipolar line, because the triangle is a flat plane cutting through the second image. This makes things a lot easier: for any pixel in one image, we can directly compute its epipolar line in the other.

If we're writing a stereo reconstruction algorithm, then for every point in one image (and perhaps the other way round as well, for completeness), we try to find the point along its particular epipolar line that best matches it. This is called the correspondence problem, and it's really the core of what we're solving here. The per-pixel offset between matched points is called the disparity; pixels that are occluded - visible in one view but not the other - are the hardest to deal with.
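For a rectified pair, disparity relates directly to depth. With baseline $B$ between the optical centers, focal length $f$, and disparity $d$ (a standard result, not derived in the video):

$$Z = \frac{f\,B}{d}$$

Nearby objects have large disparities and distant ones have small disparities, which is also why depth estimates get noisier with range: at small $d$, a one-pixel matching error changes $Z$ a great deal.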

Finding a point in one image based on a point in the other is not as straightforward as looking for a pixel with the same value. Many factors, such as rotation, perspective change, and lighting, can make reliable correspondences difficult to find. Epipolar geometry doesn't remove those difficulties, but it makes the problem far more tractable by shrinking the search space from a whole image to a single line.

**Stereo Reconstruction**

Stereo reconstruction is the process of using two or more cameras to recover the 3D structure of a scene. This involves finding correspondences between pixels in different images and then using that information, together with the known camera positions, to determine the depth of objects in the scene.

When we're trying to find the depth of an object, we need to determine which pixel in one image corresponds to a given point in the other. Epipolar geometry helps us here: instead of searching the whole image, we only need to search along the epipolar line associated with that point.

By comparing the pixels along these epipolar lines, we find the ones that correspond between the two images, and then triangulate to work out how far away the objects are. This is a simplified version of what happens in practice: good algorithms also enforce smoothness, since real surfaces don't tend to jump back and forth in depth from one pixel to the next.
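In practice you would rarely hand-roll the dense matching. As a sketch, OpenCV's semi-global block matcher performs epipolar-line matching with a smoothness term built in (file names and parameter values here are placeholders):

```python
import cv2

# Load a rectified stereo pair as grayscale (paths are placeholders).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching: searches along epipolar lines (image rows,
# after rectification) while penalising abrupt disparity changes.
stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)

# compute() returns fixed-point disparities scaled by 16;
# divide to get disparities in pixels.
disparity = stereo.compute(left, right).astype("float32") / 16.0
```

The resulting disparity map can then be converted to depth with the $Z = fB/d$ relation above.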

**Challenges and Limitations**

Stereo reconstruction has many challenges and limitations, particularly in finding reliable correspondences between images. Rotation, perspective change, and occlusion can make matches hard to find, and even small matching errors can translate into large errors in the recovered depth.

Additionally, stereo reconstruction requires a lot of processing power and computational resources to handle the large amounts of data involved. However, with the advent of more powerful computers and advanced algorithms, we're able to tackle these challenges and create more accurate 3D images from stereo pairs.

Despite the challenges, stereo reconstruction remains an important area of research in computer vision and robotics, with applications in fields such as augmented reality, virtual reality, and robotics.

"WEBVTTKind: captionsLanguage: enyou can get results from this where you can't get results from lasers lasers get bleached out in sunlight I had a colleague that I was speaking to who went to Mexico to do crop scanning and he had a handheld laser scanner and he had to do it at night in a tent because the sun wrecked the laser scanner and there were walls about and it was a big problem for him if he' have just used a camera you might have found that you've got to work harder on your stereo matching but there are things it will do that laser scanners can't so there's going to be a time for one and a time for the other the top tip for the day is use a stereo pair of cameras don't get eaten by Wolves yeah that would that would be my advice we find corresponding points in our left and right eyes and then we can use that to work out how far away from us something is when we have an individual eye on its own we have some monocular cues some monocular Clues as it were that we can use to find out depth or at least to estimate depth but true 3D only comes from two eyes in a single eye you might have something like the object is bigger than it was before so it's coming towards W us or one object is passing our view faster than another and that Parallax and that gives us a clue that it's in front of something else e clusion is an obvious one if something actually is in front of something else we can make some reasoning about that so our brains will take those monocular Clues cues and do something with them and work out what's going on but when we have two eyes then we can do actual 3D depth perception um the classic example is those Magic Eye things that were around in the '90s I'm not very good at seeing those I I kind of cross-eyed and it kind of works but it's all a bit backwards but the idea there is that we trick our eyes into seeing slightly different images and that gives us a perception of depth if we've got a stereo um system what we the main thing we need to know is where are our cameras our brains know where our eyes are because they've learned it but one's here and one's here you know people maybe have slightly further apart eyes um but your brain will account for this if we're going to do this mathematically using a computer we need to know where these cameras are if you know that we've seen an object in in one View and then we go into the other view we need to try and find corresponding points without knowing where the cameras were your search spaces increased you've got to look over the whole image maybe you get points confused maybe there's a corner that appears multiple times because it's like a book and it has four corners and then you've got to try and resolve which one's which um and some of these features won't appear in both views because of occlusion so if you take your left and right view of my hands you know some of my left hand is going to be visible from one eye it isn't in the other eye and that's a huge problem so what we do is we start with a process called camera calibration we have two cameras that are nearly next to each other and we don't know exactly what their angles are but we can find that out by using camera calibration we have to take the picture from both cameras the exact same time because otherwise the scene's going to have changed so we'll assume we're taking pictures with the cameras at the same time something that isn't true of of some visual reconstruction systems we take a picture of this board and we calibrate the positions of our cameras and then we 
move the cameras and take a picture of something we're trying to reconstruct in 3D so then we have a situation where uh we have one image here on this side our left left View and we have an image here which is our right view in our previous video on ratx we talked about the lens and all this system in front we'll do away with that for now just for Simplicity sake and we'll say that these are pinhole cameras because we're using a pinhole camera model the optical center of our camera is somewhere behind this camera plane so some object in the world projects its light down here intersects our image plane and then goes into the camera origin like this we have an optical center of our camera and any light rays coming from this object here are going to travel down this Ray intersect our image plane and then go into the optical center of the camera and this will happen for any points in our scene that this camera can see we want to say we've got a point on this image plane where did it come from and The crucial problem is that it could have come from here or it could have come from here or here or here or anywhere along this Ray and we don't know and that's what we're trying to find out that's the depth problem now we also have an optical centor for this camera which is here and Rays will be coming out and and intersecting through these points so if we knew that this point in this image was this point in this image then we just project the Rays we find where they intersect and that's we use Simple triangulation simple maths to work out how far way that position is we don't know what point that is because it's going to change it might not be visible in this image is one problem the search space is quite large reliably finding the exact same point as this in a different image when it might have rotated and changed slightly is a lot of work in two dimensions and you've got to do that for every single Pixel in this image you've got to find maybe one that tries to match in here that's a lot of work to do so we don't tend to do that we use something a nice observation called epipolar geometry to try and make this a little bit easier if this is our intersection Point X1 and this is some object X all the way out there and we're trying to find out how far away it is we need to try and make our search in this image a little bit easier so what we do is we we imagine that this is part of a big triangle coming out so this is one corner of our triangle this is another or it could be this this x is somewhere along here and comes in through this point so let me get a different pen make things easier we can draw a aray that goes from this optical center to here and from this optical center to here and from this optical center to here to any of these points and they intersect this image like this and what this is is our epipolar line so this line here through these points is all the possible projections of this Ray into this image so now we've simplified our problem because we know where these cameras are we can say we we trying to find this position X1 in this image by knowing that it's going to be somewhere along here we know it's going to be in this line here so we've got a limited set of pixels we have to now look through so all we need to do is go through each of these pixels in a list and say which one of them looks most like this and then we find it and then we find our triangulation point and we find out how far it is away is this because you already know where the cameras are yes it it's only possible 
because we know where the cameras are if we don't then we have to just search through the whole of the other image and it takes ages one edge of our triangle is between the optical centers of the cameras one is through this point and out into the world and the other is some value we don't know which is going to be along this line because it's just a flat triangle cutting through this image which makes it a lot easier to find out where these things are what we will do if we're writing a stereo reconstruction algorithm is for every point in this image and maybe we'll do it backwards as well for completeness for every point in this image we will try and find the point along its particular epipolar line that best matches it and then of course you can go much more complicated than that you can try and find the global image map between here and here which is a combination of not only the best feature matches but also um you know it needs to be nice and smooth objects don't tend to go back and forth a lot so you want them to be rounded so you have to bear that in mind finding a point in this image based on another one from this image is called the correspondence problem and that's really the core of what of what we're solving here finding the occluded pixels is hard and there are approaches based on this where they not only try and find what we call the disparity map the the difference between this X and this Xyou can get results from this where you can't get results from lasers lasers get bleached out in sunlight I had a colleague that I was speaking to who went to Mexico to do crop scanning and he had a handheld laser scanner and he had to do it at night in a tent because the sun wrecked the laser scanner and there were walls about and it was a big problem for him if he' have just used a camera you might have found that you've got to work harder on your stereo matching but there are things it will do that laser scanners can't so there's going to be a time for one and a time for the other the top tip for the day is use a stereo pair of cameras don't get eaten by Wolves yeah that would that would be my advice we find corresponding points in our left and right eyes and then we can use that to work out how far away from us something is when we have an individual eye on its own we have some monocular cues some monocular Clues as it were that we can use to find out depth or at least to estimate depth but true 3D only comes from two eyes in a single eye you might have something like the object is bigger than it was before so it's coming towards W us or one object is passing our view faster than another and that Parallax and that gives us a clue that it's in front of something else e clusion is an obvious one if something actually is in front of something else we can make some reasoning about that so our brains will take those monocular Clues cues and do something with them and work out what's going on but when we have two eyes then we can do actual 3D depth perception um the classic example is those Magic Eye things that were around in the '90s I'm not very good at seeing those I I kind of cross-eyed and it kind of works but it's all a bit backwards but the idea there is that we trick our eyes into seeing slightly different images and that gives us a perception of depth if we've got a stereo um system what we the main thing we need to know is where are our cameras our brains know where our eyes are because they've learned it but one's here and one's here you know people maybe have slightly further 
apart eyes um but your brain will account for this if we're going to do this mathematically using a computer we need to know where these cameras are if you know that we've seen an object in in one View and then we go into the other view we need to try and find corresponding points without knowing where the cameras were your search spaces increased you've got to look over the whole image maybe you get points confused maybe there's a corner that appears multiple times because it's like a book and it has four corners and then you've got to try and resolve which one's which um and some of these features won't appear in both views because of occlusion so if you take your left and right view of my hands you know some of my left hand is going to be visible from one eye it isn't in the other eye and that's a huge problem so what we do is we start with a process called camera calibration we have two cameras that are nearly next to each other and we don't know exactly what their angles are but we can find that out by using camera calibration we have to take the picture from both cameras the exact same time because otherwise the scene's going to have changed so we'll assume we're taking pictures with the cameras at the same time something that isn't true of of some visual reconstruction systems we take a picture of this board and we calibrate the positions of our cameras and then we move the cameras and take a picture of something we're trying to reconstruct in 3D so then we have a situation where uh we have one image here on this side our left left View and we have an image here which is our right view in our previous video on ratx we talked about the lens and all this system in front we'll do away with that for now just for Simplicity sake and we'll say that these are pinhole cameras because we're using a pinhole camera model the optical center of our camera is somewhere behind this camera plane so some object in the world projects its light down here intersects our image plane and then goes into the camera origin like this we have an optical center of our camera and any light rays coming from this object here are going to travel down this Ray intersect our image plane and then go into the optical center of the camera and this will happen for any points in our scene that this camera can see we want to say we've got a point on this image plane where did it come from and The crucial problem is that it could have come from here or it could have come from here or here or here or anywhere along this Ray and we don't know and that's what we're trying to find out that's the depth problem now we also have an optical centor for this camera which is here and Rays will be coming out and and intersecting through these points so if we knew that this point in this image was this point in this image then we just project the Rays we find where they intersect and that's we use Simple triangulation simple maths to work out how far way that position is we don't know what point that is because it's going to change it might not be visible in this image is one problem the search space is quite large reliably finding the exact same point as this in a different image when it might have rotated and changed slightly is a lot of work in two dimensions and you've got to do that for every single Pixel in this image you've got to find maybe one that tries to match in here that's a lot of work to do so we don't tend to do that we use something a nice observation called epipolar geometry to try and make this a little bit easier 
if this is our intersection Point X1 and this is some object X all the way out there and we're trying to find out how far away it is we need to try and make our search in this image a little bit easier so what we do is we we imagine that this is part of a big triangle coming out so this is one corner of our triangle this is another or it could be this this x is somewhere along here and comes in through this point so let me get a different pen make things easier we can draw a aray that goes from this optical center to here and from this optical center to here and from this optical center to here to any of these points and they intersect this image like this and what this is is our epipolar line so this line here through these points is all the possible projections of this Ray into this image so now we've simplified our problem because we know where these cameras are we can say we we trying to find this position X1 in this image by knowing that it's going to be somewhere along here we know it's going to be in this line here so we've got a limited set of pixels we have to now look through so all we need to do is go through each of these pixels in a list and say which one of them looks most like this and then we find it and then we find our triangulation point and we find out how far it is away is this because you already know where the cameras are yes it it's only possible because we know where the cameras are if we don't then we have to just search through the whole of the other image and it takes ages one edge of our triangle is between the optical centers of the cameras one is through this point and out into the world and the other is some value we don't know which is going to be along this line because it's just a flat triangle cutting through this image which makes it a lot easier to find out where these things are what we will do if we're writing a stereo reconstruction algorithm is for every point in this image and maybe we'll do it backwards as well for completeness for every point in this image we will try and find the point along its particular epipolar line that best matches it and then of course you can go much more complicated than that you can try and find the global image map between here and here which is a combination of not only the best feature matches but also um you know it needs to be nice and smooth objects don't tend to go back and forth a lot so you want them to be rounded so you have to bear that in mind finding a point in this image based on another one from this image is called the correspondence problem and that's really the core of what of what we're solving here finding the occluded pixels is hard and there are approaches based on this where they not only try and find what we call the disparity map the the difference between this X and this X\n"