The Video Conferencing Problem - Computerphile

The Art and Science of Video Conferencing: A Technical Deep Dive

Video conferencing has become an essential communication tool in today's digital age, but have you ever wondered how it actually works? In this article, we take a technical deep dive into the world of video conferencing, exploring the challenges and solutions that make it possible to stream high-quality audio and video over the internet.

**Audio Compression**

When it comes to audio, the goal is to reduce the amount of data sent over the network while maintaining acceptable call quality. Compression is only part of the story, though: the compressed audio must also be packetized, broken into small packets that can be transmitted independently. Each packet carries a timestamp and a sequence number, allowing the receiving end to reassemble the packets in the correct order.
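The header described above can be sketched in a few lines. This is a simplified illustration, not a full RTP implementation; the function names `packetize` and `depacketize` and the 12-byte layout (flags, payload type, sequence number, timestamp, source identifier) are chosen here for clarity, loosely following the RTP header structure.

```python
import struct

def packetize(samples: bytes, seq: int, timestamp: int, ssrc: int) -> bytes:
    """Build a minimal RTP-style packet: a 12-byte header followed by payload.

    Simplified header: version/flags byte, payload-type byte,
    16-bit sequence number, 32-bit timestamp, 32-bit source identifier.
    """
    header = struct.pack("!BBHII", 0x80, 0, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + samples

def depacketize(packet: bytes):
    """Split a packet back into (seq, timestamp, ssrc, payload)."""
    _, _, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return seq, timestamp, ssrc, packet[12:]
```

The receiver uses the sequence number to detect loss and reordering, and the timestamp to schedule playout.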

To cope with variations in network latency, the receiver needs a jitter buffer so that the audio doesn't skip or stutter. The system must also adapt to changing conditions such as packet loss or latency spikes; techniques like variable bitrate allocation and packet loss concealment minimise the impact of these issues on call quality.
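A jitter buffer of the kind just described can be sketched as a small priority queue keyed on sequence number. This is a deliberately minimal illustration (the class name `JitterBuffer` and the fixed `depth` priming rule are assumptions for this sketch); real implementations size the buffer adaptively from measured jitter.

```python
import heapq

class JitterBuffer:
    """Hold a few packets sorted by sequence number so late or
    reordered packets can still be played out in order."""

    def __init__(self, depth: int = 3):
        self.depth = depth   # packets to accumulate before playout starts
        self.heap = []       # min-heap of (seq, payload)

    def push(self, seq: int, payload: bytes):
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Return the next (seq, payload) once primed, else None —
        the caller plays silence or conceals the gap."""
        if len(self.heap) < self.depth:
            return None
        return heapq.heappop(self.heap)
```

The trade-off discussed in the article is visible here: a deeper buffer absorbs more jitter but adds `depth × packet-duration` of extra latency.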

Another essential audio-processing step is echo cancellation. When the far end's voice comes out of a loudspeaker and is picked up again by the local microphone, it bounces back to the original speaker as an unpleasant echo, and in the worst case a feedback loop. By applying an adaptive digital filter that estimates the echo path and subtracts the estimated echo from the microphone signal, we can eliminate this effect and improve overall call quality.
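The adaptive-filter idea can be illustrated with a basic LMS (least mean squares) canceller. This is a toy sketch, not production echo cancellation (which also handles delay estimation, double-talk detection, and nonlinearities); the function name and the tiny tap count are assumptions made for brevity.

```python
def lms_echo_canceller(far_end, mic, taps=4, mu=0.1):
    """LMS adaptive filter sketch for echo cancellation.

    far_end: samples played out of the loudspeaker (the echo source)
    mic:     samples captured by the microphone (speech + echo)
    Returns the error signal: mic with the estimated echo subtracted.
    """
    w = [0.0] * taps                  # adaptive filter coefficients
    out = []
    for n in range(len(mic)):
        # most recent `taps` far-end samples, newest first
        x = [far_end[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_est         # residual after echo removal
        for k in range(taps):         # LMS update: nudge w toward the echo path
            w[k] += mu * e * x[k]
        out.append(e)
    return out
```

Fed a microphone signal that is pure echo, the residual shrinks toward zero as the filter converges.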

**Video Compression**

When it comes to video compression, the process is similar to audio compression. However, since video involves capturing images rather than sound waves, we need to use different techniques to achieve acceptable quality. The goal of video compression is to reduce the amount of data being sent over the network while maintaining a smooth and stable image.

To achieve this, we use standard compression algorithms such as H.264 or H.265, which break each frame down into small blocks that can be compressed largely independently. This lets us exploit both spatial redundancy (patterns within a frame) and temporal redundancy (similarity between successive frames) to reduce the amount of data being sent.

One technique used in video compression is inter-frame prediction, where we predict each frame from the previous one (for example, by estimating how blocks have moved). We then compare the predicted frame with the frame actually captured and transmit only the residual: the areas where the two differ. Because successive frames in a video call are usually very similar, this eliminates a great deal of redundant information.
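The residual idea can be shown with a toy frame-differencing sketch. Real codecs predict motion-compensated blocks and transform-code the residual; here, for illustration only, frames are flat lists of pixel values and the "prediction" is simply the previous frame. The helper names `frame_delta` and `apply_delta` are invented for this sketch.

```python
def frame_delta(prev, curr, threshold=0):
    """Send only (index, value) pairs where the current frame
    differs from the predicted (here: previous) frame."""
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr))
            if abs(c - p) > threshold]

def apply_delta(prev, delta):
    """Rebuild the current frame from the previous frame plus the residual."""
    frame = list(prev)
    for i, v in delta:
        frame[i] = v
    return frame
```

For a static background, the delta list is nearly empty, which is exactly why video calls compress so well.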

The complementary technique is intra-frame prediction, which compresses a frame without reference to any other frame. By predicting each block from its already-decoded neighbours within the same frame, we can exploit spatial patterns to reduce the amount of data being sent.
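A one-dimensional sketch of spatial prediction: predict each pixel from its left neighbour and keep only the residual. Smooth regions then yield long runs of near-zero residuals, which an entropy coder stores very compactly. This is an illustration of the principle, not any codec's actual prediction modes; the function names are assumptions.

```python
def intra_predict_left(row):
    """Predict each pixel from its left neighbour; store residuals."""
    residuals = [row[0]]                      # first pixel sent as-is
    for i in range(1, len(row)):
        residuals.append(row[i] - row[i - 1])
    return residuals

def intra_reconstruct(residuals):
    """Invert the prediction to recover the original pixel row."""
    row = [residuals[0]]
    for r in residuals[1:]:
        row.append(row[-1] + r)
    return row
```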

**Synchronizing Audio and Video**

Once we've compressed both audio and video signals, we need to synchronize them in order to create a seamless viewing experience. This is achieved through the use of timestamps and sequence numbers, which allow us to reassemble the packets at the receiving end in the correct order.

When it comes to synchronizing audio and video, we need to consider the latency introduced by the compression algorithms as well as any network delays that may occur. By using techniques like packet loss concealment and timestamp-based synchronization, we can minimize these effects and ensure a smooth viewing experience.
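The timestamp-based synchronization just described boils down to a simple rule: whichever stream's frame becomes available first is delayed until the other catches up, so both play out together. A minimal sketch, assuming both streams carry timestamps on a shared clock and with a hypothetical helper name:

```python
def av_sync_delay(audio_ready_ms, video_ready_ms):
    """Given the times (ms) at which an audio frame and a video frame
    stamped with the same capture timestamp become available at the
    receiver, return (audio_delay, video_delay) so both play at the
    same instant."""
    playout = max(audio_ready_ms, video_ready_ms)
    return playout - audio_ready_ms, playout - video_ready_ms
```

Since video usually takes longer to compress and decode, in practice it is the audio that gets delayed, which adds to the overall mouth-to-ear latency.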

**Network Address Translation and Firewalls**

One of the biggest challenges in setting up a video conferencing system is ensuring that a computer behind one network's router and firewall can communicate with a computer behind another (in the video's example, Steve's machine reaching Sean's). Network Address Translation (NAT) is a large part of what makes this hard, so we need NAT traversal techniques and firewall-friendly protocols to overcome it.

NAT maps the many private IP addresses inside a home or office network onto a single public address. That conserves scarce IPv4 addresses, but it also means an outside peer has no direct way to reach a machine on the inside. Because UDP is connectionless, techniques such as UDP hole punching can reuse the mapping a NAT creates for outbound traffic to let the peer's packets back in, allowing the call to work even where NATs and firewalls are in the way.
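The hole-punching pattern can be demonstrated on a single machine: each peer transmits first, which on a real NAT is what creates the outbound mapping that allows the other side's reply in. This localhost sketch (the function name and payloads are invented) only shows the simultaneous send/receive pattern, not genuine NAT traversal, which also needs a signalling server to tell each peer the other's public address and port.

```python
import socket

def punch_pair():
    """Localhost sketch of the UDP hole-punching pattern:
    both peers send first, then both receive."""
    a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    a.bind(("127.0.0.1", 0))
    b.bind(("127.0.0.1", 0))
    a.settimeout(1.0)
    b.settimeout(1.0)
    # Both peers transmit first, as a signalling server would instruct
    a.sendto(b"punch-from-a", b.getsockname())
    b.sendto(b"punch-from-b", a.getsockname())
    msg_at_b, _ = b.recvfrom(1024)
    msg_at_a, _ = a.recvfrom(1024)
    a.close()
    b.close()
    return msg_at_a, msg_at_b
```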

The media streams themselves are also typically encrypted between the two ends, with session keys exchanged during call setup (the Secure RTP profile, SRTP, is the standard way of doing this for RTP media). This secures the conversation without requiring the users to perform any manual authentication steps.

**The Future of Video Conferencing**

As we continue to explore the possibilities of video conferencing, it is worth noting one longer-term issue: the need for IPv6 support. With the ever-growing number of devices connecting to the internet, the IPv4 address space has effectively been exhausted, which is precisely why NAT is so widespread in the first place.

By transitioning to IPv6, every device can have a globally routable address, removing much of the need for NAT traversal altogether. However, the transition requires careful planning and coordination, and video conferencing systems must remain functional and reliable across mixed IPv4/IPv6 networks in the meantime.

In conclusion, video conferencing is a complex technology that requires careful consideration of various technical challenges. By understanding the intricacies of audio compression, video compression, synchronization, and network address translation, we can create systems that deliver high-quality experiences even in environments with limited bandwidth or connectivity issues.

"WEBVTTKind: captionsLanguage: ensteve welcome back uh we are obviously still in a remote situation um but you wanted to tell us a bit more about how that um well how this kind of connection's working yeah i mean i thought it'd be interesting and obviously we've talked about remote working but i mean obviously a lot of things we're doing is we're using video conferencing or voice over ip type software to have conversations with people not just in a business context but also with friends and family i mean i've lost track of the number of zoom calls or facetime calls or facetime calls for support dave that i've had to make in the past eight weeks or however long we've been in lockdown i lose track so i thought it'd be interesting to sort of talk about the technology that's involved there because as steve jobs put it when you announced the iphone 4 and talked about facetime the technology required to do it is a bit of an alphabet soup of technologies that you need to go through so what i thought we could do is just spend some time going over how video conferencing works the first thing to say this is going to be an overview of what's involved we'll come back probably and do some more in-depth videos little bits but this one's going to give the big picture and then we can zoom in later on to talk about that um we're also going to keep things simple so let's simplify this problem down to the sort of conversation we're having now which is one person talking to another person we'll come back and talk about how we have multiple people but i'm sure you can guess how we can do it you just have multiple connections between the different people to do it and we'll only consider it going in one direction from me to sean or from sean to me because the other direction is basically the same in reverse so we'll simplify it right down so we can have a discussion about what's involved in getting the voice and pictures from my end to get to shore what we would call perhaps mouth to ear 
how do we get the voice from my mouth to sean's ear across all the networks involved what's actually involved in setting those things up you're breaking up there steve say that again i'm joking i'm joking okay so you mentioned there the uh the sound and me hearing what you're saying but obviously we've got a video element to this as well you know important for youtube as much as anything but are we going to talk about how the video connection works as well yeah yeah we'll talk about video as well but i think the important thing actually when you're having these conversations is that bizarrely even on a video conference the audio is more important to following the conversation and having a sort of natural conversation between the two people if the audio goes wrong then you can't hear what's being saying or the conversation becomes a jumbled mess so actually even when videos involved the video signal can break up or we can drop the quality on it a lot easier than the audio signals the audio signal starts to break up then the conversation the flow of it will brown down uh people start talking over each other each other follow what's being said so we'll start off looking at the audio and then we'll talk after we've done that about how we put video alongside that because actually once you understand how the audio works the video signal is going in a very similar way with a very similar set of problems and you just have to send video rather than sending audio data across the network when you rang me this morning as well as my computer starting to bleep to tell me that you were trying to connect also my phone's going how does that segment of it work how does how do we know that we want to make a call so there's really two things that are involved with video conferencing this is the call set up which is sort of making sure that sean knows i want to speak to him and that he accepts the call and then we also need to establish how we actually send data between the two of us 
um and he's also actually sending the data over the connection um so taking the audio and the video that we want to send and actually send them over the internet and i think it's easier to talk about how we send the data first and then we'll look at how we do call setup and what's involved in that in another video because that actually involves making sure we can get the data to each other so we need to know how the data is sent then we know the problems that are involved with sending the data talking about how we transfer the data there's loads of things that are involved this it may seem like a simple thing to just take a piece of audio and send it over network but actually when you think about what's involved in making this into a conversation and not just a live stream because in the live stream i'm just sending data to sean or whoever's watching and they're enjoying whatever it is insert your favorite type of youtube live stream here but with a voiceover ip or a video conference we're having a conversation and so we need to make sure that the connection we've got enables the conversation to happen as if the technology wasn't there we don't want the technology to get in the way of what we're doing and so we need to think about things like how much bandwidth is this going to take i mean we've all got network connections but they have a finite amount of data that we can send i mean in an ideal world we wouldn't worry about it we could just send all the data coming from the microphone coming from the camera straight down the network and it would appear at sean's and everything will just get there and it would work so basically what we're saying is that we need to shrink that down before it goes through our connections so that we're not sending too much data and clogging it up and causing latency right yes i mean we've got oodles of data coming the video i mean the audio for recording um sort of cd quality audio is about 768 kilobits per second um so we've got a 
lot of data going down and and okay a modern network connection my home networks 300 megabits down 35 megabits up the gigabit symmetrical isn't unheard of these days in a home setting but that's still a lot of data and so obviously we need to sort of compress whatever we've got we need to make sure whatever we send fits within the bandwidth we've got available because it's not just our bandwidth it's your bandwidth as well i've probably got a better internet connection than you no no no you have i know that you have we've we've done tests uh in transferring material back and forth that proves that um but yeah the point is i suppose there's even more considerations that you might live in a household with four or five people someone who's streaming a movie at the same time someone else who's trying to make their own call so yeah there are all sorts of considerations aren't there yes there's lots of things that affect the bandwidth is who else in the house is using it there's what you've got what i've got there's what's involved in the network between it if the network gets congested then the bandwidth will drop even though each end is technically incapable of sending that so we need to take that into account so the bandwidth is important because we need to make sure that we can actually send the data that we're not trying to first too much down the pipe as it were to the other end what also is important is the latency or the time it takes from me saying something coming out of my mouth to sean hearing it through his ear or what's known as the mouth-to-ear time so the mouth airtime or the latency of the connection is one of the key issues that we need to minimize the telecoms industry found many years ago that actually as long as it's below 100 milliseconds the conversation will flow naturally people be able to have a conversation as if it's not there so we want to try and minimize that as much as possible if it goes above 100 milliseconds then the conversation starts 
to break down because i think you've finished speaking i start speaking but because of the delay you think i'm not going to say anything and start saying something and we start talking over each other then we stop and then we both start speaking again at the same time and as it goes on and very quickly once the latency goes above 100 milliseconds the conversation will sort of break down into a sean do you copy over yes receiving over good out we want to make sure whatever we do we minimize the latency so the network itself has some inherent latency in it you can measure that by running the ping command on your computer and seeing how long it takes to get from your computer to another and back and then halve it and that shows you the latency on that network connection but this is also related to bandwidth but not directly if your bandwidth increases your latency will go down but it can also go up and the other thing to bear in mind is that you're never going to get away with zero latency and actually we expect latency in a conversation so as we sit two meters apart as we all socially distance there's a natural delay as the sound waves propagate from my mouth to someone's ear so there's always latency in the conversation what we want to ensure is that whatever we do doesn't introduce you any more latency and some of the things we might do to say preserve bandwidth are going to increase latency and so we want to balance what we do there with what we do to maintain latency so it might actually be better to to use more bandwidth and have less latency or to go for a slightly worse compression and have a sort of algorithm that compresses not as well as it could to reduce load so you want to sort of try and keep the latency down as low as possible the other thing you want to take into account is av sync so we want to make sure that the words coming out of my mouth are in sync with the pictures as they aren't at the moment because when that happens it becomes really hard to 
follow what's going on it looks odd so we want to also try and maintain sync between what i'm saying and the video and of course because they're going to go through different routes because we're going to compress the video in a different way to where we can press the audio we've got to be able to pair them up back at the other end so that they appear in sync and again that will affect latency because if it takes longer to compress the video than it does to send the audio then we have to delay one at the other end so that we can match them so latency is a really important thing in a video conferencing we want to try and minimize that as possible and lots of the things that happen are involved in minimizing the latency of the connection other things you need to take care of is that if you think about the way this call is working i'm speaking to sean it's coming out of his loudspeaker or through his earpiece in my case um and in sean's case um but if that's coming for his speaker that will get picked up again by the microphone as i speak for sure is either suppressing the echo by sort of muting one end of the conversation when that person's speaking or you can create circuits which cancel the echo by sort of taking the signal that's coming in and feeding it back and sort of in antifa to sort of cancel out what's going on there there's lots of math involved in that but that's a story for someone else in another video to cover so we can we have to take all these things into account of course as the legacy increases the echo will get more noticeable in certain things in fact there's one networking standard where its packet size was created purely on the amount of time that you get from one end of france to the other without needing echo cancelling so they went for 53 bytes or something but that's a another story atm's a complete other networking topic that we'll talk about at some point so we've got to maintain bandwidth we need to make sure we fit within what's 
available on the network or less if other people are using it and again that can change you might start a conversation and then someone starts watching netflix on the same network connection or the network may get congested elsewhere and the path between you and the other person there's lots of people on the same isp still watching the latest cat video on youtube or whatever it is people do on youtube the latest computer file video of course it's far more interesting than cats we have mice so these things will change we need to match that in our system but we also need to keep the latency down and we need to maintain a v sync so how do we go about building a system let's uh let's start drawing a diagram of the bits that are involved we're going to talk about how we can get the voice from me here to shawn over there so at sean's end we have something that looks a bit like an ear and over at my things we have something that looks a bit like two lips this has changed color and we're now going to talk about how we can send a signal from there to there if we just start building this up as we go the easiest way is that we'd have some sort of microphone at this end and we'd have a loud speaker at sean's end and then we could just connect a wire between the two and as i speak the microphone converts my voice into an audio signal an electrical signal that represents the same waveform as my voice sends it down the cable and that at the other end the loudspeaker converts that back into sound pressure waves that sean then hears um this is your standard plain old telephone system it's exactly how that worked except the cable was sort of controlled by switches but effectively once it was set up you had a connection between the two but we're talking about sending this over computers so we're going to end up having to send this digitally so what we're going to end up doing is taking the signal out of the microphone and then we're going to run that through an a to d converter and 
what we get out the other end is a series of bytes that will represent that audio signal it's a mono signal so we'll get a single stream of bytes that represent that and we can adjust the quality here by choosing the sample rate and how many bits we use to represent the audio so we could go for sort of cd quality sound or better which would give us a lot of data to send or we could sort of do what the phone companies originally did was only sample it at eight kilohertz which is two times four killers which is the highest frequency you normally get in a voice signal and so you capture all the data required for voice and you don't see anything else and you use eight bits which gives you 64 000 bits per second so that's going to give us a set of data bytes and they're going to come every eight thousandths of a second let's just go with that to keep the numbers simple so every 8 000 of a second we're going to get another byte of data representing my voice and so that's going to come out as a chain and at the other end again we have the opposite so we'll have a d to a converter digital to analog which regenerates the analog signal which can drive the outline speaker and we can then send this across and it would all work we've digitized it but we still need that cable in there so what do we do well we need to send this over our network our network the internet is packet based um there's been a long debate between um telecoms people do you go for sort of circuit switch networks or do you go for a datagram packet based networks where you just put things in the packet direct in packet and send it like you're sending a letter to the right person packets seem to have won out sort of circuits which ones like atm and so on seem to pretty much died out or used to send packets over them anyway so we're going to build them up into packets but now we've got an interesting problem we've got data coming every eight thousandths of a second from the hd converter and we need to put them 
into a packet so if we just go for say a thousand um samples that would be 1 8 of a second at 8 000 kilohertz sampling rates that's over 100 milliseconds there so we probably want to choose a smaller packet size so as soon as we start building these things up into a packet we've got to wait for the right number of bytes to come in before we can send the packet out that's going to introduce some latency so we need to make sure we choose a packet size that is small enough to not introduce too much latency so let's say we just had one bite in each packet well we could send that over that would mean we didn't have any latency but we then have a situation where you'd be wasting bandwidth because that one bite would have several bites of packet headers on there for the sort of ip header and so on that you need to send send it across so that'd be very wasteful so you want some there so you need to choose the right number of bytes to make this sort of fit there so if you make it too many then you'll use your bandwidth better but you'll increase your latency so you have to sort of rob peter to paypal uh that to find the right thing and of course as the network changes those conditions will change so you may have to change that as the thing's going on there so we're suddenly going to get a delay here by building things into a packet and then you get that packet at the other end and you take the bytes out of that one by one and feed them to your d2a converter there's still an issue here as we send that over the network you're taking the data in the packet and you're reading out the bytes one by one sending them to do your da converter one eighth every eight thousandths of a second so we can send that but when you come to the end of the packet you need to make sure that you've got the next packet with the next set of bytes there at the next 1 8 000 of a second later if it's if it comes later than that you haven't got the data so you can't send it out to the da converter what 
do you do well you end up having to play silence or whatever so we need it there one eight thousand a second but if you backtrack to my end of the connection to send that i can only send that when the network conditions are clear so if i'm sending another packet let's say i've got a 1500 byte packet from a network transfer where i'm sending this video to you sean at the same time that's just started that's going to take some time to send i can't send my packet of data with the audio connection until that's finished sending which will delay it a bit more and then which will delay it at the other end so actually although we've got the packets there we're going to end up with some sort of buffering after that just because the network will be being used for things and actually every step along the network is going to introduce some buffering as well that means that at your end there's a very good chance that you haven't got the data you need to free to your da converter when you finish going through that packet so we need to put something in place to make that work and the way we can do that typically say in a live stream thing is that we actually we put a buffer here and we just start buffering or cueing up those packets so we know we've got two or three in memory all the time so if the one arrives late it goes on the end of the queue and then by the time we come to it we've got to that point but that introduces more delay so we're going to make sure that the buffer there is as small as possible perhaps two or three packets and that packet size is small enough so that we don't notice that additional latency we introduce there but there's another issue here if you think back to the video we did on vpn we talked about how there's two ways we can send data over the network we can either say use a tcp connection which tries to emulate a a stream between the two connections where we send every byte down there and it arrives at the other end guaranteed one after the other 
in the order they're sent now i think that's great that's exactly what we want for an audio conversation we want all the data to get there so that you can hear everything i said but if you remember the way we said that does that is that tcp if it doesn't receive something will wait and then after it to be sent again and then it'll send it again and if that happens that's going to delay that packet arriving and because it's guaranteed everything is happening in order after that every other byte will be delayed by that point and so the latency will increase and as packets will get lost on the internet because it's the best effort network is guaranteed to lose packets then we'll be getting packets lost and lost and lost and so the latency over the conversation will creep up and up and up imperceptibly but eventually by the end it could have got to a point where you can notice it so tcp actually perhaps isn't as good an idea for sending this as you think the alternative way we can send those is say just to send the packets out there and let the computer do its best to get there and if they get lost they get lost and actually it turns out that certainly for audio and also for video this isn't as much of a problem as as you might expect because if you think about it if your packet size is small enough let's say we're sending um 80 bytes that's 100 of a second if it's a 100th of a second drop out where the packet gets lost on the audio you're probably not going to hear it you'll be able to pick up the sort of conversation as things are going and the next packet will come along and so all you'll get is a sort of small bit of silence as you're hearing this bit now and actually that's that's quite natural to us because as we're having conversations there are other sounds that we hear which might cover a bit of the conversation and so we're used to those sort of things so actually it's better to have packets go missing and then just carry on and deal with what we've got and 
deal with that then to actually try and guarantee everything's going there so when we send these things out we want to send them using udp rather than tcp because that enables packets to get lost and as we said we're not going to notice that in a conversation whereas we would start to notice the build up of latency as the tcp connection started delaying things as everything got resent and resent and reset and resent and so on so we we use udp now rather than just stuffing the data into the udp packet as is we actually prepend to the data another header there this time for what's called the real-time transport protocol and this just gives us some details about what type of data is in this stream of packets is it audio is it video what codec is it using to compress things and so on but it also has a source identifier in there so that we know that this is coming from the same stream so just a unique randomly generated id for a stream that says these packets are all coming from the same source stream the same audio stream or the same video stream so you can sort of multiplex and we also have a sequence number in there so we know where the packets have got lost or what order they need to go because the other thing we need to bear in mind is that the network might reorder things um because they go different routes in which case we want to reorder them ourselves back into the correct order otherwise we'll get the audio jumbled up and if we're buffering things we get the opportunity to do this or we can just drop the packets if they come out of order and so on but also we have a time stamp on there which tells us when this should be played a sort of time that says when it either when it came from or when it should be played and this is what we use later to synchronize the video to the audio we stamped the audio with the time it came from we stamped the video with the time it came from and that tells us when the first bite in that packet should be played out that works fine 
when the data rates are low if so like with 64 000 bits per second that's reasonable to send over a network connection even in the 1980s on something like isdn but just but if you want to do better quality audio because that isn't that good quality audio so when you have cd quality audio we're going to have significantly more data we need to compress that down to a smaller size in some way so that we can send it over the network connection and the way we do that it's dead simple um we can just sit in between the a2g converter and the packetizing bits of the section and take that data and compress it and at the other end after you've read it out the packet rather than sending it straight to the da which would sound horrible you can sort of decompress it at that point but again compressing it you're going to have to gather the data together look at it analyze it and then decide how you're going to throw things away because that's what compression is it's working out well i can throw this information away um or and still recover the original signal or something that sounds like real looks like the original signal and again when you decompress it you're going to have the same sort of process the other end so you're going to end up introducing more delay and increase the latency at each end so you've got to balance all these things to try and keep the latency as far below 100 milliseconds as you can to try and keep the conversation as natural as possible as we're sending this across the other thing we have to bear in mind is that this isn't going to be stable every packet we send is going to take a slightly different amount of time to send out what's called jitter on the lenses we're not going to have a constant latency we'll get a constant latency we could probably calculate quite easily but actually the latency will change as we're sending packets across there as we said someone might start using it to watch netflix or the iplayer or something it might take a 
Steve, welcome back. We are obviously still in a remote situation, but you wanted to tell us a bit more about how this kind of connection's working? Yeah, I thought it'd be interesting. Obviously we've talked about remote working, but a lot of what we're doing is using video conferencing or voice-over-IP type software to have conversations with people, not just in a business context but also with friends and family. I've lost track of the number of Zoom calls or FaceTime calls, or FaceTime calls for support, Dave, that I've had to make in the past eight weeks, or however long we've been in lockdown. So I thought it'd be interesting to talk about the technology that's involved, because, as Steve Jobs put it when he announced the iPhone 4 and talked about FaceTime, the technology required to do it is a bit of an alphabet soup. So what I thought we could do is spend some time going over how video conferencing works. The first thing to say is that this is going to be an overview of what's involved; we'll probably come back and do some more in-depth videos on little bits, but this one gives the big picture, and then we can zoom in later. We're also going to keep things simple, so let's simplify this problem down to the sort of
conversation we're having now, which is one person talking to another. We'll come back and talk about how we have multiple people, but I'm sure you can guess how: you just have multiple connections between the different people. And we'll only consider it going in one direction, from me to Sean or from Sean to me, because the other direction is basically the same in reverse. So we'll simplify it right down, and we can have a discussion about what's involved in getting the voice and pictures from my end over to Sean, what we might call mouth to ear: how do we get the voice from my mouth to Sean's ear across all the networks involved, and what's actually involved in setting those things up? You're breaking up there, Steve, say that again... I'm joking, I'm joking. OK, so you mentioned the sound and me hearing what you're saying, but obviously we've got a video element to this as well, important for YouTube as much as anything. Are we going to talk about how the video connection works too? Yeah, we'll talk about video as well, but the important thing when you're having these conversations is that, bizarrely, even on a video conference the audio is more important to following the conversation and having a natural conversation between the two people. If the audio goes wrong, you can't hear what's being said, or the conversation becomes a jumbled mess. So even when video is involved, the video signal can break up, or we can drop its quality, far more easily than the audio signal. If the audio signal starts to break up, the flow of the conversation will break down: people start talking over each other and can't follow what's being said. So we'll start off looking at the audio, and after that we'll talk about how we put video alongside it, because once you understand how the audio works, the video signal goes in a very similar way, with a very similar set of
problems; you just have to send video rather than audio data across the network. When you rang me this morning, as well as my computer starting to bleep to tell me you were trying to connect, my phone started going too. How does that segment of it work? How do we know that we want to make a call? So there are really two things involved with video conferencing. There's the call setup, which is making sure that Sean knows I want to speak to him and that he accepts the call, and establishing how we actually send data between the two of us. And then there's actually sending the data over the connection: taking the audio and the video that we want to send and actually sending them over the internet. I think it's easier to talk about how we send the data first, and then look at how we do call setup, and what's involved in that, in another video, because that actually involves making sure we can get the data to each other; once we know how the data is sent, we know the problems involved in sending it. Talking about how we transfer the data, there are loads of things involved. It may seem like a simple thing to just take a piece of audio and send it over a network, but think about what's involved in making this a conversation and not just a live stream. In a live stream I'm just sending data to Sean, or whoever's watching, and they're enjoying whatever it is, insert your favourite type of YouTube live stream here. But with voice-over-IP or a video conference we're having a conversation, so we need to make sure the connection enables the conversation to happen as if the technology wasn't there; we don't want the technology to get in the way of what we're doing. So we need to think about things like how much bandwidth this is going to take. We've all got network connections, but they have a finite amount of data that we can send. I mean, in an
ideal world we wouldn't worry about it: we could just send all the data coming from the microphone and the camera straight down the network, it would appear at Sean's end, and everything would just work. So basically what we're saying is that we need to shrink that down before it goes through our connections, so that we're not sending too much data, clogging things up and causing latency? Right, yes. We've got oodles of data coming in: the audio for recording CD-quality sound is about 768 kilobits per second, so there's a lot of data going down. And OK, a modern network connection can carry a lot (my home network is 300 megabits down, 35 megabits up, and gigabit symmetrical isn't unheard of these days in a home setting), but that's still a lot of data, so obviously we need to compress whatever we've got. We need to make sure whatever we send fits within the bandwidth available, because it's not just our bandwidth, it's your bandwidth as well. I've probably got a better internet connection than you. No, no, you have, I know that you have; we've done tests transferring material back and forth that prove that. But yes, the point is there are even more considerations: you might live in a household with four or five people, someone streaming a movie at the same time, someone else trying to make their own call. So there are all sorts of considerations, aren't there? Yes, there are lots of things that affect the bandwidth: who else in the house is using it, what you've got, what I've got, and what's in the network between us. If the network gets congested, the bandwidth will drop even though each end is technically capable of sending more. So we need to take that into account. The bandwidth is important because we need to make sure we can actually send the data, that we're not trying to force too much down the pipe, as it were, to the other end. What also is
important is the latency, the time it takes from something coming out of my mouth to Sean hearing it through his ear, what's known as the mouth-to-ear time. The mouth-to-ear time, or the latency of the connection, is one of the key things we need to minimize. The telecoms industry found many years ago that as long as it's below 100 milliseconds, the conversation will flow naturally; people are able to have a conversation as if the technology isn't there. So we want to minimize it as much as possible. If it goes above 100 milliseconds, the conversation starts to break down: I think you've finished speaking, so I start speaking, but because of the delay you think I'm not going to say anything and start saying something, and we talk over each other, then we both stop, then we both start again at the same time. Very quickly, once the latency goes above 100 milliseconds, the conversation degenerates into a "Sean, do you copy, over?" "Yes, receiving, over." So whatever we do, we want to minimize the latency. The network itself has some inherent latency. You can measure it by running the ping command on your computer, seeing how long it takes to get from your computer to another and back, and halving it; that shows you the latency on that network connection. It's related to bandwidth, but not directly: if your bandwidth increases, your latency may go down, but it can also go up. The other thing to bear in mind is that you're never going to get to zero latency, and actually we expect latency in a conversation. As we sit two metres apart, all socially distanced, there's a natural delay as the sound waves propagate from my mouth to someone's ear. So there's always latency in a conversation; what we want to ensure is that whatever we do doesn't add too much more, and some of the things we might do to, say, preserve bandwidth are going to increase latency
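As a rough illustration of how a mouth-to-ear budget adds up, here's a sketch in Python. Every individual figure (codec delay, ping time, buffer depth, packet size) is an invented assumption for illustration, not a measurement:

```python
# Hypothetical mouth-to-ear latency budget, in milliseconds.
# All figures below are illustrative assumptions.

SAMPLE_RATE = 8_000          # samples per second (telephone quality)
SAMPLES_PER_PACKET = 160     # 20 ms of audio per packet

def packetisation_delay_ms(samples_per_packet, sample_rate):
    """Time spent waiting to fill one packet before it can be sent."""
    return 1000.0 * samples_per_packet / sample_rate

def one_way_latency_ms(ping_rtt_ms):
    """Rough one-way network latency: half the measured round-trip time."""
    return ping_rtt_ms / 2.0

budget = {
    "packetisation": packetisation_delay_ms(SAMPLES_PER_PACKET, SAMPLE_RATE),
    "encode":        5.0,                     # assumed codec delay
    "network":       one_way_latency_ms(30),  # assumed 30 ms ping RTT
    "jitter_buffer": 40.0,                    # two 20 ms packets queued
    "decode":        5.0,
}

total = sum(budget.values())
print(total, total < 100)    # 85.0 True: under the 100 ms target
```

The point of laying it out like this is that every stage eats into the same 100-millisecond allowance, so saving bandwidth at one stage can blow the budget at another.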
So we want to balance what we do there against latency. It might actually be better to use more bandwidth and have less latency, or to go for slightly worse compression, an algorithm that doesn't compress as well as it could, to reduce the processing load; you want to keep the latency as low as possible. The other thing you want to take into account is AV sync. We want to make sure the words coming out of my mouth are in sync with the pictures (as they aren't at the moment), because when they drift apart it becomes really hard to follow what's going on; it looks odd. So we also want to maintain sync between what I'm saying and the video. And because they're going to go through different routes, because we're going to compress the video in a different way to the way we compress the audio, we've got to be able to pair them up at the other end so that they appear in sync. Again, that affects latency, because if it takes longer to compress the video than to send the audio, we have to delay one at the other end so that we can match them. So latency is a really important thing in video conferencing; we want to minimize it as far as possible, and a lot of what happens is about minimizing the latency of the connection. Another thing you need to take care of: think about the way this call is working. I'm speaking to Sean, and it's coming out of his loudspeaker, or through his earpiece in my case. But if it's coming from his speaker, it will get picked up again by his microphone as I speak, and I'd hear an echo of my own voice. The solution is either suppressing the echo, by muting one end of the conversation while the other person is speaking, or you can create circuits which cancel the echo, taking the signal that's coming in and feeding it back in antiphase to cancel out the echo. There's lots of maths involved in that, but that's a story for
someone else in another video to cover. So we have to take all these things into account, and of course, as the latency increases, the echo gets more noticeable. In fact, there's one networking standard whose packet size was chosen purely on the amount of time it takes sound to get from one end of France to the other without needing echo cancelling, so they went for 53 bytes or something. But that's another story; ATM is a whole other networking topic that we'll talk about at some point. So: we've got to manage bandwidth, and we need to make sure we fit within what's available on the network, or less if other people are using it, and that can change. You might start a conversation and then someone starts watching Netflix on the same network connection, or the network may get congested elsewhere on the path between you and the other person because lots of people on the same ISP are watching the latest cat video on YouTube, or whatever it is people do on YouTube. The latest Computerphile video, of course; far more interesting than cats. We have mice. So these things will change, and we need to match that in our system, but we also need to keep the latency down and maintain AV sync. So how do we go about building a system? Let's start drawing a diagram of the bits involved. We're going to talk about how we can get the voice from me here to Sean over there. So at Sean's end we have something that looks a bit like an ear, and at my end we have something that looks a bit like two lips. This has changed colour, and we're now going to talk about how we can send a signal from there to there, building it up as we go. The easiest way would be some sort of microphone at this end and a loudspeaker at Sean's end, with a wire connected between the two. As I speak, the microphone converts my voice into an audio signal, an electrical signal that represents the same waveform as my voice, sends
it down the cable, and at the other end the loudspeaker converts it back into sound pressure waves that Sean then hears. This is your standard plain old telephone system; that's exactly how it worked, except the cable was controlled by switches, but effectively once it was set up you had a connection between the two. We're talking about sending this over computers, though, so we're going to have to send it digitally. So we take the signal out of the microphone and run it through an A-to-D converter, and what we get out the other end is a series of bytes representing that audio signal. It's a mono signal, so we get a single stream of bytes, and we can adjust the quality by choosing the sample rate and how many bits we use to represent the audio. We could go for CD-quality sound or better, which would give us a lot of data to send, or we could do what the phone companies originally did and sample it at only eight kilohertz, which is two times four kilohertz, the highest frequency you normally get in a voice signal, so you capture all the data required for voice and nothing else. And you use eight bits per sample, which gives you 64,000 bits per second. So that gives us a stream of data bytes, one arriving every eight-thousandth of a second; let's just go with that to keep the numbers simple. So every eight-thousandth of a second we get another byte of data representing my voice, and it comes out as a chain. At the other end we have the opposite: a D-to-A converter, digital to analogue, which regenerates the analogue signal, which can drive the loudspeaker. We could then send this across and it would all work; we've digitized it, but we still need that cable in there. So what do we do? Well, we need to send this over our network, and our network, the internet, is packet-based. There's been a long
debate among telecoms people: do you go for circuit-switched networks, or datagram, packet-based networks, where you just put things in a packet, address the packet, and send it like you're sending a letter to the right person? Packets seem to have won out; circuit-switched ones like ATM seem to have pretty much died out, or get used to send packets over anyway. So we're going to build the bytes up into packets, but now we've got an interesting problem. We've got data coming every eight-thousandth of a second from the A-to-D converter, and we need to put it into packets. If we just went for, say, a thousand samples, that would be an eighth of a second at an 8-kilohertz sampling rate, which is over 100 milliseconds right there. So we probably want to choose a smaller packet size. As soon as we start building things up into a packet, we've got to wait for the right number of bytes to come in before we can send the packet out, and that introduces some latency, so we need to choose a packet size small enough not to introduce too much. Say we had just one byte in each packet. We could send that, and it wouldn't add any latency, but then we'd be wasting bandwidth, because that one byte would carry many bytes of packet headers, the IP header and so on that you need to send it across, so that would be very wasteful. So you need to choose the right number of bytes to make this fit. If you make it too many, you use your bandwidth better but you increase your latency, so you have to rob Peter to pay Paul to find the right balance, and of course as the network changes, those conditions change too, so you may have to adjust it as the call goes on. So we're suddenly going to get a delay here from building things into a packet. Then you get that packet at the other end and you take the bytes
out of it one by one and feed them to your D-to-A converter, one every eight-thousandth of a second. There's still an issue here, though. As you read the bytes out of the packet one by one and send them to your D-to-A converter, when you come to the end of the packet you need the next packet, with the next set of bytes, to be there by the next eight-thousandth of a second. If it arrives later than that, you haven't got the data, so you can't send anything to the D-to-A converter. What do you do? You end up having to play silence or whatever. So we need it there within an eight-thousandth of a second, but if you backtrack to my end of the connection, I can only send a packet when the network is clear. If I'm sending another packet, say a 1500-byte packet from a network transfer where I'm sending this video to you, Sean, at the same time, that's going to take some time to send, and I can't send my packet of audio data until it's finished, which delays it a bit more, which delays it at the other end. So although we've got the packets, we're going to end up with some buffering just because the network is being used for other things, and every step along the network introduces some buffering as well. That means that at your end there's a very good chance you won't have the data you need to feed to your D-to-A converter when you finish going through a packet. So we need to put something in place to make that work, and the way we typically do it, as in a live stream, is to put a buffer here and start buffering, queueing up those packets, so we know we've always got two or three in memory. If one arrives late, it goes on the end of the queue, and by the time we come to it, we've got to that point
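The buffering idea just described can be sketched as a toy jitter buffer. This is purely illustrative (the prefill depth and the 80-byte silence payload are assumptions, not from any real implementation): packets queue up on arrival, playout only starts once a small backlog exists, and if the queue runs dry we play silence.

```python
# Toy jitter buffer: hold a few packets before starting playout,
# so a late arrival joins the back of the queue instead of causing a gap.

from collections import deque

class JitterBuffer:
    def __init__(self, prefill=3):
        self.queue = deque()
        self.prefill = prefill    # packets to hold before playing
        self.playing = False

    def on_packet(self, payload):
        """Called whenever a packet arrives from the network."""
        self.queue.append(payload)
        if len(self.queue) >= self.prefill:
            self.playing = True   # enough backlog: start playout

    def next_chunk(self):
        """Called every packet interval by the D-to-A side."""
        if self.playing and self.queue:
            return self.queue.popleft()
        return b"\x00" * 80       # ran dry (or still prefilling): silence

buf = JitterBuffer(prefill=2)
buf.on_packet(b"a" * 80)
print(buf.next_chunk() == b"\x00" * 80)   # True: still prefilling
buf.on_packet(b"b" * 80)
print(buf.next_chunk() == b"a" * 80)      # True: playout has started
```

The design trade-off is exactly the one in the transcript: a deeper prefill absorbs more jitter but adds a fixed amount of latency to every sample.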
but that introduces more delay, so we want the buffer to be as small as possible, perhaps two or three packets, with a packet size small enough that we don't notice the additional latency. But there's another issue. Think back to the video we did on VPNs, where we talked about the two ways we can send data over the network. We can use a TCP connection, which tries to emulate a stream between the two ends: we send every byte down it, and it arrives at the other end, guaranteed, one after another, in the order it was sent. You'd think that's great, exactly what we want for an audio conversation: we want all the data to get there so you can hear everything I said. But if you remember, the way TCP achieves that is that if it doesn't receive something, it waits and asks for it to be sent again. If that happens, that packet is delayed, and because everything is guaranteed to arrive in order, every byte after it is delayed by the same amount, so the latency increases. And since packets will get lost on the internet, because it's a best-effort network and is guaranteed to lose packets, packets keep getting lost, and the latency of the conversation creeps up and up, imperceptibly, until by the end of the call it could reach a point where you notice it. So TCP perhaps isn't as good an idea for this as you'd think. The alternative is just to send the packets out and let the network do its best to deliver them, and if they get lost, they get lost. It turns out that, certainly for audio and also for video, this isn't as much of a problem as you might expect, because if your packet size is small enough, say we're sending 80 bytes, that's a hundredth of a second, and if there's a hundredth-of-a-second dropout
where the packet gets lost, you're probably not going to hear it on the audio. You'll be able to pick up the conversation as it goes, the next packet will come along, and all you'll get is a small bit of silence, like you're hearing in this bit now. Actually, that's quite natural to us, because in ordinary conversations there are other sounds which might cover a bit of what's said, and we're used to that sort of thing. So it's better to let packets go missing, carry on, and deal with what we've got than to try to guarantee everything arrives. So we send these things using UDP rather than TCP, because that allows packets to get lost, and as we said, we won't notice that in a conversation, whereas we would notice the build-up of latency as a TCP connection delayed things, with everything getting resent and resent. So we use UDP. Now, rather than just stuffing the data into the UDP packet as is, we prepend another header to the data, this time for what's called the Real-time Transport Protocol, RTP. This gives us some details about what type of data is in this stream of packets: is it audio, is it video, what codec is it using to compress things, and so on. It also has a source identifier, a unique, randomly generated ID for a stream, so we know these packets are all coming from the same source stream, the same audio stream or the same video stream, which lets you multiplex. And we have a sequence number, so we know when packets have been lost, or what order they need to go in, because the other thing to bear in mind is that the network might reorder things, since packets can go different routes, in which case we want to reorder them ourselves back into the correct order
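The header being described is the fixed 12-byte RTP header from RFC 3550, and building a packet is just a matter of packing those fields in front of the payload. A minimal sketch (the field values here, payload type 0 for PCMU, the SSRC, and so on, are illustrative choices, not from the transcript):

```python
# Build an RTP packet: 12-byte fixed header (RFC 3550) + payload.

import struct

def rtp_packet(payload_type, seq, timestamp, ssrc, payload):
    version = 2                    # RTP version 2
    byte0 = version << 6           # no padding, no extension, no CSRCs
    byte1 = payload_type & 0x7F    # marker bit clear
    header = struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)
    return header + payload

pkt = rtp_packet(payload_type=0,       # 0 = PCMU telephone audio
                 seq=1,                # incremented per packet
                 timestamp=160,        # first sample's time, in sample units
                 ssrc=0x12345678,      # random stream identifier
                 payload=b"\x00" * 160)
print(len(pkt))                        # 172: 12-byte header + 160 bytes audio
```

The receiver reads the sequence number back out to reorder or discard packets, and the SSRC to tell interleaved streams apart.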
otherwise we'll get the audio jumbled up. If we're buffering things, we get the opportunity to do this, or we can just drop packets that arrive out of order, and so on. We also have a timestamp, which tells us when this should be played: a time that says when it came from, or when it should be played, and this is what we use later to synchronize the video to the audio. We stamp the audio with the time it came from, we stamp the video with the time it came from, and that tells us when the first byte in that packet should be played out. That works fine when the data rates are low: 64,000 bits per second was reasonable to send over a network connection even in the 1980s, on something like ISDN. But if you want better-quality audio, because that isn't great-quality audio, say CD-quality audio, we're going to have significantly more data, and we need to compress it down to a smaller size so we can send it over the network connection. The way we do that is dead simple: we sit in between the A-to-D converter and the packetizing bits, take the data and compress it, and at the other end, after you've read it out of the packet, rather than sending it straight to the D-to-A, which would sound horrible, you decompress it at that point. But again, to compress it you're going to have to gather the data together, look at it, analyze it, and decide what to throw away, because that's what compression is: working out what information can be thrown away while still recovering the original signal, or something that sounds like the original signal. And when you decompress it, you've got the same sort of process at the other end, so you introduce more delay and increase the latency at each end. So you've got to balance all these things to try to keep the latency as far below 100
milliseconds as you can, to keep the conversation as natural as possible. The other thing to bear in mind as we send this across is that it isn't going to be stable. Every packet we send takes a slightly different amount of time to arrive, what's called jitter. We're not going to have a constant latency; if the latency were constant we could probably compensate for it quite easily, but actually it changes as we send packets across. As we said, someone might start using the connection to watch Netflix or iPlayer, or packets might take a different route over the network, so one packet may take two milliseconds to get there, the next 2.1, the next 1.9, depending on fluctuations in the network. So you need the buffering in place to cope with that, and you also need to monitor it and perhaps adapt your compression rate: if things start to get more latent you can compress harder, so you've got smaller packets with a better chance of getting through, and when you've got more bandwidth you can compress less, whatever maintains the call quality, while keeping the latency as low as possible as you do it. So that's how we do audio, and we can do the same in the opposite direction so that Sean can talk to me, and as I said, there's perhaps a bit of echo cancellation in there, which again delays things a bit as we process it. It's really weird if you start to hear the lag on your own voice in a conversation: if you play your own voice back with even a 10-millisecond delay, it really changes the way you speak; you end up slowing yourself down as you hear the delay, and it's really strange. So that's how we can send the audio. But what about video? Well, it works in pretty much the same way. We have a camera that captures our picture, at 60 frames per second, 50 frames per second, 25
frames, even as low as 12 frames per second, as we're sending the data across. And again we're going to need to compress that down, because the video stream is likely to be too big. Unless we've got a really fabulous bespoke video-conferencing setup with guaranteed bandwidth between the two ends, dedicated leased lines or whatever it is we've laid down, we're going to have to compress, and even with that, we'd probably still compress it down to a smaller size. That's going to take time, and then we send it across, with timestamps and sequence numbers, and rebuild it at the other end to display on screen, the same way we do with the audio. One of the things we can do, though, because we know the image isn't going to change much in a typical video conference (the background behind me is static, the background behind Sean is static), is use inter-frame compression to throw away information that doesn't change from frame to frame, for example the background. But the way we normally do that is by working on a sequence of, say, 15 frames, or 30 frames, or 60 frames, and compressing them all in one go, which is going to introduce delay, so we need to design our codec so that it doesn't introduce those delays. And you can start to do even more things, like breaking the image down into horizontal slices, so that you can compress the top slice first, which, as the image comes off the camera, will often arrive before the bottom half. If the machine is multi-threaded or has hardware support, we can be compressing one slice while we're sending another over the network, so we reduce the latency even more. And when it's received at the other end of the conversation, we can synchronize it with the audio using the timestamps in the RTP headers that tell us when each came from, so we can match them up, delaying each as needed, so they can be played out in
sync to whoever is watching the video conference, and the same can happen in reverse. So we can now send the data over the network: we've got an RTP stream containing the audio and an RTP stream containing the video, they both carry timestamps, and so we can synchronize them. The problem we've got, and this is because of how the internet has developed, with so many people using it, is how we make sure we're sending this to the right computer, particularly as the computers are likely to be hidden behind network address translation and firewalls and so on. So in another video we'll look at how we set up the call, and how we can make sure that my computer, behind my network and my network address translation, can talk to Sean's computer behind his NAT router and his network address translation. I know, wouldn't things be simpler if we just went to IPv6?
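The "delaying each of them as needed" step for lip-sync can be sketched very simply: whichever stream's data for a given capture timestamp is ready first gets held until the other catches up. The readiness times below are invented for illustration:

```python
# Toy lip-sync: hold the earlier-ready stream until both can play together.

def sync_delay_ms(audio_ready_ms, video_ready_ms):
    """How long to hold each stream so frames with the same RTP
    timestamp play out together. Arguments are when each stream's
    data for that timestamp is decoded and ready, in milliseconds."""
    playout = max(audio_ready_ms, video_ready_ms)
    return playout - audio_ready_ms, playout - video_ready_ms

# Video takes 80 ms to arrive and decode, audio only 30 ms, so the
# audio is held for 50 ms and the video not at all.
audio_delay, video_delay = sync_delay_ms(30, 80)
print(audio_delay, video_delay)   # 50 0
```

This is also why heavy video compression costs overall latency: the audio, which could have played sooner, has to wait for the slower video pipeline.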