Cellphones & Big Data - Computerphile

Network Operators and Disease Surveillance: A Complex Issue

Mobile phone data was used quite extensively in Africa, from the recent Ebola crisis through to the recent Zika virus and various other diseases. Network operators have been working with local authorities and international organizations to identify areas that are likely to be prone to the spread of disease, communities that may be at risk of contracting it, and communities that are already infected, so that they can estimate where a disease might spread next.

That sets warning bells ringing and leads to a slightly clickbait question: we've seen large organizations be hacked, so should we be worried that all of this data about us is being collected? No, we shouldn't be worried. When mobile phone operators collect the data, it is anonymized, at least to some extent, before it reaches their systems. What they work with internally is also already anonymized, so no one actually gets to see the raw data itself.

What Sorts of Things Are Happening with This Data?

The data ends up being used for various different things. What's your experience of working with it personally? I've done a lot of work around tracking activity-based land use and tracking transport demand. With transport demand, you're trying to figure out how many people are going from A to B, how many from B to A, how many from C to D, et cetera. This gives you a really quick overview of transport patterns. For example, if you can see how many people are going from London to Manchester and how many from Manchester to London, you can try to figure out where you might want to build the next rail line or a new motorway.
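As a rough illustration of counting how many people go from A to B and from B to A, the sketch below builds origin-destination counts from a list of sightings. The event layout `(handset_id, timestamp, zone)` and the rule that any change of zone between two consecutive sightings counts as one journey are assumptions made for this example, not something the source specifies.

```python
from collections import Counter, defaultdict


def od_counts(events):
    """Count journeys per (origin, destination) pair.

    events: iterable of (handset_id, timestamp, zone) tuples.
    Assumption: a change of zone between two consecutive sightings
    of the same handset counts as one journey.
    """
    by_handset = defaultdict(list)
    for handset_id, timestamp, zone in events:
        by_handset[handset_id].append((timestamp, zone))

    counts = Counter()
    for sightings in by_handset.values():
        sightings.sort()  # order each handset's sightings by time
        for (_, origin), (_, destination) in zip(sightings, sightings[1:]):
            if origin != destination:
                counts[(origin, destination)] += 1
    return counts


# Hypothetical sample data: timestamps are hours of a single day.
events = [
    ("h1", 8, "London"), ("h1", 11, "Manchester"), ("h1", 18, "London"),
    ("h2", 9, "London"), ("h2", 12, "Manchester"),
    ("h3", 7, "Manchester"), ("h3", 8, "Manchester"),
]
counts = od_counts(events)
print(counts[("London", "Manchester")], counts[("Manchester", "London")])
```

A real analysis would need extra rules, such as a minimum dwell time, to separate genuine journeys from a handset flickering between neighbouring towers.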

Activity-Based Land Use: Identifying Urban Patterns

Activity-based land use, by contrast, looks at the city itself. It focuses on identifying where people work, where the residential areas are, where the commercial areas are, how all these different things fit together, and how cities evolve over time. By looking at the locations that people visit most frequently, researchers can use mobile phone data to understand urban patterns and how they change.
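One way to picture this is to infer a likely home and work tower for each handset and then count how many homes and workplaces each tower serves. The source only mentions the most frequently visited night-time location as a proxy for home. The daytime rule for work, the hour ranges, and all names in the sketch below are assumptions for illustration.

```python
from collections import Counter

# Assumed hour ranges; the source does not define "night" or working hours.
NIGHT_HOURS = set(range(0, 6)) | {22, 23}
WORK_HOURS = set(range(9, 17))


def most_frequent_tower(events, hours):
    """Return the tower seen most often during the given hours, or None."""
    counts = Counter(tower for hour, tower in events if hour in hours)
    return counts.most_common(1)[0][0] if counts else None


def home_and_work(events):
    """events: list of (hour_of_day, tower_id) for one handset."""
    return (most_frequent_tower(events, NIGHT_HOURS),
            most_frequent_tower(events, WORK_HOURS))


def land_use_counts(events_by_handset):
    """Count, per tower, how many handsets appear to live or work there."""
    homes, workplaces = Counter(), Counter()
    for events in events_by_handset.values():
        home, work = home_and_work(events)
        if home is not None:
            homes[home] += 1
        if work is not None:
            workplaces[work] += 1
    return homes, workplaces


# Hypothetical sightings for two handsets.
events_by_handset = {
    "h1": [(23, "T1"), (2, "T1"), (3, "T2"), (10, "T7"), (14, "T7"), (15, "T3")],
    "h2": [(1, "T1"), (4, "T1"), (11, "T7"), (13, "T7")],
}
homes, workplaces = land_use_counts(events_by_handset)
print(homes, workplaces)
```

Towers with many inferred homes would suggest residential areas and towers with many inferred workplaces would suggest employment or commercial areas. Repeating the count over time gives a crude view of how a city changes.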

The Challenge of Collaboration

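Is there enough information in collaborating with one network operator, or do you need to work with lots of different providers? One of the problems, which is not unique to mobile phone data but is part of any observational study, is bias: you only ever get a snapshot of what is actually happening. A single network operator doesn't generally cover the entire network or represent the entire population. To accommodate this, researchers use what's called an expansion factor. For example, in the UK, they might look at a particular area and identify 100 people whose most frequently visited night-time location, taken to be their home, is there. They then divide the census population for that area by the number of homes identified, which gives a scale factor. This expansion factor is used throughout the analysis to adjust the number of journeys recorded, giving an approximation of what a national travel survey would show. There are various other methods, but this is one of the most commonly used, particularly in areas where census information is available.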
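The arithmetic can be sketched in a few lines. The figure of 100 identified homes comes from the example above, while the census population of 400, the journey counts, and the function names are hypothetical values chosen only to make the calculation concrete.

```python
def expansion_factor(census_population, identified_homes):
    """Census population divided by the homes identified in the phone data."""
    if identified_homes <= 0:
        raise ValueError("need at least one identified home")
    return census_population / identified_homes


def expand_journeys(observed, factor):
    """Scale observed journey counts up to a population-level estimate."""
    return {pair: count * factor for pair, count in observed.items()}


# 100 identified homes is from the example; 400 residents is made up.
factor = expansion_factor(census_population=400, identified_homes=100)

# Hypothetical observed journeys for handsets whose home is in this area.
observed = {("A", "B"): 25, ("B", "A"): 20}
scaled = expand_journeys(observed, factor)
print(factor, scaled)
```

In practice a separate factor would presumably be computed for each area, since an operator's market share varies from place to place.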

Cell Towers: Limitations in Locating Individuals

Does knowing someone's most occupied night-time location make it possible to identify them from their phone? Sort of, but the limitations of cell towers matter here. With GPS, you could say a person is here, a person is there, and so on. With a cell tower, all you know is that the handset is connected to that tower, whose coverage area can have a radius of a couple of hundred meters. In rural areas, outside large cities, towers can be a couple of kilometers apart, so the uncertainty is even greater. All you know is that the home location is somewhere inside that coverage area.

The Difficulty in Discerning Individual Locations

Where neighboring towers' coverage areas overlap, a handset could be anywhere within one tower's circle or in the overlap between them, and that overlap itself tends to be a couple of hundred meters across. This makes it really difficult to pinpoint exact locations. As more people move towards living in high-rise buildings, it becomes even harder to discern where individuals actually are.

Today I'm going to talk a little bit about mobile phone data: what kind of data mobile phone operators record, and also why they're recording it. It's not sinister. It's actually got some really good reasons, and it's got a couple of really good things that you might not necessarily think of at first glance.

We'll probably start with what mobile phone data actually is, how it's being collected, and what is actually being collected. The most important thing with any kind of event (we call these network events) is that they record a time, so this is a day, month, year and the time of day. Then they record an ID for the handset that it is coming from, so this would be your mobile phone. This ID is not in plain text, so no one actually knows that's you. Rather, the ID is being hashed, so it's just an anonymized ID, essentially. And it also records an identifier for the cell, so the cell tower dish that actually serviced your call or your SMS or other event. So in general it's the time, the identifier for your handset, and the location.

Now, there's a number of other things being collected, but they're mainly of interest to the operator itself. This is something like: how much did this phone call cost, for example? What account did they charge it to? Is that something that was included in your contract? Is that something that was not included in your contract and has to be taken off your balance, in the case of pay-as-you-go? Or is that something that has to be stuck on at the end, and then you see an extra charge on your bill? There are also things like what country code you called from. Were you someone from overseas who is visiting here, so they know they need to charge this back to your network operator in the other country? And those are basically the most important things that are being collected. They never collect any information about what is actually inside your event. Let's say you send an SMS: they might know how many characters your SMS was, but they never know what the content of the SMS itself was.

Generally, network operators differentiate between what are called active events and passive events. Active events are things that you actually do: you send an SMS, you make a phone call or receive a phone call, you browse the internet, you move from what's called a location area to another one. With a location area, you've got multiple cell towers in various places, and some of these cell towers are grouped together into what's called a location area. If you move from one location area to another, so let's say you go from here all the way over here, they would know when you cross this boundary, and it would basically make a note that you crossed that boundary. They also record if you switch between 2G, 3G and 4G.

In terms of passive events, there is a ping that's being sent out every couple of hours. Let's say your phone is stationary somewhere, you're not using it, you're not doing anything for three hours. It would then ping and basically say: that person is still active, that phone is still a thing. And those are the general events that are being recorded.

Now, the time between them varies. If you think about how many times you're sending an SMS or making a phone call compared to what you did 10 years ago, that's probably a lower number right now. Whereas if you think about how much you're using the mobile internet now compared to about three or four years ago, that number is probably much, much higher. The numbers also increase with the number of smartphones that we've got in general circulation, so we're talking about all kinds of phones here.

So what makes this a computer science question then?

One of the things is that there's a huge amount of data being generated, and there's quite a huge interest in data science right now in the computer science community and also in the wider community. It also has a lot of application areas beyond traditional computer science and analyzing what's happening in the network: fields like epidemiology, urban planning, transport planning, finance, and a lot of other fields where that data can generate huge insights.

Epidemiology is the study of the spread of diseases, so one thing people might be interested in is trying to predict where diseases are going to spread next. This was used quite extensively in Africa, from the recent Ebola crisis all the way to the recent Zika virus and various other diseases, where network operators have been working with local authorities and international organizations to try to identify areas that are going to be prone to the spread of diseases, or to identify communities that may be at risk of contracting diseases, or infected communities, and trying to identify where that disease might spread next.

Warning bells, klaxons: you know the kind of clickbait question I'm about to ask then. We've seen large organizations be hacked. Should we be worried that all of this data about us is being collected?

No, we should not be worried. A lot of organizations, and by organizations I mean mobile phone operators, when they collect the data, it's being anonymized before it actually reaches their systems, to some extent. And what they internally work with is also already anonymized, so no one actually gets to see the raw data itself.

What sorts of things are happening with this data then? So we eventually collect it for various different things. I mean, what's your experience of doing so personally?

I've done a lot of work around tracking activity-based land use and tracking transport demand. Transport demand is where you're trying to figure out how many people are going from A to B, how many people are going from B to A, how many people are going from C to D, et cetera. This gives you a really quick overview of transport patterns in the city. Let's say you want to see how many people are going from London to Manchester, how many people from Manchester to London, etc., so you can try to figure out where you want to maybe build the next rail line or where you maybe want to build a new motorway. Whereas activity-based land use is looking at a city itself, and it's focusing on identifying where people work, where we've got residential areas, where we've got commercial areas, how all these different things fit together, and how cities evolve over time.

Is there enough information in, say, collaborating with one network operator to make this viable, or do you need to work with lots and lots of different providers?

Now, that is a tricky one, and you kind of hit the nail on the head. One of the problems, and this is something that's not just inherent to analyzing mobile phone data but also part of any observational study you do, is that you get some kind of bias, because you always only get a snapshot of what is actually happening. So we are trying to accommodate for the fact that one network operator doesn't generally cover the entire network and represent the entire population by using what's called an expansion factor. You try to figure out different ways to come up with an expansion factor. One way is, for example, in the UK you might look at a particular area where you identify, let's say, 100 people as having their most frequently visited location during the night there, which is their home. Then you say: okay, we look at the census population and we look at the homes that we identified, we divide, and that gives us the scale factor. We use that scale factor, or expansion factor, throughout to adjust all of the journeys they're doing, and that gives us a slight approximation of what you would see in a national travel survey. There are various other methods, but that's one of the most commonly used ones, particularly in areas where you've actually got census information available.

When you talk about something in those generic terms of their most occupied night-time location, that feels like it would be a way to identify someone from their phone, just because you could see where they lived. Is that right?

Yeah, sort of. The thing to bear in mind there is: let's say you look at a cell tower and you have the coverage area of that cell tower. Traditionally, with GPS, you could say okay, a person is here, a person is here, a person is here, a person is here. So you know their home location is probably somewhere inside that circle. But with a cell tower, you just know they're connected to a tower, which can have a radius of a couple of hundred meters. Particularly if you're in rural areas, so if you're outside large cities, they can be a couple of kilometers apart. And you just know their home location is somewhere inside that couple-hundred-meter radius. Or, because the cell towers are next to each other and their coverage areas overlap, you can be anywhere along this circle here. Now, you could potentially be in the middle, but generally that middle also tends to be a couple of hundred meters across, and it becomes really difficult. Especially as we move more towards living in high-rise houses, it becomes really difficult to discern where people actually are.

So sometimes the floppies would die, so you often would make backup copies. Let's try this one; sounds more hopeful. And so there was this game called Lander, which became the full game Zarch, and when this game was released it was sort of full 3D graphics, which
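The network event described in the transcript (a timestamp, a hashed handset ID, and the serving cell) could be modelled roughly as below. This is a sketch under assumptions: the transcript only says the ID is hashed, so the keyed HMAC-SHA256, the field names, and the event kind labels are illustrative choices rather than any operator's actual scheme.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime

# Hypothetical labels for the active and passive events mentioned in the talk.
ACTIVE_KINDS = {"sms", "call", "data", "location_area_update", "network_switch"}
PASSIVE_KINDS = {"periodic_ping"}


@dataclass(frozen=True)
class NetworkEvent:
    timestamp: datetime   # day, month, year and time of day
    handset_hash: str     # hashed handset ID, never the plain-text ID
    cell_id: str          # identifier of the cell that served the event
    kind: str             # one of ACTIVE_KINDS or PASSIVE_KINDS

    @property
    def is_active(self) -> bool:
        return self.kind in ACTIVE_KINDS


def hash_handset_id(raw_id: str, secret_key: bytes) -> str:
    """Assumed scheme: keyed HMAC-SHA256 of the raw handset ID."""
    return hmac.new(secret_key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"demo-key-not-a-real-secret"  # placeholder key for the example
sms = NetworkEvent(datetime(2016, 5, 1, 14, 30),
                   hash_handset_id("447700900123", key), "cell-0042", "sms")
ping = NetworkEvent(datetime(2016, 5, 1, 17, 30),
                    hash_handset_id("447700900123", key), "cell-0042",
                    "periodic_ping")
print(sms.is_active, ping.is_active, sms.handset_hash == ping.handset_hash)
```

Because the same raw ID always maps to the same hash, events from one handset can still be linked over time, which is what the home-location and journey analyses rely on. Note that no message content appears anywhere in the record.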