Dealing with Dynamic Data - Computerphile

**Decision Trees and Drift Handling**

Decision trees are constructed by splitting nodes on the attribute that gives the most information gain. As the data stream changes, the information gain we would get at some of these nodes may change. When a new instance arrives, it may tip the balance and make a split worth making: we make that split, continue growing the tree, and then the instance can be discarded, because we no longer need it. The information from it is still captured in the model. So we've got implicit and explicit drift handling approaches, and they can also be combined into hybrids.
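Below is a minimal sketch of that idea in Python: a leaf node keeps only class counts per attribute value (not the instances themselves) and proposes a split once some attribute's information gain is high enough. This is an illustration, not the full Hoeffding tree algorithm used in stream mining, which additionally uses the Hoeffding bound to decide when enough instances have been seen; the `IncrementalNode` name and the `min_gain` threshold are illustrative.

```python
import math
from collections import defaultdict

def entropy(counts):
    """Shannon entropy of a dict of class counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

class IncrementalNode:
    """A leaf that keeps only summary statistics; raw instances can be thrown away."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # stats[attribute][value][class] -> count, enough to compute information gain later
        self.stats = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def learn_one(self, x, y):
        self.class_counts[y] += 1
        for attr, value in x.items():
            self.stats[attr][value][y] += 1

    def information_gain(self, attr):
        total = sum(self.class_counts.values())
        before = entropy(self.class_counts)
        after = 0.0
        for value, counts in self.stats[attr].items():
            weight = sum(counts.values()) / total
            after += weight * entropy(counts)
        return before - after

    def best_split(self, min_gain=0.1):
        """Return the attribute worth splitting on, if any."""
        gains = {attr: self.information_gain(attr) for attr in self.stats}
        if not gains:
            return None
        attr = max(gains, key=gains.get)
        return attr if gains[attr] >= min_gain else None
```

Each call to `learn_one` updates the counts and the instance itself is never needed again, which is exactly the property the paragraph above describes.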

**Combining Explicit and Implicit Drift Handling**

Explicit drift handling is very good at spotting sudden drift: any sudden change causes a sharp drop in performance that is easy to pick up with a simple statistical test. Gradual drift is more difficult to identify, simply because if you look at the previous instance, or say the previous ten instances, you won't see a significant change, so it is a lot harder to detect. But when we add implicit drift handling on top of the explicit approach, we can also deal well with gradual drift.
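As a hedged sketch of the explicit side, here is a simple detector that watches the classifier's error rate and flags drift when it rises well above the best level seen so far. It is loosely inspired by the Drift Detection Method (DDM) rather than a faithful implementation of it; the class name and threshold are illustrative.

```python
import math

class ErrorRateDriftDetector:
    """Flag drift when the running error rate rises well above its best observed level."""

    def __init__(self, drift_threshold=3.0):
        self.drift_threshold = drift_threshold
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_s_min = float("inf")

    def update(self, prediction_was_wrong):
        self.n += 1
        self.errors += int(prediction_was_wrong)
        p = self.errors / self.n                      # running error probability
        s = math.sqrt(p * (1 - p) / self.n)           # its standard deviation
        self.p_s_min = min(self.p_s_min, p + s)
        # Drift if the error level sits several "standard deviations" above the best seen.
        drift = self.n > 30 and (p + s) > self.p_s_min + self.drift_threshold * s
        if drift:
            self.reset()  # start monitoring the new concept from scratch
        return drift
```

In use, you would feed it one boolean per prediction (`detector.update(y_pred != y_true)`) and rebuild or prune the model whenever it returns `True`.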

**Maintaining Performance on Fast Streams**

By combining the implicit and explicit drift handling methods, we end up with a performance plot where we maintain pretty good performance for the entire duration of the data that's arriving. A changing data distribution is not the only problem streams pose, though. Imagine a very high-volume, high-speed stream with a lot of data arriving in a very short amount of time. If a single instance takes, say, five seconds to process, but in those five seconds ten more instances arrive, you build up a backlog of instances very quickly. So the model update stage needs to be very fast to avoid any backlog.
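The basic loop is always the same: predict with the current model, then update it once the true label is available. A minimal sketch is below; `predict_one` and `learn_one` are hypothetical methods on whatever incremental model you use, and the warning simply makes the backlog problem visible when per-instance processing exceeds the average inter-arrival time.

```python
import time

def run_stream(stream, model, arrivals_per_second=100):
    """Predict-then-update loop; warn when processing exceeds the arrival budget."""
    budget = 1.0 / arrivals_per_second
    for x, y in stream:                   # y assumed available straight away (see text)
        start = time.perf_counter()
        y_pred = model.predict_one(x)     # 1. classify with the current model
        model.learn_one(x, y)             # 2. fold the new labelled instance into the model
        elapsed = time.perf_counter() - start
        if elapsed > budget:
            print(f"warning: {elapsed:.4f}s per instance exceeds the "
                  f"{budget:.4f}s budget; a backlog will build up")
        yield y_pred
```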

**The Problem with Storing Data**

The second problem is that with these algorithms we're not going to have the entire history of the stream available to create the current model, so the models need to be updated incrementally. For example, the single-pass algorithms can say: we don't need the historical data, we already have the information we need from it, so we never have to access it again. Otherwise, you end up with huge data sets having to be reused to rebuild these models all the time. And again, these streams are potentially infinite: we don't know when they're going to end or how much data they're going to contain.
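A concrete example of the single-pass idea is keeping running summary statistics. The sketch below uses Welford's algorithm to maintain a mean and variance in constant memory; every instance updates the summary and is then discarded, which is exactly the "information without the instance" property described above.

```python
class RunningStats:
    """Single-pass mean and variance via Welford's algorithm; no instances are stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Each value updates the O(1) summary and is never looked at again.
stats = RunningStats()
for value in (2.0, 4.0, 4.0, 5.0, 9.0):   # stand-in for an unbounded stream
    stats.update(value)
print(stats.mean, stats.variance)
```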

**Adaptation of Classic Machine Learning Algorithms**

Most of the well-known machine learning algorithms have been adapted in various ways to be suitable for streams, so they now include update mechanisms and are more dynamic methods. This includes decision trees, neural networks, and k-nearest neighbours; clustering algorithms have also been adapted. Basically, for any classic algorithm you can think of, there are now multiple streaming versions of it.
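One of the simplest such adaptations is k-nearest neighbours restricted to a sliding window of recent instances, so old data naturally falls out of the model. The sketch below is illustrative rather than any particular published variant; the class name and parameters are assumptions.

```python
import math
from collections import Counter, deque

class SlidingWindowKNN:
    """k-NN over only the most recent `window_size` labelled instances."""

    def __init__(self, k=3, window_size=500):
        self.k = k
        self.window = deque(maxlen=window_size)   # oldest instances fall off automatically

    def learn_one(self, x, y):
        self.window.append((x, y))

    def predict_one(self, x):
        if not self.window:
            return None

        def vec(d):
            return [d[key] for key in sorted(d)]  # consistent feature order

        neighbours = sorted(
            ((math.dist(vec(x), vec(xi)), yi) for xi, yi in self.window),
            key=lambda pair: pair[0],
        )[: self.k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]
```

The window size plays the same role discussed earlier: too large and short-lived drifts are washed out, too small and the model has little data to work with.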

**Software for Streaming Algorithms**

There's software such as the MOA (Massive Online Analysis) suite, which interfaces with the WEKA data mining toolkit. It is free to download and use, and includes implementations of a lot of popular streaming algorithms. It also includes ways to synthesize data streams, essentially generating a stream of data that you can then run the algorithms on, where you can control the amount of drift you get, how sudden it is, and things like that.
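To show the kind of thing such generators do (this is a hand-rolled NumPy sketch, not the MOA API), here is a synthetic binary-classification stream whose true decision boundary shifts either suddenly or gradually at a chosen point; all function and parameter names are illustrative.

```python
import numpy as np

def drifting_stream(n=10_000, drift_at=5_000, sudden=True, seed=0):
    """Yield (x, y) pairs whose true decision boundary moves after `drift_at` instances."""
    rng = np.random.default_rng(seed)
    for i in range(n):
        x = rng.uniform(0, 1, size=2)
        if i < drift_at:
            threshold = 0.5                       # original concept
        elif sudden:
            threshold = 0.7                       # sudden drift: boundary jumps
        else:
            progress = min(1.0, (i - drift_at) / 1000)
            threshold = 0.5 + 0.2 * progress      # gradual drift: boundary slides over 1000 instances
        y = int(x[0] > threshold)
        yield x, y
```

Running a static classifier over this stream and plotting its accuracy makes the drop at the drift point, and the benefit of model updates, easy to see.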

**Real-World Applications**

For big data streams specifically, there's software such as the Spark Streaming module for Apache Spark, as well as the more recent Apache Flink, which are designed to process very high-volume data streams very quickly. Beyond the tools people can download and play with, in the real world the websites and services we use every day are using these streaming algorithms. A lot of the big companies, or most companies to be honest, will be generating data constantly that they want to model. For example, Amazon recommendations: what to watch next, what to buy next. They want to understand changing patterns so that they can keep updating whatever model they have to get the best recommendations. Optimizing which ads to suggest based on whatever search history you have is another thing being done this way.
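As a small, hedged sketch of what working with such a system looks like, the PySpark Structured Streaming snippet below (the successor to the original DStream-based Spark Streaming module mentioned above) reads from the built-in "rate" test source and keeps a running count; in practice the source would be something like Kafka, and this assumes a local Spark installation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows at a fixed rate,
# standing in for a real high-volume source.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 100)
          .load())

# A running count of events, updated continuously as data arrives.
counts = events.groupBy().count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```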

**Processing High-Volume Data Streams**

The other way to keep up with high-volume streams is to split the computation across machines. Rather than having one computer going through, say, a billion database records, you can have each computer going through its own portion of the stream of data at the same time.
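In miniature, that is the same pattern as splitting work across local worker processes: each worker handles its own partition and the partial results are combined at the end. The sketch below uses Python's multiprocessing purely as an illustration; the per-record work and names are placeholders.

```python
from multiprocessing import Pool

def process_partition(records):
    """Each worker summarises its own slice of the data independently."""
    return sum(r % 7 for r in records)          # stand-in for real per-record work

if __name__ == "__main__":
    all_records = list(range(1_000_000))        # stand-in for a large record set
    n_workers = 4
    # One chunk per worker; a real system would partition the stream itself.
    chunks = [all_records[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial_results = pool.map(process_partition, chunks)
    print(sum(partial_results))                 # combine the partial results
```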

**Conclusion**

The processing of high-volume data streams is an essential task in many industries, including but not limited to finance, healthcare, and e-commerce. The use of decision trees and drift handling methods is crucial for ensuring that models remain accurate and effective over time. By understanding the challenges associated with these tasks, we can develop more efficient algorithms and software to process high-volume data streams, leading to better decision-making and improved outcomes in many areas of life.

"WEBVTTKind: captionsLanguage: enSo now we're going to talk about something that is kind of a specific part of Big DataSo the velocity part huge amounts of data being generated all the time, which essentially is a data streamSo that's a flow of instances so you could have a flow of images coming in have a flowVideo coming in or just a flow of essentially lines to go into a database the thing about the dynamic dataIs that the patterns within it can change so if we've got for example a static machine learning model?That's not going to deal very well with a changing pattern happening in the dataWe build a single model at the start. We use it to make predictions on later data the modelAccuracy can kind of degenerate over time as that data changesThe problem of kind of designing algorithms to deal with this real time dataThere's been a research topic for kind of several years now and there's several real world applications on top of that as wellso if you think aboutBanks trying to detect fraud as patterns change of different forwards occurringThey want their models to kind of be able to update all the time similar for intrusion detection systems and computer networksThey want to be able to updateAnd keep on top of what is happeningIdeally, you would want this to happen automatically so minimum interference from humans, because otherwise they've got to spot when changes are happeningWe just want the machines to be able to do it by themselvesSo if you think about a traditional classification problem on a static batch of dataYou assume you have all of that data there already. You have your training test set and you haveinstances withFeatures which X and then there's some unknownfunction f of X which gives you the class label and you want to find ahypothesis that gives you the best prediction possibleSo what kind of approximates this function as well as possible?So you have a red class and a green class and we have instances that look like this our function f of X may createA class boundary that looks like this. So anything on this side is red. Anything on this side is greenOur model doesn't know that but we use standard machine learning techniques decision trees new or networksWhatever you want and it learns a boundaryThat looks like that and so that will do okay on whatever dates that we haveIt's not effect, but it may get the results that we want. This is static classifications. We already have all our dataSo we've got our data we've done our machine learningThis is the decision boundary that we've learnt. The dotted line is what is actually the boundary this gives. Okay resultsLet's now say that this is happening in a data stream. So we get this data originally and we build this modelBut then later on we have a similar distribution of instance arrivingHowever, what now happens is that some of these instances are now in reality in a different classso the true boundary is now here, but we still have ourModel with this decision boundary and so we're now predicting instances here and here into the wrong class if we use thatExact same model. So what we would see in this case inCentage accuracy over time you would see at this change pointAccuracy would plummet. So this problem here is called real concept drift. 
What is effectively happened hereis that this function the unknown function has changed but we've kept our hypothesis our machine learning model exactly the same and soIt starts to perform badlywe can also have a similar problem called virtual drift and what would happen in this case isthat theTarget decision boundary has stayed the same from this originalBut the instances we now see in the stream are somewhere else in the feature space. Let's say we now seedatalike this so though theKind of optimal decision boundary is in exactly the same place. We now have different data. That means that are predicted boundaryIt's going to give this instance as wrong because we haven't got a way of incorporatinginformation from this instance into the original model that we built both of these will create this decrease in accuracy so we can alsoLook at the drift in the data streams in terms of the speed they happen so something that would give us an accuracy plot thatLooks like this is called sudden drift we go from straight from one concept in the data streamSo one decision boundary straight to another one another possible thing that could happenIs that our accuracy looks like this?So rather than this sudden switch this decision boundary gradually shifts save me your life if we're looking at a very very oversimplifiedIntrusion detection system. We have only two features that we're looking at in the original datasetanything with these features, this is asecurityProblem and intrusion anything on this side is good in this caseWhat happens is that suddenly there's a new way of attacking the network and so suddenlyWhat was here is now not good. So we see those patterns and we say okNo, that counts as an intrusion in this casewhat it means is that we see something that we've not seen before so the model hasn't been trained with any similar data andSo it could get it, right it could fall somewhere up here and we correctly say this is badbut it could also fall in an area that we didn't learn the decision boundary so well, soYeah, we get that prediction wrong. We just looked at what?The problems are with using a single static model when we're dealing with incoming dataOver time the distribution changes and we start to see a decrease in accuracy on whatever model we builtSo what happens in kind of a stream machine learning algorithm would be so first of allYou've got X arriving. This is your instance in our previous example, this would just have two values associated with itWhat would first happen is we make a prediction? So in the classification example, we classify this. YesIt's an intrusion. No, it's not intrusion using the current model that we have then what happens is we update whatever model we haveusing information from X and we'll talk about some of the ways that this is done in a second andOne of the kind of caveats with stream machine learning is that you need for this to happen you?need to haveThe real class label if you're doing classificationSo in order to incorporate information from this instance into whatever model you've got you need to have that label there now in some casesIt's very easy to say we've seen this data. This is what it's classified usAnd we do that immediately if we're thinking aboutMaking weather predictions we can almost immediately say yes. 
This is what the weather is like it may be a day's delayBut yeah, we can that's pretty immediate thing four things for example for detectionYou may see a pattern of datayou mayPredict it is not being fought and then suddenly two days later this person figures out that actually there's something wrong with their bank accountsThey phone up and it does turn out to be fraudAnd so we'd only have the label for that data after that has happenedThe final bit is to update the modelAt this point and so the goal of updating the model over time is so that rather than having a performance plotThat looks like this so we go from 95s and accuracy down to 20% accuracyWe instead end up with something that okayWe may drift a little bit here and have a tiny performance decreaseBut the model should very quickly recover back to the original level and we still have a high performanceSo that's the goal of this model update. There's various approaches we can take so the first one is explicit drift handlingwhich means that we first of all detect when a drift happens in the data streamSo to do thatWe have drift detection methods and these are usually statistical tests that look at some aspects of the data arrivingSo if the distribution of the data we see arriving and the distribution of the classes we see is changingIf morph like that as a drift some of these we'll also look at the performance accuracy of the classifierSo if the classifier performance suddenly drops we can say well, we've probably got a drift hereWe need to do something to the model to mitigate thisWho spots that though? Is it, you know, is there an algorithm that actually spots that something's different to what it should beYes, so there are various statistical tests that will do thisThat will kind of just measure things like the mean of the data arriving and be able to spot things that have changed basicallySo yeah, once we detected that a drift has happenedWe then want to take some action. The first thing that we could do is we could do a complete replacement of the modelso we get rid of whatever model we had before andweWe have taken chunk of recent dataAnd we retrain the model on that and continue using that for predictions until we've hit another driftThis is okay. But it means that we could be getting rid of some information in the previous modelThat is maybe still going to be useful in the futureso then there are also methods that we'll look at specific parts of the model and say okay this specific part of it isCausing a performance decrease. So let's get rid of this we can thenLearn from new instances something to replace this that will do it better basicallyso if you think of a decision treeIf you can detect that there are certain branches in that decision tree that are no longerMaking good predictions you can get rid of them and we grow the tree to perform better prune it. Yeah, exactlyIt is called pruning. You prune. Yeah, you prune the branches off the treeThere are no longer performing as you want them to the alternative to explicit handling is to do implicit drift handlingSo rather than looking at the data or looking at the performance and saying something has changed we need to take actionWe're just continually taking action. There are various approaches to implicit drift handlingSo the first and probably most simple one is to use a sliding windowSo if we imagine we have the data stream with instances arriving like thisWe could say we have a sliding window of three instances and we learn a model off of them. 
We thenTake the next three learn a model off of them. So as each instance arrives we get rid of the oldest instanceAnd this makes the assumption that the oldest instances are the least relevant. This is usually the caseIt's kind of a valid assumption to make so this performsOkaythe problem with this though is that it kind of provides a crisp cut off points everyInstance within this window is treated with exactly the sameKind of impacts on the classifier. They were weighted the same so we can introduce instance weightingSo that older instances will have a lower weight their impact on the classifier will be lessSo again, the more recent instances will be have the largest impact on the current modeland then again these algorithms that we'll use instance weighting will usually haveSome threshold. So once the weight gets below a certain point they say that's the instance goneWe delete it presumably the windows can be larger or smallerYes, so setting the window size is a pretty important parameterif you have a window, that is too large thenOkay, you're getting a lot of data to construct your model from which is good and cents between learning more data usually goodWhat it also means is that if there's very short-term driftsSo this drift happens and then we don't learn from that drift if that makes sense because we see that all as oneChunk of the data againIf you didn't set the window to be too small we can react very well to very short-term drifts in the streamBut you then have a very limited amount of data to work on to construct the modelSo there are methods that will automatically adjust the window size. So during times of drift the window size will get smallerso we want to be very rapidly changing the model and then during times when everything is kind of very stable theWindow will grow to be as large as possible so that we canUse as much data to construct this model as possibleSo the problem weird sliding windows and instance weighting is that you need all of those instances available to construct the modelContinuously. So every time you add a new instance and delete another one you need to reconstruct that model andSo the way we can get around this is by using single pass algorithmsSo we see each instance once use it to update the model and then get rid of that instanceIt's probably still in long-term permanent storage, but in terms of what is being accessed to construct this algorithmIt's gone now in that respect then you've got information out of the instance, but you don't need the instance itself. Yeah, exactlySo we see the instance we incorporate what we can from it into the current modelWe get rid of it and that instances impact is still in the model an example would be a decision treeSo decision trees are kind of constructed by splitting nodes where we're going to get a lot of information gainedfrom making a split on a certain attributeSo as the data stream changes the information gained that we might get and some of these nodes may changeSo if we say get a new instance and it will say okayNow this actually makes this a split worth makingWe can make that split continue growing the tree and then that instance can go we don't need it anymoreBut we still have the information from it in our modelSo we've got our implicit and explicit drift handling appro. You can also have hybrids approachesSo the explicit drift handling is very good at spotting sudden drift. 
So anytime there's a sudden changeThere'll be a sudden drop in performance that's very easy to pick up on with a simple statistical testBut when we then add in the implicit drift handling on top of thatIt means that we can also deal very well with gradual driftSo gradual drift is a bit more difficult to identifySimply because if you look at the previous instance or like say that 10 previous instancesWith a gradual drift, you're not going to see a significant changeSo it's a lot harder to detect by combining the implicit and explicitDrift timing methods we end up with a performance plot. That would look something like thisWe maintain pretty good performance for the entire duration of the data that's arriving the problems of a changing data distributionAnd not the only problems with streamsandso if you can imagine a very high volume stream andhigh-speed got a lot of data arriving in a very short amount of time ifYou take a single instance of that data stream and it takes you like five seconds to process itBut in that 5 seconds, you've had 10 more instances arrive. You're going to get a battery of instances very very quicklySo you need to be the model update stage needs to be very quick to avoid getting any backlog. The second problem is that with?These algorithms we're not going to have the entire history of the stream availableTo create the current modelso the models need to beFor example the single path algorithms that can say we don't need the historical data that we have the information we need from itBut we don't need to access theseBecause otherwise, you just end up with huge huge data setsHaving to be used to create these models all the timeAnd again these streams of potentially infiniteWe don't know when they're going to end and we don't know how much data they're going to end up containingMost of the kind of and well-known machine learning algorithms have been adapted in various ways to be suitable for streamsSo they now include update mechanisms. So they're more dynamic methods. So this includes but decision trees neural networksK nearest neighbors. There's also clustering algorithms have also been adapted. So basically any classic algorithm you can think of there'sMultiple streaming versions of it now. So if you are interested in these streaming algorithmsThere's a few bits of software that you could look atfor example, there's theMower suite of algorithms which interfaces with the worker data mining tool kitThis is free to download and use and includes implementations of a lot of popular streaming algorithms it alsoIncludes ways to synthesize data streams so generate essentially a stream of dataThat you can then run the algorithms onand you can control the amount of drift that you get how certain it is and things like that andthat's quite good to play around with to see the effects thatDifferent kinds of drift can have on accuracy in terms of big data streamsSpecifically there's software such as the spark streaming module for Apache sparkwellThere's also the more recent Apache flink that are designed to process very high volume data streams very quicklyyou just mentioned some yourself where people can download and have a play with but I mean in the real world as an industry andWebsites and things that services that we use every dayHe was using these streaming algorithms. And so a lot of the big companies or most companies to be honest will be generating dataConstantly that they want to model. 
So for exampleAmazon recommendations like what to watch next what to buy next they want toUnderstand changing patterns so that they can keep updatingWhatever model they have to get the bestrecommendations againoptimizing ads to suggest based onwhateverSearching history you have that's another thing that is being done via this. So yeah, there are a lot of real-world applications for this stuffNow I've got the token so I can load a value in add the value emerged or into it and store it back and handAnd now I've got the token againI can load something into its my register you and do the computation split across those machinesSo rather than having one computer going through I don't know a billion database records. You can have each computer going through\n"