What is Big Data - Computerphile

**How Big Data Is Processed**

There are different ways to process big data, and understanding these methods is crucial for extracting value from large datasets.

**Batch Processing**

One common method of processing big data is batch processing. This involves collecting all the available data, processing it in one pass, and producing results at the end. The drawback is latency: results only appear once the entire batch has been processed, which makes the approach unsuitable for real-time workloads. If we're dealing with high-velocity data, such as sensor readings from vehicles on the road, we need to process the data as it arrives rather than waiting for a batch to accumulate.
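
As a concrete illustration, here is a minimal plain-Python sketch of a batch job: it reads every record collected so far in one pass and only then produces results. The CSV file name and its `vehicle_id`/`speed` columns are hypothetical.

```python
import csv
from statistics import mean

def run_batch_job(path: str) -> dict:
    """Compute the average speed per vehicle over the whole batch.

    Results only become available once every record has been read,
    which is the defining trade-off of batch processing.
    """
    speeds: dict[str, list[float]] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            speeds.setdefault(row["vehicle_id"], []).append(float(row["speed"]))
    return {vehicle: mean(values) for vehicle, values in speeds.items()}

if __name__ == "__main__":
    print(run_batch_job("sensor_readings.csv"))  # hypothetical input file
```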

**Real-Time Processing**

Another approach is real-time processing: each data item is processed as it arrives, rather than being held back until a batch has been collected. This suits applications where data flows in continuously, such as sensor readings from vehicles on the road, because results are updated immediately and we can respond quickly to changes.
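
A minimal sketch of the incremental idea behind real-time processing, assuming a hypothetical stream of speed readings: each item updates a running statistic as it arrives, so no batch ever needs to be re-scanned.

```python
class RunningAverage:
    """Update a statistic incrementally, one reading at a time."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def update(self, value: float) -> float:
        self.count += 1
        self.total += value
        return self.total / self.count  # result is available immediately

# Hypothetical speed readings arriving one at a time from a sensor feed.
avg = RunningAverage()
for reading in (54.0, 61.5, 58.2):
    print(f"reading={reading:.1f}  running average={avg.update(reading):.1f}")
```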

**Pre-Processing**

Before processing big data, it's often necessary to pre-process the data. This involves taking raw, unstructured data and transforming it into a format that can be used for analysis. For example, if we're dealing with sensor readings from vehicles on the road, we may need to convert the data into a more usable format before we can process it.
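
For example, a pre-processing step might parse a raw sensor payload into a flat, typed record. The sketch below assumes a hypothetical JSON payload; the field names are illustrative only.

```python
import json
from datetime import datetime, timezone

def preprocess(raw_line: str) -> dict:
    """Turn a raw sensor payload into a structured record ready for analysis."""
    payload = json.loads(raw_line)
    return {
        "vehicle_id": payload["id"],
        "timestamp": datetime.fromtimestamp(payload["ts"], tz=timezone.utc).isoformat(),
        "speed_kmh": float(payload.get("speed", 0.0)),
        "heading_deg": float(payload.get("heading", 0.0)),
    }

raw = '{"id": "lorry-42", "ts": 1700000000, "speed": "63.5", "heading": 270}'
print(preprocess(raw))
```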

**Removing Noise and Outliers**

One of the challenges of processing big data is dealing with noise and outliers in the data. Noise refers to irrelevant or meaningless data that can throw off our analysis, while outliers are data points that are significantly different from the rest of the data. By removing these elements, we can get a clearer picture of what's going on in the data.
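
One simple way to filter such values is a robust distance-from-the-median rule. The sketch below uses the median absolute deviation (MAD); this is just one possible rule, and real pipelines may prefer domain-specific thresholds.

```python
from statistics import median

def remove_outliers(values: list[float], k: float = 3.0) -> list[float]:
    """Drop readings that sit more than k MADs away from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return values  # no spread to measure against
    return [v for v in values if abs(v - med) / mad <= k]

speeds = [55.0, 57.5, 56.2, 380.0, 54.9]  # 380.0 looks like a faulty reading
print(remove_outliers(speeds))  # the implausible value is filtered out
```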

**Data Streaming**

Another important aspect of big data processing is data streaming. This involves dealing with high-velocity data streams, where the data comes in continuously and needs to be processed immediately. Data streaming technologies are designed to handle large volumes of data in real-time, making it possible to extract value from the data as it arrives.
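
As a sketch of what this can look like in practice, the snippet below uses Spark Structured Streaming to read lines from a local socket and keep a continuously updated average speed per vehicle. The host, port, and `vehicle_id,speed` line format are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

# Read an unbounded stream of text lines (e.g. fed by `nc -lk 9999`).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Assume each incoming line has the form "<vehicle_id>,<speed>".
readings = lines.select(
    F.split("value", ",").getItem(0).alias("vehicle_id"),
    F.split("value", ",").getItem(1).cast("double").alias("speed"),
)

# The aggregation is updated incrementally as new readings arrive.
avg_speed = readings.groupBy("vehicle_id").agg(F.avg("speed").alias("avg_speed"))

query = (avg_speed.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```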

**Frameworks for Big Data Processing**

There are many frameworks available for big data processing. Apache Hadoop provides distributed storage (HDFS) and the MapReduce programming model, Apache Spark provides a general-purpose distributed computing engine, and newer tools such as Apache Flink focus on stream processing. These frameworks provide tools and libraries for handling various aspects of big data processing, such as ingestion, pre-processing, batch computation, and real-time streaming.
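
For instance, the earlier per-vehicle averaging could be expressed against Spark's DataFrame API, letting the framework distribute the work. A minimal sketch, again assuming a hypothetical CSV with `vehicle_id` and `speed` columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("FleetBatch").getOrCreate()

# Spark partitions the file and schedules the work across the cluster for us.
readings = spark.read.csv("sensor_readings.csv", header=True, inferSchema=True)

avg_speed = (readings
             .groupBy("vehicle_id")
             .agg(F.avg("speed").alias("avg_speed")))

avg_speed.show()
spark.stop()
```

The logic is the same as the plain-Python batch job above; what changes is that the engine, not our code, decides how the data and computation are split across machines.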

**Applying Big Data to Real-World Scenarios**

Big data processing has many applications in the real world. For example, in finance, it can be used to analyze large volumes of transaction data to detect patterns and identify potential fraud. In logistics, it can be used to track shipments and optimize delivery routes. The possibilities are endless, and big data processing is becoming increasingly important as more industries move towards data-driven decision making.

**The Importance of Distributed Computing**

Big data processing often requires distributed computing, where the computation is spread across multiple machines, or nodes, in a cluster. Each node stores and processes a partition of the data, so large volumes are handled in parallel rather than on one ever-bigger machine. Distributed computing also makes it straightforward to scale: when demand grows, more commodity machines are added to the cluster instead of buying a more powerful single computer.
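
The split-process-combine idea can be sketched on a single machine, with Python's `multiprocessing` module standing in for a cluster: each worker handles its own partition independently, and only the small partial results are combined at the end. A real cluster framework additionally handles storage, scheduling, and fault tolerance.

```python
from multiprocessing import Pool

def partial_sum_and_count(chunk: list[float]) -> tuple[float, int]:
    """Work that each worker performs independently on its own partition."""
    return sum(chunk), len(chunk)

if __name__ == "__main__":
    # Hypothetical readings, split into four partitions as a cluster would.
    readings = [float(x) for x in range(1, 10_001)]
    partitions = [readings[i::4] for i in range(4)]

    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum_and_count, partitions)   # "map" phase

    total = sum(s for s, _ in partials)                          # "reduce" phase
    count = sum(c for _, c in partials)
    print(f"overall average: {total / count:.2f}")
```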

**The Role of Machine Learning**

Machine learning plays a critical role in big data processing. By analyzing large datasets, we can identify patterns and trends that would be impossible to see by hand. Machine learning algorithms can also help us make predictions and identify potential issues before they become problems. In applications such as fraud detection and customer segmentation, machine learning is essential for extracting value from the data.
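
As a toy illustration of the fraud-detection use case, the sketch below fits an isolation forest (via scikit-learn) to synthetic transaction features and flags the most anomalous ones for review; the data and thresholds are entirely illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic features: (amount, hours from the customer's usual spending time).
normal = rng.normal(loc=[50.0, 0.0], scale=[20.0, 1.0], size=(1000, 2))
suspicious = np.array([[5000.0, 6.0], [4200.0, -5.0]])  # planted anomalies
transactions = np.vstack([normal, suspicious])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(transactions)   # -1 = flagged as anomalous, 1 = normal

flagged = transactions[labels == -1]
print(f"flagged {len(flagged)} of {len(transactions)} transactions for review")
```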

**Real-World Examples of Big Data Processing**

There are many real-world examples of big data processing, including fleet management systems that use sensor readings to track vehicles on the road; these systems can be used to optimize routes, reduce fuel consumption, and improve safety. Financial institutions, as noted above, analyze large volumes of transactions to spot patterns and flag potential fraud.

**Conclusion**

In conclusion, big data processing involves a range of techniques and technologies for handling large volumes of data. From batch processing to real-time processing, and from pre-processing to machine learning, there are many ways to extract value from big data. By understanding these methods and applying them in the right context, we can unlock the full potential of big data and make informed decisions in a rapidly changing world.

"WEBVTTKind: captionsLanguage: enToday we're going to be talking about big data. How big is big?soWell, first of all, there is no precise definition as a rule. So kind of be standard what people would say isWhen we can no longer reasonably deal with the data using traditional methodsSo that we kind of think what's a traditional method? Well, it might be can we process the data on a single computer?Can we store the data on a single computer? And if we can't then we're probably dealingWith big data, so you need to have new methods in order to be able to handle and process this dataAs computers getting faster and bigger capacities and more memory and things that the concept of what becomes big is is changing, right?So kind of but a lot of it isn't really as I'll talk about later isn't howMuch power you can get in a single computerIt's more how we can use multiple computers to split the data up process everything and then throw it back like in the MapReduce frameworkThen we talked about the for in with big dataThere's something called the five es which kind of defines some features and problems that are common amongst any Big Data thingsWe have the five es and the first three that were defined. I think these were defined in 2001So that's kind of how having talked about four. So first of all, we've got some volume. So this is the most obvious oneIt's just simply how large the dataset it's the second one isvelocitySo a lot of the time these days huge amounts of data are being generated in a very short amount of timeSo you think of how much data Facebook is generating people liking stuff people uploading content that's happening constantlyAll throughout the day the amount of data they generate every dayIt's just huge basically so they need to process that in real timeAnd the third one is varietyTraditionally the data we would have and we would store it in a traditional single database. It would be in a very structured formatSo you've got columns and rows everywhere. He would have values for the columns these daysWe've got data coming in in a lot of different formatsSo as well as the traditional kind of structured data, we have unstructured dataSo you've got stuff coming like web dream cliques, we've got like social media likes coming inWe've got stuff like images and audio and videoSo we need to be able to handle all these different types of data and extract what we need from themand the first one isvalueYeah, so there's no point in us collecting huge amounts of data and then doing nothing with itSo we want to know what we want to obtain from the data and then think of ways to go about thatSo something some form of value could just be getting humans to understand what is happeningIn that data. So for example if you have a fleet of lorriesThey will all have telematics sensors in that we collecting sensor data of what the lawyers are doingSo it's of a lot of value to the fleet manager to then be able to easilyVisualize huge amounts of data coming in and see what it's happening. 
So as well as processing and storing this stuffWe also want to be able to visualize it and show it humans in an easily understandable formatOh, the value stuff is just finding patterns machine learning algorithms from all of this datasee then the fifth and final one isVeracity this is basically how trustworthy the data is how reliable it isSo we've got data coming in from a lot of different sourcesSo is it being generated with statistical bias?Are there missing values if we use think for example the sensor data, we need to realize that maybe the sensors are faultyThey're giving slightly off readingsSo it's important to understand how?Reliable the data we're looking at is and so these are kind of the fiveStandard features of Big Data some people try and add more. There's another seven V's a big data at the 10 meter producerI see. I'm sure we will keep going up and upThey are doing things like don't like vulnerability. SoObviously when we're storing a lot of data a lot of that is quite personal dataSo making sure that's secure but these are the kind of the five main onesThe first thing the big big data obviously is just the sheer volumeSo one way of dealing with this is to split the data across multiple computersSo you could think okay. So we've got too much data to fit on one machine. We'll just get a more powerful computerWe'll get more CPU power. We'll get larger memorythat very quickly becomes quite difficult to manage because every time you need toScale it up again because you've got even more data you to buy computer or new hardwareSo what tends to happen instead and all like they see all companies or just have like a cluster of computers?So rather than a single machineThey'll have say a massive mean warehousebasicallyIf you wind loads and loads and loads of computers and what this means that we can do is we can do distributed storageso each of those machines will store a portion of the data and then we can alsoDo the computation split across those machines rather than having one computer going through?I know a billion database records you can have each computer going through a thousand of those database recordsLet me take a really naive way of saying right. Ok, let's do it. Alphabetically, I'll load more records. Come in for say ZedThat's easy. Stick it on the end load more records coming for P. This Y in the middle, right? How do you manage that?and so there'sComputing frameworks that will help with thisSo for example, if you're storing data industry to fashion than this the Hadoop distributed file systemAnd that will manage kind of the cluster resources where the files are stored and those frameworks will also provide fault tolerance and reliabilitySo if one of the nose goes down, then it you've not lost that data. There will have been some replication across other nodesSo that yeah losing a single node isn't going to cause you a lot of problemsAnd what using a cluster also allows you to do is whenever you want to scale it upAll you do is just add more computers into the network and you're done and you can get by onrelatively cheapHardware rather than having to keep buying a new supercomputer in a big dataSystem there tends to be a pretty standard workflowso the first thing you would want to do is have a measure toIngest the data remember, we've got a huge variety of data coming in. It's all coming in from different sourcesSo we need a way to kind of aggregators and move it on to further down the pipelineSo there's some frameworks for this. 
There's an Apache Capra and like Apache flume for example and loads and loads of others as wellSo basically aggregate all the data push it on to the rest of the systemso then the second thing that you probably want to do isStore that data so like we just spoke about the distributed file systemyou store is in a distributed manner across the cluster then you want toProcess this data and you may skip out storage entirelySo in some cases you may not want to store your dataYou just want to process it use it to updateSome machine learning model somewhere and then discard it and we don't care about long-term storageSo you're processing the data again do it in disputed fashion using frameworks such as MapReduce or Apache sparkDesigning the algorithms to do that processing requires a little bit more thought than maybe doing a traditional algorithm with the frameworksWe'll hide some of it but you need to be thinking that even if we're doing it through a frameworkWe've still got data on different computers if we need to share messages between these computers during the computationIt becomes quite expensive if we keep moving a lot of data across the networkSo it's designing algorithms that limit data movement around and it's the principle of data localitySo you want to keep the computation close to the data?Don't move the data aroundSometimes it's unavoidable, but we limit it. So the other thing about processing is that there's different ways of doing itThere's batch processingSo you already have all of your data or whatever you protected so farYou take all of that data across the cluster you process all of that get your results and you're doneThe other thing we can do is real-time processing. So again because the velocity of the data is coming inWe don't want to constantly have to take all the day to DetectiveWell produce it get results and then we've got a ton more dataI want to do the same get all the data bring it back process all of itSo instead we wouldDo real-time processing so as each data item arrives?We process that we don't have to look at all the data we've got so far. We just incrementally process everythingAnd that's coming up in another video when we talk about data streamingSo the other thing that you might want to do before processing is something called pre-processing remember I talked about unstructured dataSo maybe getting that data into a format that we specifically can use for the purpose we want to soThat would be a stage in the pipeline before processing the other thing with huge amounts of dataThere's likely to be a lot of noise a lot of outliers so we can remove thoseWe can also remove one instances, so if you think we're getting a ton of instances in and we want them she learning algorithmThere'll be a lot of instances that are very very similar see an instance is say in a databaseIt's like a single line in the database. So for HTV sensor reading it would be everything for thatLorry at that point in time CS speed directions traveling reducing. 
The number of instances is about reducing the granularityso part of it is sayingif we store a rather than storing data for aContinuous period of time so every minute for an hour if those states are very similar across that we can just say okay for thisperiod this is what happens and put it in a single line or we could say for example a machine learning algorithm if there'sInstances with very very similar features and then a very very similar classWe can take a single one of those instances and that will suitably representAll of those instances so we can very very quickly reduce a huge data set down to a much smaller oneBy saying there's a lot of redundancy here and we don't need a hundred very similar instancesWhen we one would do just as wellSo if you've got a hundredInstances and you reduce it down to one is does not have an impact on how important those instances are in the scheme of thingsYes, so techniquesThat deal with this stuff. Some of them would just purely say okay now this is a single instance andThat's all you ever know others of them wouldHave yet have a waiting?So some way of saying this is a more important one because it's very similar to 100 others that we got rid of this one'sreally not as important because there are least three others that were similar to it so we can wait instances to kind of reflect theirImportance. There are specific frameworks with big data streaming as wellso there's technologies such as the spark streaming module' for apache' spark or there's newer ones such asApache plink that can be used to do that. So they kind of abstracts away from thestreaming aspects of it so you can focusJust in what you want to do a little thinking all this data is coming through very fast, obviouslyMy limited brain is thinking streaming relates to video. But you're talking about just data that is happening in real time. Is that right?yes, soGoing back to the Lori's as they're driving down the motorway. They may be sending out a sense of read everyminute or so andThat since the reading goes back we get all the sense readings from all the lorries coming in as a data streamso it's kind of a very quick roundup of the basics of Big Data and there's a lot of applications this obviously soThanks, we'll have huge volumes of transaction data that you can extract patterns of value from that and see what is normal they can doKind of fraud detection on that again. The previous example of fleet managers understanding what is going onbasically any industry will now have ways of being able to extract value from the data that they have so in the next video we'reGoing to talk about data stream processing and more about how we actually deal with the problems that we all time data can presentersover very very large BIOSThis kind of computation is a lot more efficient if you can distribute at because doing this map phase of saying, okayThis is one occurrence. The letter A that's independent of anything else and see mostInterested in you're probably only interested when a button is pressed or so on the only times positive\n"