**The Challenges of Big Data**
Big data is often touted as a solution to complex problems, but it also presents significant challenges. A common myth is that a huge dataset is no problem: that more data is automatically easier to work with and that statistical methods can handle whatever they are given. In practice, big data usually means that a lot of values are missing and a lot of the data is messy, and the problem grows with the size of the dataset rather than shrinking.
**Dealing with Missing Data**
One of the biggest challenges of big data is missing data: records that are incomplete or inaccurate. Classical statistical methods often cannot cope with missing values, because they come with some very fundamental requirements about the data they are given. Computer science offers a more flexible approach: you write code that cleans and preprocesses the dataset before analysis. Statistics is powerful when its requirements are met; when they are not, programming fills the gap.
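As a minimal sketch of what "a bit of code that deals with missing data" might look like, the Python below shows two simple strategies: dropping incomplete records and filling gaps with the column mean. The function names, column names, and sample records are hypothetical and chosen only for illustration; neither strategy is claimed to be the right one for any particular dataset, and mean imputation in particular can distort relationships between variables.

```python
from statistics import mean


def drop_incomplete(rows):
    """Listwise deletion: keep only rows that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def impute_mean(rows, column):
    """Return new rows where None in `column` is replaced by the observed mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        raise ValueError(f"no observed values in column {column!r}")
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical records; None marks a missing value.
records = [
    {"hours_studied": 2.0, "grade": 55.0},
    {"hours_studied": None, "grade": 70.0},
    {"hours_studied": 6.0, "grade": 80.0},
]

complete_only = drop_incomplete(records)        # two rows survive
filled = impute_mean(records, "hours_studied")  # the gap becomes 4.0
```

Dropping rows is the safest choice when few values are missing; imputation keeps more data but invents values, so it is worth recording which entries were filled in.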
**The Limitations of Correlation**
Another common misconception is that correlation implies causation. It does not: an apparent relationship between two variables does not mean that one causes the other. For example, if patients taking a certain drug are more likely to have a heart attack, that alone is not evidence that the drug caused the heart attack. Some underlying factor may contribute to both the use of the drug and the likelihood of a heart attack.
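The drug example can be made concrete with a small sketch. The counts below are entirely synthetic and were constructed so that a hidden factor (here a hypothetical "risk group") drives both who gets the drug and who has a heart attack. Overall, patients on the drug have a much higher heart attack rate, yet within each risk group the rate is identical with and without the drug.

```python
from fractions import Fraction

# Synthetic counts, invented for illustration only:
# (risk_group, on_drug) -> (number of patients, number of heart attacks)
COUNTS = {
    ("low", False): (100, 5),
    ("low", True): (20, 1),
    ("high", False): (20, 6),
    ("high", True): (100, 30),
}


def attack_rate(on_drug, group=None):
    """Heart attack rate for a treatment arm, optionally within one risk group."""
    patients = attacks = 0
    for (g, d), (n, k) in COUNTS.items():
        if d == on_drug and (group is None or g == group):
            patients += n
            attacks += k
    return Fraction(attacks, patients)


overall_drug = attack_rate(True)      # 31/120: looks alarming
overall_no_drug = attack_rate(False)  # 11/120

# Within each risk group the drug makes no difference at all.
same_in_low = attack_rate(True, "low") == attack_rate(False, "low")
same_in_high = attack_rate(True, "high") == attack_rate(False, "high")
```

In this constructed case the whole association comes from high-risk patients being more likely to receive the drug. Real data is rarely this clean, but the sketch shows why comparing raw rates is not enough.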
**The Importance of Context**
Understanding context is crucial when working with big data. This means considering the source of the data, the population being studied, and any potential biases or flaws in the methodology. In medical research, for example, it is essential to ask whether the study was conducted on a representative sample of the population rather than on a small or unusual group.
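**Confounding and Other Pitfalls**
One of the biggest pitfalls when working with big data is confounding. Confounding occurs when an underlying factor influences both the variable being studied and the outcome of interest, so that the two appear related even if neither causes the other. The link between salt and blood pressure is a well-known example. A 1988 analysis in the British Medical Journal compared salt in urine with blood pressure across populations in different countries and found a slight positive correlation, which was taken as evidence that more salt leads to higher blood pressure. However, the positive trend depended on four outlying points from non-industrialized populations where salt is not a common part of the diet; with those four points removed, the trend turned negative. Later work suggests there probably is a link between salt and blood pressure, but it is less obvious than that analysis implied.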
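One reason a correlation like the one between salt and blood pressure can mislead is that a handful of unusual points can decide the sign of a fitted trend. The sketch below uses made-up numbers in arbitrary units (not the data from any study) and an ordinary least-squares slope written out by hand; the variable names are illustrative assumptions.

```python
def ols_slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Synthetic (salt, blood_pressure) pairs in arbitrary units.
main_cluster = [(8, 122), (9, 121), (10, 120), (11, 119), (12, 118)]
outliers = [(1, 100), (1, 102), (2, 101), (2, 99)]

slope_all = ols_slope(main_cluster + outliers)  # positive: driven by the outliers
slope_main = ols_slope(main_cluster)            # negative once they are removed
```

Whether such points are noise to discard or the most interesting part of the data is a judgment about context, not something the fit can decide on its own.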
**The Power of Data Compression**
Finally, a brief note on compression. JPEG works by downsampling the chroma (color) components of an image, along with other techniques, and the majority of video compression methods do the same. This reduces storage space, but because the discarded detail cannot be recovered, it is important to make sure the compressed data is still accurate and reliable enough for the analysis it feeds.
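One concrete technique used by JPEG and many video codecs is chroma downsampling: the color planes are stored at lower resolution than the brightness plane. The sketch below averages each 2x2 block of a single chroma plane, which is one simple way to halve its resolution in both directions. It is an illustration under stated assumptions (plain lists of numbers, even dimensions, simple averaging), not the filter any particular codec uses, and the function and variable names are hypothetical.

```python
def downsample_2x2(plane):
    """Halve a chroma plane in both directions by averaging each 2x2 block.

    `plane` is a list of equal-length rows of numbers; this sketch assumes
    both dimensions are even.
    """
    height, width = len(plane), len(plane[0])
    if height % 2 or width % 2:
        raise ValueError("this sketch assumes even dimensions")
    return [
        [
            (plane[r][c] + plane[r][c + 1]
             + plane[r + 1][c] + plane[r + 1][c + 1]) / 4
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


# A made-up 4x4 chroma plane; four samples are stored instead of sixteen.
cb_plane = [
    [100, 102, 50, 50],
    [98, 100, 50, 50],
    [0, 0, 200, 204],
    [0, 0, 196, 200],
]
cb_small = downsample_2x2(cb_plane)
```

The small variations inside each block are gone after this step, which is exactly the kind of loss that matters if an analysis later depends on fine color detail.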
**Conclusion**
Big data presents many challenges, from dealing with missing data to avoiding the pitfalls of confounding. Addressing them takes a combination of statistical methods, computer science expertise, and a careful understanding of context and interpretation. Being aware of these limitations lets us use big data more effectively and responsibly.
**Transcript**
Data mining is when you have a lot of information in a lot of raw data and you want to get the nuggets of information out of it, hence the word "mining": the gold in the data. It usually starts with people saying, "We've got loads of data. Can we make some money from this? Is there something interesting in there that we haven't found ourselves yet? You tell us, you're the expert." That's how it usually starts. It might be big companies, or medical people, doctors and hospitals, who have lots of data. Actually, the first step is usually asking what you want to know from the data, because people aren't always that clear about what they're after.
This is where artificial intelligence comes in. A lot of the work that we do is actually applied artificial intelligence. You get your data, and then how do you get a pattern out of it? It involves algorithms, it involves programming, it may be some mathematical or statistical systems that you want to design, or it might be more artificial intelligence, using something like evolutionary computation or machine learning in the broadest terms.
For example, sometimes you get your data and it just looks like this: this is how long you studied, this is the grade you get on the exam. Really obvious, a clear correlation. This is not very difficult, and we can use some simple statistics. That's fine, that's easy. Often, though, the data you get doesn't quite look like this. It might be more messy, like this, and even in that case you could probably still do some statistics and say, yes, there's probably some relationship, it looks a bit like this. But it's getting a bit more ambiguous now, and it's not so obvious anymore what it is. Statistics may work, or it may not work anymore.
Where it gets really interesting is when the data starts looking something like this. I'm exaggerating a little bit now, but you do statistics and it comes back with "there is no relationship", something like a zero relationship between things. And obviously you and I look at it and say, "I don't know what's going on here. Here's a football, and there's clearly something in this data. Something is going on here that we don't understand." That's where the artificial intelligence comes in. As experts we look at this and say there is something here. How can we teach the computer to get it out? There is a shape. How do we get that shape out?
And then of course the real problem is that, if it just looked like this, it would be easy to teach a computer. In reality the data is so messy (I can't do enough dots on here) that it looks something like this, and somewhere in here there is this little pattern, and maybe it repeats in a few places, but it gets really hard to see now. How do you teach your computer to get this pattern out? And remember, you don't know what the pattern looks like. You're looking for the unknown.
If you know what you're looking for, it would be easy. That's what Google does. Actually, Google has an easy job, because they have a huge database; it's just a lookup table. We don't have that. We don't actually know what the thing looks like, so you need to find something when you don't know what it looks like. It's hiding somewhere in there, or maybe it isn't. That's the hard way we do it; that's data mining, really.
From this stage there are lots of different steps. The first thing we usually do when we get data that doesn't look as simple as the original example is something called preprocessing of the data. We put it in a nicer shape. Maybe we bring it onto a certain scale. We actually try to plot the data, just looking at it, plotting it one way, plotting it another way, in two dimensions, three dimensions. Maybe we can eliminate some of the variables because they turn out to be irrelevant, or because lots of information is missing so we can't use them. Sometimes the data is not numbers either; it might be text, it might be pictures, and then how do we put that in? So there's all this messing around in the beginning with the data to bring it into some sort of shape so that we can look at it more easily.
So a lot of it is eliminating background noise, as it were?
Yes, and of course the problem with that is: is it really noise, or is the noisy bit maybe the really interesting bit in the data? When you get your data, lots of the data is like this, and then you have one point over there. Now the question is: maybe that's where your sensor failed, maybe the person didn't fill in the questionnaire properly, or maybe that's actually the one data point that's really interesting, where you can make your money. That's part of the question. Is this an outlier, which is something we need to get rid of, or is it actually the one interesting pattern that makes you the money, or [unclear], or whatever the question is that you're looking at?
Is this statistics, then?
It would be statistics if, going back to my picture, the data looked like this. Then we can do things which are quite simple, and even in machine learning we still do some statistics. But statistics on its own is not enough. It has too many limitations, so we need to go beyond statistics. That's where computer programming is really good, because it's more flexible than statistics. We can deal with text, we can deal with pictures. Doing statistics on pictures doesn't work, and you can't really do statistics on text, but we can do pattern mining on an alphabet, we can do pattern mining on a picture, we can do image recognition, and we can mix all these things together. We can also deal with cases where a lot of information is missing.
This is one of the myths of big data. People think, "It's huge data, so no problem." But big data usually also means a lot of data is missing and a lot of data is messy. The data isn't always clean, and the bigger the data gets, this problem doesn't get any smaller; it gets bigger as well. Statistics usually can't cope when there are missing values, because it has some very fundamental requirements. In computer science we can deal with this missing data: we write a bit of code that deals with missing data. The data is messy? We write a bit of code that massages the data, or whatever it is. So that makes it very much more flexible than statistics. Statistics is very powerful when it works, but when it doesn't work, then computer science comes in.
What is "correlation does not imply causation"? Is the idea that just because you see a pattern doesn't mean it's relevant? Would that be a way of putting it?
Well, it's like this: only because you're carrying an umbrella doesn't mean it's going to rain today, right? You observe things together, but what you don't know is whether A is related to B, or B is related to A, or maybe, and usually, there's actually another reason underlying it all, which is what's really happening.
In a lot of the medical work we do, this is interesting. We were looking at patients taking certain drugs, and we want to know whether the drug is really helping them or not. After they've taken the drug, something might happen to them, something you don't want; maybe they have a heart attack. The question of course is: was it the drug that caused the heart attack, or was it something else? It's fundamentally important to understand that you can't just conclude, only because they took a drug, that that's why they had the heart attack. Usually it's not because of the drug at all.
Yes, otherwise we could say, "I drank water, and therefore..."
Exactly. It's something called confounding, in technical terms. There are lots of confounding issues, and you need to understand the difference between them.
You will have heard of this one; it's in the newspapers every year: eating salt is bad for you. Why is eating salt bad for you? Because it apparently causes high blood pressure. Where's the evidence for this? I looked into this. In 1988 there were some studies on this, and this is what this picture shows. It's from the British Medical Journal. They were looking at different countries, and basically at salt in urine. Each of these dots is a different population, a different country. They measured, for thousands of people, how much salt there was in the urine, and they also measured the blood pressure. This is the graph: this is how much salt, this is how much blood pressure, and you can see there is a sort of slight positive correlation there. It was actually based on this study, which is kind of a meta-study, that the conclusion was: OK, more salt must lead to higher blood pressure.
However, if you look at this picture properly, you might think: wait a moment. There are lots of points here, and there are four over there that look a bit strange. And actually, if you were to remove those four points from the analysis, your trend would suddenly become negative. It's only because of these that the trend is positive. It turns out that these outliers are from countries where salt isn't very popular in the diet, non-industrialized countries in Africa. So is it possible that this study is flawed? This data should never have been included, because it's not what we are used to in Western society, and if you look at ours, you might actually think salt is good for you, because it seems to slightly reduce blood pressure.
There has been more work on this since, and there probably is a link between salt and blood pressure, but it's not as obvious, and there may be much more going on. What hasn't been established at all is whether slightly raised blood pressure might actually be good for you, because it might also help you in other ways. It's not the data itself; it's the interpretation, the limitations of it, and getting that right.
Rubbish in, rubbish out?
Yes, actually making sure that what you put in makes sense, and thinking of confounding. If you use data that you shouldn't have used, because it's from a country which is completely different, for example, like here, it doesn't help.
JPEG works by downsampling the chroma components and using other techniques as well. The majority of video compression methods will do the same thing.
So you're being compressed right now?
I certainly am, yes.