Data Analysis 2 - Data Visualisation - Computerphile

**Exploring Chicken Data: A Journey into Statistics and Visualization**

We have a dataset of chickens weighed on different diets over a period of weeks, along with the number of eggs each chicken laid per week. The question is whether any of the diets produces better weight gain, and perhaps more eggs per week. We'll begin with average weights and how they change over time.

We first aggregate the weight of the chickens by both week and diet and calculate the mean for each combination (week one on diet A, week two on diet A, and so on). This gives the average weight per diet over time and lets us see whether diet is associated with weight gain.

We then rename the columns of the result so that they're a little bit more informative. Clear labels make the aggregated table, and the plot built from it, easier to read.
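The transcript describes this step as a call to R's `aggregate` function. As a rough stand-in, the following Python sketch groups rows by week and diet and averages the weights in each group. The rows, the values, and the function name `mean_by_week_and_diet` are made up for illustration; they are not the video's data or code.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only: (week, diet, weight_kg).
rows = [
    (1, "A", 3.0), (1, "A", 3.2),
    (1, "B", 3.1), (1, "B", 2.9),
    (2, "A", 3.4), (2, "A", 3.6),
    (2, "B", 3.0), (2, "B", 3.2),
]


def mean_by_week_and_diet(rows):
    """Return {(week, diet): mean weight} for every combination present."""
    groups = defaultdict(list)
    for week, diet, weight in rows:
        groups[(week, diet)].append(weight)
    return {key: mean(values) for key, values in sorted(groups.items())}


mean_weight = mean_by_week_and_diet(rows)
for (week, diet), value in mean_weight.items():
    print(f"week {week}, diet {diet}: mean weight {value:.2f} kg")
```

Keying the groups on the `(week, diet)` pair is what makes this a two-way aggregation: each diet gets one average per week, which is the shape needed for a line-per-diet plot.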

Plotting these averages, we see that the groups start at roughly the same weight at the beginning of the experiment, and that the average weight of the chickens on diet A then does seem to increase. This hints that diet A may be having a positive effect on weight, although a plot of group means is only a first look at the data, not a statistical test.

Next, we examine the number of eggs produced by our chickens. We aggregate the egg counts by week and by diet in the same way, averaging over all the chickens, and plot the result over the twelve weeks of the experiment. This shows whether diet is associated with egg production.

The plot shows that diets B and C produce roughly the same number of eggs per week, while diet A produces at least one more egg per week on average, roughly a 20% increase. This looks like good news for diet A, but it may be a little too good to be true: correlation doesn't imply causation, and we need to check whether something other than diet explains the difference.

To do this, we'll focus on the age of our chickens. We want to know whether older chickens produce more or fewer eggs than younger ones. Grouping by diet and computing the mean age shows that the chickens on diet A average only 156 weeks, while those on diet C average 248 weeks. A box plot of age per diet confirms that the diet A chickens are significantly younger, so age may be a confounding variable in our experiment.
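The transcript also compares the groups with a box plot of age per diet. A box plot is drawn from a five-number summary (minimum, first quartile, median, third quartile, maximum), so a minimal sketch of that summary per diet is shown here. The ages are invented for illustration, and the helper name `five_number_summary` is an assumption, not something from the video.

```python
from statistics import quantiles

# Made-up ages in weeks, for illustration only.
ages_by_diet = {
    "A": [100, 120, 140, 160, 180],
    "B": [200, 220, 240, 260, 280],
    "C": [210, 230, 250, 270, 290],
}


def five_number_summary(values):
    """Return the five values a box plot is drawn from."""
    # method="inclusive" treats the data as the whole population,
    # so the quartiles stay within the observed range.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
    }


summaries = {diet: five_number_summary(ages) for diet, ages in ages_by_diet.items()}
for diet, summary in summaries.items():
    print(diet, summary)
```

If the boxes for one diet sit well below the others, as diet A's does in this toy data, age differs systematically between the groups and has to be accounted for before crediting the diet.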

To further investigate this, we'll create a scatterplot of age versus the number of eggs produced per week. By coloring the data by diet, we can see where each group sits on the plot. This will allow us to visualize any potential relationships between age and egg production.

The scatterplot shows that as chickens get older, they produce noticeably fewer eggs per week, falling from about four and a half on average to about two or two and a half. The points for diet A sit predominantly at the young end of the plot, where egg production is highest.
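A downward trend like this can also be summarized as a single number with the Pearson correlation coefficient. The video does not compute one; this is an added illustration, and the ages and egg counts below are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up data for illustration: age in weeks, eggs laid per week.
ages = [60, 120, 180, 240, 300, 350]
eggs = [5, 4, 4, 3, 3, 2]

r = pearson(ages, eggs)
print(f"correlation between age and eggs per week: {r:.2f}")
```

A value close to -1 means eggs fall almost linearly as age rises. Like the scatterplot itself, it describes an association and says nothing about cause.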

The implications of this finding are significant. We may simply have put a load of young chickens on diet A: they produce more eggs, but because they are younger, not necessarily because of the diet. A link between diet and egg production is still theoretically possible, but we can't establish it from this data. We would need more data, and probably more chickens, to separate the effect of diet from the effect of age.
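One common way to probe a suspected confounder is to compare the groups within bands of the confounding variable. The video does not do this; the sketch below is an added illustration on invented records, and the 200-week cutoff and the names `age_band` and `mean_eggs` are arbitrary assumptions.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration: (diet, age_weeks, eggs_per_week).
records = [
    ("A", 80, 5), ("A", 120, 5), ("A", 260, 3),
    ("B", 100, 5), ("B", 250, 3), ("B", 300, 3),
    ("C", 90, 5), ("C", 270, 3), ("C", 320, 3),
]


def age_band(age_weeks, cutoff=200):
    """Split chickens into two bands; the cutoff is an arbitrary assumption."""
    return "young" if age_weeks < cutoff else "old"


def mean_eggs(records, key):
    """Average eggs per week for each group produced by the key function."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {group: mean(values) for group, values in groups.items()}


overall = mean_eggs(records, key=lambda r: r[0])
stratified = mean_eggs(records, key=lambda r: (r[0], age_band(r[1])))

print("by diet:", overall)
print("by diet and age band:", stratified)
```

In this toy data diet A looks best overall only because it has more young chickens; within each age band the diets are identical. Real data would rarely be this clean, which is why more data is needed before drawing a conclusion.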

As we conclude our exploration of the chicken data, the main lesson is that statistics and visualization together are a good way to explore a dataset and generate initial hypotheses. Here they revealed a correlation between diet and egg production, and then showed that age could explain it. Correlation doesn't imply causation, and plots must be used appropriately, without making assumptions about the data.

The next steps are to look at cleaning the data, and then perhaps at using it in clustering and classification. Visualization remains a really good way to start exploring your data before any of that.

**Transcript**

Let's talk about data visualization so that we can avoid problems like this, which is where we've got some kind of graph. Who knows what it means? Loads and loads of lines, none of them labeled. I think the thick one is more important; that's what I've learned from this.

Data visualization is another method we can use, along with statistics, to have a look at our data, explore our data, and try to work out what's going on. It's a way of trying to understand our data better so that we can then perform more rigorous statistical tests, or actually start to draw conclusions or model our data. It's a very important tool, but you've got to use it properly. You can't just plot anything and everything. Every chart you use has got to support your hypothesis, or it's got to try to show the story you're trying to tell. You don't just plot something because it could be plotted; there's got to be a point to it.

There are a lot of problems with using inappropriate graphs and only picking subsets of your data. That's a huge problem, right? That is not just a problem for data visualization; that's a problem for your statistical tests as well. If you're only using some of your data, is that okay? It's going to depend on the situation, but I think there's a strong argument for saying you've got to be really, really careful, and you've got to be really structured and regimented and document everything you do.
The core problem with visualization is that people just plot stuff and they do it badly. Maybe they use an inappropriate plot type, or they don't scale the axes properly, and that leads to huge misunderstandings and can actually be quite misleading. This happens a lot in the media. So, for example, you might get a sort of political message through your door that says these are the different parties: this is party one, this is party two, this is party three. Maybe party one's got this many votes, party two's got this many votes, and party three is right down here, and party two are trying to make the case that with just a few more votes they're going to win in this area. But actually, written down here, this is twenty thousand, this is ten thousand, and this is, you know, eight thousand, and just in the small labeling they've got here they've completely skewed the axis. Ten thousand is half of twenty thousand, yet here we are, up here.

If you misuse plots, it's actually misleading. When it's your own data, you're going to draw the wrong conclusions and then spend quite a while researching an area that doesn't make sense or ends up in failure. Or, if it's something you're presenting to someone else, you can mislead that person, whether intentionally or by accident, and that's never a good thing.

I'm back in R, and I just wanted to show a couple of plots that are not misleading necessarily, but from which you can easily infer the wrong kind of information. There are websites online you can go to to look at the ratings for different TV shows. One of my favorite TV shows is Frasier. I think it's amazing. If you go on to these sites and plot the ratings for all the Frasier episodes, it's all over the place. Sometimes it's very highly regarded and sometimes it's not. So I'm just going to plot this using the ggplot tool, and if we look at the graph we can see that it's absolutely everywhere. You've got good episodes.
You've got bad episodes, and it seems to maybe be going slightly downhill towards the end, but it's difficult to say because it's all over the place. Now, what's actually happened is I've just plotted using a default function and it's auto-scaled my rating axis. My y-axis is the rating of the episodes, and it's going between seven and about nine and a half. That isn't representative, because it's spreading out my data. If I plot the exact same data, but this time from nought to ten like an actual rating system, you can see that most episodes get almost the exact same rating, somewhere between around seven and a half and eight, which I think's pretty good. I would rate them a 10, but that's just me. You can see that if you're not careful, even if you do it by accident, auto-scaling an axis and things like this can cause a real problem.

Another classic example you'll see in the news is when they show something like a currency exchange rate. So if we look here, I've downloaded some sample data of the Japanese yen versus the US dollar, and I've simplified this by extracting just a period of about 60 days in the middle of some time; I can't remember exactly when it is. If we plot this, you can see that there's a big sort of cliff edge. Something terrible has happened around day 30 and the value of the Japanese yen is just plummeting. And of course, this is absolute nonsense, right?
Because this scale goes between 108 and 114, and so if we plot it with proper axes you can see that actually it's almost completely flat. If your business relies on the exchange rate of the Japanese yen to the US dollar, obviously these small changes might be important. But if you're presenting this in the news, it's very easy to claim that something terrible's happened when in fact maybe this is just a normal blip up and down. So you can misuse plots to serve your purpose, or you can do it accidentally and waste a huge amount of time.

Let's have a look at the standard plots you might see and could use on a very basic level, and see what they are appropriate for, because one of the most important things is that you use these plots and charts appropriately. Perhaps the most common one that everyone sees is going to be a bar chart. You've got two axes. You've got some kind of attributes or labels down here, and then you've got some quantity or amount of some attribute here, and then you're going to have different bars like this. This is a very nice graph to use. It's simple but it's effective, because you can very easily see what the differences between these different levels are. So it's often going to be your go-to graph for lots of things.

Now, some people try to replace this graph with a pie chart. This is a bad idea in general. I mean, I like pie as much as the next person, but if you've got different things like this and one of them is big, you can see that this one's bigger than this one, but how much bigger is it? I don't know. You can't see the relative sizes quite so easily. This all gets worse if you combine this into a doughnut plot, and then you've got multiple pies embedded in each other, none of them align, and nothing makes any sense anymore. So if in doubt, don't use a pie chart; it's a bad idea.
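Returning to the axis-scaling point made with the exchange-rate example: the effect can be put into numbers by asking how much of the axis the data's own range occupies. This is a minimal sketch on invented readings within the 108 to 114 range mentioned above. It assumes "proper axes" means an axis starting at zero, and the upper limit of 120 and the name `fraction_of_axis_used` are arbitrary choices for illustration.

```python
def fraction_of_axis_used(values, axis_min, axis_max):
    """Share of the axis height spanned by the data's own range."""
    return (max(values) - min(values)) / (axis_max - axis_min)


# Made-up exchange-rate readings, for illustration only.
rates = [113.5, 113.0, 112.8, 109.0, 108.6, 108.9]

# Auto-scaling fits the axis tightly around the data.
auto_scaled = fraction_of_axis_used(rates, min(rates), max(rates))
# A zero-based axis; the upper limit of 120 is an arbitrary assumption.
zero_based = fraction_of_axis_used(rates, 0, 120)

print(f"auto-scaled axis: data spans {auto_scaled:.0%} of the height")
print(f"zero-based axis:  data spans {zero_based:.0%} of the height")
```

With a tightly fitted axis the drop fills the whole plot and looks like a cliff edge; on the zero-based axis the same drop occupies only a few percent of the height and looks almost flat.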
I mean, they look very nice for presentations; that's about all I can say for them. If we're going to be measuring some kind of quantity, then a bar chart's going to be what we want. But what we might also do is replace quantity with the frequency, or the amount of something. So this is going to be frequency, and again we've got our labels on the bottom here, and these are going to be bins for some single attribute. So this is maybe nought to 10, this is maybe 10 to 20 of whatever the thing is, and this is the frequency, the amount that fall into that range. What this allows us to do is work out very easily what the distribution is. Is it normally distributed? Is it bimodally distributed, with two peaks? Is it skewed to the left or skewed to the right? We can see very easily the shape of our data, and it can be really helpful.

Another way of looking at the shape or the range of our data in particular is a box plot. You'll see box plots come up from time to time in scientific documents, but they're very easy to produce in tools like R and they can be quite useful. Here we're going to have a single attribute, so some label or attribute here, and this is going to be the quantity of this attribute. What a box plot does is label the range of that data. So we're going to have a box here like this, and it's going to look a little bit like this; I'll use a different color pen. This line in the center is our median, typically, and then this is going to be the third quartile here, this is going to be the first quartile, and then these are the max and the min. In this one plot we've got the absolute range of our data, we've got where 50% of our data is, this interquartile range here, and we know where the midpoint of our data is. So we can very easily see whether we've got outliers, and we can plot this next to a different attribute and have two box plots next to each other, and we can see very quickly, you know, a comparison
between these two things, so that can be really useful. Now, the final ones we're going to be talking about are scatter plots and trend lines. So, a scatter plot: very simple. We've got two attributes, this is attribute one and this is attribute two, and we want to see how they vary with respect to each other. So when one goes up, does the other one go up or does it go down? Are they even related at all? So you'll see something like this, and it'll be all over the place often, but you can see maybe there's a kind of trend where as attribute one increases, attribute two increases. Now, this is a correlation being shown here, not a causation. So you can't say they're definitely related, but you can say that, generally speaking, when one is big so is the other. That's sometimes useful.

A trend line is going to be where we're plotting something over time. So this has to be a continuous variable, or at least a variable we believe can be inferred between our points, because it's unlikely that you're going to have all the points. So what you might have is a plot where you've got time down here, maybe time in months, for example, and we've got some amount of something, and we're just going to plot it like this, and we can sort of have a trend line going like this. If it's a situation where we can infer the amount between two time points, then this is okay, because we can say: look, we've got a reading here, we've got a reading here, and it's reasonable to assume that between these two points this is the amount. Nothing too funny's gone on between these two points. If you can't assume that, then you shouldn't really be using a trend line and you probably want to be using a bar graph.

Does that depend on the kind of data, then? Yes, it'll depend on it. This is a judgment call based on the kind of data. I mean, time is a good example.
We don't tend to measure in infinitely small increments. We're going to be measuring daily or hourly or something like this, but we can kind of make an assumption a lot of the time about our readings. Temperature over time, for example: if you're at 20 and then the next hour you're at 25, we're probably halfway between the two at the midpoint between those two times. It's going to depend on your data. A good example would be if you were plotting something like operating system usage per student. So we've got OS X here, Linux here, and Windows here: this many people use OS X, this many people use Linux, this many people use Windows. Well, these are discrete data points. You can't fit a trend line to these. There is no operating system that's 50% between Linux and Windows that I know of, and we can't infer how many students are going to be using it. That makes no sense. That should be a bar chart.

So let's look at an actual data set and see how we can use some of this visualization in practice. I've got here a chicken data set, and this data set is about weighing chickens on different diets over a period of weeks and also measuring how many eggs they produced. I'm not a farmer, but let's imagine that what we wanted to do was see if one of these diets produces a better weight gain and maybe more eggs per week. Let's have a look. So I'm going to load the chicken data set. This is stored in a CSV, just like before. Let's have a quick look at just the first few rows of this data to see what they look like. That's going to be the head function, and you can see we've got six attributes. We've got the week that the measurement was taken; the chicken, in this case chicken number one, but there'll obviously be other chickens; the diet, whether they're on diet A, diet B or diet C; the age of the chicken in months; the weight of the chicken in kilograms; and the number of eggs they produced that week.
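The loading step is done in R in the video. As a rough Python equivalent, the sketch below parses a small CSV with the six attributes just listed and prints the first few rows, much as `head` does. The column names and every value are invented for illustration, since the real file's headers are not shown.

```python
import csv
import io

# Hypothetical file contents: column names and values are invented.
CSV_TEXT = """week,chicken,diet,age,weight_kg,eggs
1,1,B,210,3.4,3
1,2,A,150,3.6,4
1,3,C,260,3.3,3
2,1,B,211,3.5,3
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with numeric fields converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "week": int(raw["week"]),
            "chicken": int(raw["chicken"]),
            "diet": raw["diet"],
            "age": int(raw["age"]),
            "weight_kg": float(raw["weight_kg"]),
            "eggs": int(raw["eggs"]),
        })
    return rows


rows = load_rows(CSV_TEXT)
# Look at the first few rows before doing anything else with the data.
for row in rows[:3]:
    print(row)
```

Each row is one chicken in one week, so the same chicken appears once per week of the experiment.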
So there's going to be lots of combinations of weeks and chickens in this data set. Now, what we want to try to do is see if there's any kind of relationship between the diet they're on and the number of eggs they're producing, or the weight of the chicken, or anything like this. The first thing we could do is have a look at the aggregate function. I'm going to paste this down here and we'll talk through it. What the aggregate function does is let us produce, let's say, a summary, or calculate some means or medians over a data set, but this time grouping by a certain attribute. In this case, we're going to aggregate the weight of the chickens in groups of their diet, so all the As, all the Bs and all the Cs, and then for each of those we're going to calculate a summary. So let's run that, and you can see that we've got our groups down here. For A we've got the minimum, the maximum, the median and the mean, and we can see some slight differences, perhaps, in these data sets. The mean, for example, of group A is 3.8, whereas the mean for group C is 3.4. So maybe there's a slight difference in these things.

Okay, so let's try a different aggregate function. This time we're going to aggregate the number of eggs produced, grouped again by the diet. So this is going to be all the As, all the Bs and all the Cs, and then we're going to produce a summary. We can see that the median number of eggs produced for group A is 4 per week, and for group B and group C it's 3 per week. So maybe again there's a slight difference. We're starting to learn a little bit about our data.

So let's start with a histogram. We're going to use this histogram call, which is mostly labels; the hist function in R produces a histogram, and we're going to produce a histogram of the ages of the chickens.
So what's the distribution of the ages? Are they old, are they young? And we're going to use 15 breaks. That means we're going to take the whole range and break it into 15 columns, 15 bands. Actually, R will do a few checks behind the scenes to make sure 15 is an appropriate number, and might adjust it up or down slightly. So we can see from this histogram that, broadly speaking, our chickens are evenly distributed among the different ages. We've got some young ones that are sort of 60 or 70 weeks old, older ones that are 350 weeks old, and then for some reason we've got a peak around 250. I don't know why that is, but maybe we've got a batch of a certain age of chickens in.

And finally, let's look at the box plot. So we talked about the box plot: it will tell us the minimum and the maximum for an attribute, and also the median and the range. So this is really helpful. We're just going to have a look at age, for all chickens. You can see that the median is around 220, something like that, and then the majority of the chickens, so 50% of the chickens, fall between about 150 weeks old and 300 weeks old, but you can see there are some very young ones and some very old ones. This kind of plot will let us really size up where our data sits before we start to make any assumptions.

So let's imagine now that we want to drill down into this data a bit and work out whether the diet actually had any effect on the number of eggs or the weight of a chicken. What we're going to do is use the aggregate function again to calculate the means of all the weights per week.
So I'm going to copy that down here. We're going to say: aggregate the weight of the chickens by both the week and the diet, so combinations: week one, diet A; week two, diet A; and so on. And then I want you to calculate the mean over all chickens. So, run that. That produces some statistics on the different average weights of chickens over time. I'm going to rename the columns so that they're a little bit more informative, so run that line there. And then finally, we're going to plot this. We're going to use ggplot for this. Whether you use the inbuilt R plot functions or a library like ggplot will kind of depend on what plot you want to do. In general, you can get quite nice plots with ggplot, but they're a little bit more involved. So I'm going to run this line here.

Looking at this data, we can kind of see that maybe diet A is having a positive effect. Down at the bottom, where no weeks have passed, at the beginning of our experiment, they were roughly the same weight, and then the average weight of A actually does seem to increase. So I guess that's something interesting about our data.

Now let's look at the number of eggs. We're going to do the same thing: this time we're going to aggregate the number of eggs by week and by diet. So I'm going to copy that, and I'm going to give it some helpful labels as well, and then we're going to plot the data. Let's see, over time, whether or not any of the diets has any effect on the eggs. And it's looking pretty good. So this is the frequency, the number of eggs we're producing; the weeks are the twelve weeks of our experiment; and you can see that diets B and C produce roughly the same number of eggs per week. This is averaged over all the chickens. But diet A produces at least an egg more per week on average. You know, that's a 20% increase, roughly speaking.
If you're a farmer, that's a great thing. But the problem we've got is that this might be a little bit too good to be true. What we're seeing here is perhaps an issue of correlation versus causation. We can see here that there is a correlation between the diet that's being used and the number of eggs, but we don't know that it's the diet specifically that causes it. We're going to look in more detail at the ages of the chickens specifically, because I'm interested to know whether or not older chickens produce more or fewer eggs, because that could be relevant to our experiment.

Okay, so we're going to group the chickens up by diet and then work out what their average age is, so mean age. I'm going to calculate this here, and then I'm going to look at it. We can see that the average age, the mean age, for group A, or the chickens on diet A, is only 156 weeks, but the age for, let's say, group C is 248 weeks, so significantly older. So we need to just check that this isn't going to be an issue for the number of eggs laid. Let's plot the number of eggs versus the age of the chickens. Here we're going to generate a scatter plot of age versus the number of eggs, but we're also going to color by diet so we can see roughly where the different diets sit.
Let's run this. Okay, so what we can see is that, as chickens get older, we do see a quite serious decrease in the number of eggs produced per week, from about four and a half on average down to about two and a half or two on average. And also we can see that A is predominantly sitting up here, which means that those chickens are younger. So this could be a problem. What we're saying is that it could be that we happen to have put a load of young chickens on diet A, and yes, they're producing more eggs, but that isn't because of diet A; that's because they're younger. So let's have a look at a box plot of the age of chickens per diet, and you can see that they're significantly younger on diet A. So I think the conclusion we can draw is that it's theoretically possible that there's a link between the diet and the number of eggs produced, but we can't really say it from this data. We're going to need a lot more data, maybe some more chickens, to try to work this out.

We've seen a number of different visualizations, and the important thing is that we use visualizations appropriately and we don't make assumptions about our data. So we're going to start to look at cleaning of data, and then maybe using our data in clustering and classification, but visualization is a really good way to start off exploring your data and generate some initial hypotheses.

Well, we're looking at chocolate datasets today, so I thought I'd bring some research. Yeah, good, and definitely relevant.