Intro to Data Science - Crash Course for Beginners

# Exploring Data Science: A Comprehensive Guide

## Introduction to Data Science

Data science is an interdisciplinary field that involves extracting insights from data using various techniques such as programming, statistics, and machine learning. It is a powerful tool for understanding patterns, making predictions, and driving decision-making in fields ranging from business to healthcare. The process of data science typically involves cleaning, transforming, analyzing, and visualizing data to uncover meaningful information.

## Tools for Data Science: Python Libraries

One of the most essential aspects of data science is the use of programming tools and libraries. Python is one of the most popular languages for data science due to its simplicity and extensive library support. Two of the most widely used libraries in Python for data analysis and visualization are **pandas** and **matplotlib**.

### Pandas: Data Analysis Made Easy

Pandas is a powerful library for data manipulation and analysis. It provides DataFrame objects that let you handle tabular data, similar to spreadsheets or database tables. With pandas, you can filter rows, select columns, calculate statistics, and merge datasets with ease. For example, `df.corr()` computes the pairwise correlations between the columns of a numeric DataFrame in one line, and `df['a'].corr(df['b'])` gives the correlation between two specific columns (`a` and `b` are placeholder column names). This makes data analysis efficient and accessible even for those without extensive programming experience.
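As a minimal sketch of the operations described above, the snippet below builds a small made-up DataFrame, filters rows, calculates a statistic, and computes correlations. The data and the column names (`hours_studied`, `exam_score`, `group`) are assumptions chosen purely for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only; the column names are assumptions.
df = pd.DataFrame(
    {
        "hours_studied": [1, 2, 3, 4, 5],
        "exam_score": [52, 58, 65, 71, 80],
        "group": ["a", "b", "a", "b", "a"],
    }
)

# Filter rows: keep the students who studied more than two hours.
studied_more = df[df["hours_studied"] > 2]

# Select a column and calculate a statistic on it.
mean_score = df["exam_score"].mean()

# Pairwise correlation matrix, restricted to the numeric columns.
corr_matrix = df[["hours_studied", "exam_score"]].corr()

# Correlation between two specific variables (Pearson by default).
r = df["hours_studied"].corr(df["exam_score"])

print(studied_more)
print(f"mean exam score: {mean_score}")
print(corr_matrix)
print(f"correlation: {r:.3f}")
```

Selecting the numeric columns before calling `corr()` keeps the example independent of how a given pandas version treats text columns such as `group`.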

### Matplotlib: Customizable Data Visualization

Matplotlib is another essential Python library, primarily used for creating static, animated, and interactive visualizations. It lets you create a wide variety of plots, including line charts, scatter plots, bar plots, and histograms. One of the key strengths of matplotlib is its flexibility: you can customize almost every aspect of your plot, such as colors, labels, grids, and legends. For instance, to add a title to your plot, you can use `plt.title('Your Plot Title')`, where `plt` is the conventional alias from `import matplotlib.pyplot as plt`. This level of control makes it an invaluable tool for presenting data in a clear and visually appealing manner.
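The customization options mentioned above can be sketched roughly as follows. The monthly values are made up, and the `Agg` backend is selected only so the script runs without a display; neither comes from the original material.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up values for illustration only.
months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 9, 15, 18, 17]

fig, ax = plt.subplots()
ax.plot(months, sales, color="tab:blue", marker="o", label="Sales")

# Object-oriented equivalent of plt.title('Your Plot Title').
ax.set_title("Your Plot Title")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.grid(True)  # show grid lines
ax.legend()    # show the legend built from the `label` argument

# In an interactive session, plt.show() would display the figure;
# fig.savefig("plot.png") would write it to disk instead.
```

The sketch uses the `fig, ax` interface rather than bare `plt.` calls because it keeps every setting attached to an explicit axes object, which tends to scale better once a figure has several subplots.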

## Programming for Automation and Customization

Programming is a fundamental skill in data science. It allows you to automate repetitive tasks, customize analyses, and develop tailored solutions for specific problems. By writing code, you can transform raw data into actionable insights without being constrained by the limitations of manual methods or rigid software interfaces. For example, instead of manually calculating the mean of a dataset, you can write a script that computes it automatically. This not only saves time but also reduces the risk of errors.
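A script of the kind described above might look like the following sketch. The helper name `compute_mean` and the sample values are hypothetical; the standard library's `statistics.mean` is shown alongside it as a cross-check.

```python
from statistics import mean


def compute_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("cannot compute the mean of an empty dataset")
    return sum(values) / len(values)


# Made-up measurements, for illustration only.
walking_times = [20, 25, 22, 23, 21]

manual = compute_mean(walking_times)
builtin = mean(walking_times)  # standard-library equivalent

print(f"manual mean:  {manual}")
print(f"library mean: {builtin}")
```

Once the calculation lives in a function, rerunning it on new data is a one-line change, which is where the time savings and the reduced risk of manual error come from.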

## Multi-Variable Graphs and Heat Maps

When dealing with complex datasets, it is often necessary to visualize relationships between multiple variables. Scatter plots and line graphs are commonly used for this purpose, but there are other advanced visualization techniques that can provide deeper insights.

### Heat Maps: Tracking Intensity and Patterns

Heat maps are a type of visualization that displays the intensity or frequency of data points in a two-dimensional format. They are particularly useful for tracking patterns over time or space. For example, a heat map could show customer movement within a store, with darker colors indicating areas where customers spend more time. This can help businesses optimize product placement and improve the overall shopping experience.
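As a rough sketch of the store example, the snippet below draws a heat map with matplotlib's `imshow`. The grid of dwell times is randomly generated and the 4 x 5 zone layout is an assumption made only for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical average dwell times (minutes) on a 4 x 5 grid of store zones.
rng = np.random.default_rng(seed=0)
dwell_minutes = rng.uniform(0, 10, size=(4, 5))

fig, ax = plt.subplots()

# With the "Blues" colormap, higher values are drawn in darker shades.
image = ax.imshow(dwell_minutes, cmap="Blues")

# The colorbar maps shades back to the underlying values.
fig.colorbar(image, ax=ax, label="Average minutes spent")
ax.set_title("Customer dwell time by store zone")
ax.set_xlabel("Zone column")
ax.set_ylabel("Zone row")
```

Whether darker means "more" depends entirely on the colormap, so a colorbar is worth including whenever a heat map is shown to other people.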

### Multi-Variable Bar Plots: Comparing Multiple Metrics

Multi-variable bar plots allow you to compare multiple metrics across different categories or groups. Instead of creating a separate plot for each metric, you can combine them into a single visualization, making it easier to identify trends and relationships. For instance, if you want to analyze the performance of different teams in a sports league, you could plot metrics such as goals scored, shots on target, and possession percentage side by side on a single grouped bar chart. This provides a comprehensive view of each team's performance.
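The sports-league example could be sketched as a grouped bar chart like the one below. The team names and numbers are invented for illustration, and offsetting the bars by a fixed `width` is just one common way to place several metrics side by side.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up team statistics, for illustration only.
teams = ["Team A", "Team B", "Team C"]
metrics = {
    "Goals scored": [45, 38, 52],
    "Shots on target": [60, 55, 70],
    "Possession (%)": [58, 49, 61],
}

x = np.arange(len(teams))  # one group position per team
width = 0.25               # width of each bar within a group

fig, ax = plt.subplots()
for i, (name, values) in enumerate(metrics.items()):
    # Shift each metric's bars sideways so they sit next to each other.
    ax.bar(x + (i - 1) * width, values, width, label=name)

ax.set_xticks(x)
ax.set_xticklabels(teams)
ax.set_ylabel("Value")
ax.set_title("Team performance by metric")
ax.legend()
```

All three metrics share one y-axis here even though their units differ. That works for these invented numbers, but metrics on very different scales may need normalizing or a second axis before they can be compared fairly.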

## Conclusion: The Future of Data Science

Data science is a rapidly evolving field with endless possibilities for innovation and discovery. By mastering programming tools like pandas and matplotlib, and by leveraging advanced visualization techniques such as heat maps and multi-variable bar plots, you can unlock the full potential of your data. If you are interested in learning more about data science or exploring additional resources, I encourage you to visit my website, **codingwithmax.com**, where I offer a variety of articles, cheat sheets, and courses designed to help you master the fundamentals of data science.

If you have any questions or need further guidance, feel free to reach out to me directly. The world of data science is vast and exciting, and with the right tools and knowledge, you can make a meaningful impact in your chosen field. Happy coding!

"WEBVTTKind: captionsLanguage: enhey everyone and welcome to my mini course on the essentials of data science this mini course provides a super basic looking to data science what it is and the three main components that make up data science data science is a very mainstream word like it's thrown around a lot but its actual definition is quite vague this mini course is designed to help those of you who are curious about data science develop a better and more specific understanding of the topic there are definitely more advanced techniques within data science such as machine learning but even these can be traced back to the three essential components that we'll cover before we get straight into it I thought I'd quickly introduce myself my name is Max and I work as a data scientist after getting my degree in physics I find myself more and more drawn into the world of data science so instead of diving into the realm of physics research I taught myself all the tools and techniques a data scientist needs and shortly after landed my dream data science job I've since also started teaching data science to others and have been fortunate enough to teach what is currently over 9,000 students the skills of gathered and learned over the past five years of my data science journey so let's jump right into it so what is data science well data science is you can kind of summarize it in different ways but the main parts of it are transforming data into information and this is a really big step because a lot of people talk about you know data and big data and all these things but data by itself isn't really that useful until you can turn it into information and so if you just have a bunch of numbers appearing somewhere and it's just you know so much of it no one can make sense of that and that's where you need a data scientist to be able to transform all of these all of this vagueness and kind of this noise to that's going on and you need to be able to extract information from it and 
that's what a data scientist does. Now, how you get this information is through analyzing your data. A big part of it is cleaning things up and doing some processing, and then you analyze once you've cleaned things up. That is one of the ways you can get information out of your data: through this analysis. You can continue on, and you see trends and patterns and, hopefully, all types of correlations, and all of these things again build up into this component of turning data into information. Then, ultimately, you also need to contextualize everything that you have. Your computer can't do that for you. Your computer can crunch the numbers, but it's your responsibility to make sense of what's in front of you. Even if you see something, you don't just blindly trust it; you need to understand where you're at, where you're coming from, and where this data is coming from. You need to be able to contextualize these things, and then of course be able to apply as well as understand them. So once you have this data, it's great, but turning it into great information that you can use and directly apply, that's where the real power lies, and that's also the role of a data scientist. So that's pretty much what data science is.

So what does a data scientist do? We already talked about this a little bit, but let's go over it again with more concrete examples. A data scientist would, for example, get and process raw data and then convert it into something a little bit cleaner. You can imagine a data stream coming in: you have a measuring device that is constantly measuring all sorts of data, and because nothing is really constant, everything will be fluctuating up and down. A data scientist's job would be to take all of this data, to kind 
of clean it up a little bit, maybe reduce the fluctuation that isn't supposed to be there, that's just background stuff going on, and then put it into a format so that you can easily plot it. And then we already get to the next point: once this data is cleaner, you can start doing some calculations on it, figuring out the core statistical components, like what the average values are. What am I really dealing with? You're getting a first look, a first understanding of what it actually is that you're tackling. Once you have this understanding, you can start to do some visualizations, which help you as a data scientist maybe see some trends or patterns already. But visualizations are also really key because they let you show your results to other people, and they are a great means of communication. So they help both you as a data scientist and others when you try to convey this information to them.

All right, and then finally you have to suggest some applications of the information. It's not really enough to just look at it and say, 'Yeah, I see it goes up and down, and that's good.' What does that mean? How does this transfer into something useful? That's also one of the key roles of a data scientist: transferring information into knowledge. So you've got this data-into-information step, but you also need to transfer this information into knowledge, and those are two really powerful things that are worth a lot. That's pretty much what a data scientist focuses on. You can go further and take this data and do machine learning with it, if you really understand what's going on or if you have some hypotheses about what could happen. So you can take things a lot further, but ultimately, turning data into information and then into knowledge, that's your role.

All right, so let's go into 
the essential techniques, or the essential components, of data science. The first essential component, and we touched on this already, is statistics. We're going to cover this later on, but let's give a quick rundown. In statistics, you need to understand the different data types that you can encounter. Data can come in different ways, and we'll go into more detail on this later, but it's not just that you get a bunch of numbers. Data can come in many different forms depending on the field that you're in, so you need to be prepared and be aware that data may not always be a direct number. Then, of course, you need to understand some key statistical terms, like the different types of means, and also understand fluctuations in data. The reason this is important is that these key statistical terms give you an overview of how the data is behaving, and depending on how the data is behaving, you may want to approach it differently. If you know that your data is very clean, with very little fluctuation, then if you visualize things, or maybe fit some curves to it, you can probably trust what's going on. But if there's a lot of fluctuation in your data, visualizing it is going to be much more difficult, because you just see jumps everywhere and you're not really sure which of this is actually true and which is caused by some interference somewhere, or by someone having messed with the system. All of these things will be hinted to you through statistical terms, so it's good to be comfortable with them and to be able to get some meaning out of them. All right, and then finally, in statistics you need to be able to split up and group, or segment, data points, so that when you have a big data set, you can split it up 
into smaller things, compare different regions, look in more detail into some things, and maybe isolate two components, because these things are probably going to be important and the rest you don't really care about that much. So it's about being able to pinpoint, isolate, and meddle with the data a little bit. These are the statistical components that we're going to look into.

All right, so the next big thing, and we've already talked about this too, is data visualization. We'll see why data visualization is a really key skill for data scientists, and we're also going to be covering the different types of graphs that you can use and how you can compare different numbers of variables. For example, you can have one-variable graphs, where you only look at one thing and want to see how it changes. You have your typical two-variable graphs, which you probably know, where you have an x- and a y-axis and can see how two variables relate to each other. Or you can have three-variable or even higher-variable graphs, where you plot three different things, or even more if you want, next to each other, as long as it makes sense, so that you can compare multiple things at the same time.

All right, and now we come to the other big thing that you're probably going to need as a data scientist, which is the ability to program. Not every data scientist can do this, but it is really essential, in my opinion, to your role as a data scientist, because knowing how to program is going to make your life so much easier. If you know how to program, you can take your ideas and your thoughts and put them into action on the computer. You can automate everything, you can customize things, you can explore, you can prototype, you can test, and you're not reliant on some application. You don't have to master some application where, if it doesn't work or if one feature 
isn't there, you have to contact customer support, and maybe it's not even possible, and then you have to wait for an update, or maybe something is bugged. With programming, you're so much more reliant on yourself, and you can really just do whatever it is you want to do. You're not reliant on other people or on the tools that other people have built for you; rather, you can pretty much just go and do what you want to do without major roadblocks.

Then we'll also look at some essential packages in Python. In programming, you never want to reinvent the wheel; you always want to start off where the last person left off. The ability to program and write simple programs is something you would need to teach yourself, but you wouldn't need to write highly complex mathematical packages or data analysis packages. Those are already out there. All you need to do is download them and use them in your code, and they're going to work. They've been tested a lot, and there are huge communities working on improving them. All of this is for the community, so the whole community works together to improve it. No one's really directly trying to make a lot of money off of it, so they're not going to charge you service fees. Everyone's just trying to improve their package, because if it improves, everyone benefits from it. So we'll talk about some libraries that you can use, especially in Python, to help you along your way with data analysis and to become a successful data scientist.

In this chapter, we're going to talk about statistical data types. We're going to look at the three different types of data, which are summarized as numerical, categorical, and ordinal. These are the types of data that we talked about before, when we said you can't just expect your data to be numerical. So we'll see the 
numerical data, but we'll also see the two other types of data that you may encounter in your career as a data scientist.

All right, so let's talk about numerical data first. Numerical data is also known as quantitative data, and it's pretty much things that you can measure. It's numerical stuff that you can do math with and compare, where saying 'this plus this' makes sense, or 'A is greater than B'. Numerical data can be split up into two different segments. One of them is discrete. Discrete means the values only take on distinct numbers. An example of this would be a measurement of IQ, or, if you do coin tosses, the number of times that you toss heads. You can have 15 heads or 12 heads out of 20 coin tosses, you can have 500 heads out of 1,000 coin tosses, or 500 out of 600, but all of these are distinct numbers. They don't have to be whole numbers specifically, but they do have to be distinct. That's the very important part: there's a step size that you're dealing with. And of course you can still say that flipping eight heads out of 20 is better than flipping seven heads out of 20 if you want to flip heads, or that flipping eight out of 20 is worse than flipping seven out of 20 if you're going for as many tails as you can. So all of these comparisons make sense. That's the discrete part of numerical data.

Then we have the continuous part. Continuous really means that values can take on any number, and they're not limited by decimal places. A value that can be 1.1, where the next value would be 1.2, is not continuous; that's still discrete, because you have this step size of 0.1. Continuous means literally every number from start 
to finish can be taken on. This doesn't mean every possible number in the universe, from negative infinity to plus infinity, plus all imaginary numbers and everything that comes with it; that's not required for continuous data. It could really be that just every number between zero and one can be taken on. For example, let's say you have a bottle of water, and this bottle can hold one liter. If it starts off empty and you fill it all the way up to the top, the amount of water in it needed to take on every single number between zero and one. You can't fill up water in small increments and say, 'I'm going to put in 0.2 liters every single time,' because the water doesn't teleport from A to B. When you're pouring in water, it's more like the stream we see here: the water level rises and rises and rises, and so the amount of water that we have in our cup needs to take on every value between zero and one. So that's an example of continuous data. You see that we can be limited to values between zero and one; we don't have to start at zero and go all the way up to infinity. It's just that, within the range we're looking at, every single number can happen.

Another good example would be the speed of a car. Say you're standing still at a stoplight, and then you want to accelerate, and the speed limit is, say, 50 miles an hour. To get to 50 miles an hour from your starting position, your car has to take on every single speed in between. Of course, you won't see that on the speedometer; it would say something like zero miles an hour, one mile an hour, or maybe it goes 0.1, 0.2, 0.3, or something like that. So it may look discrete to you, but that's not how 
your car is going. Your car doesn't say, 'Oh, I'm going to go in these step sizes of speed.' It's going to accelerate and take on every value, starting from zero and going up to 50 miles an hour, and during this transition you're going to take on every single one of those speed values. So that's what continuous data looks like. It's important to understand the difference between discrete and continuous, because you may want to approach them differently. Of course, if we're dealing with computers, our computers can't deal with infinitely many decimal places; we have to cut it off somewhere, so continuous data is usually going to be rounded off at some point. But it's still important for you to know that you're dealing with continuous data rather than discrete, so that you know there can still be other values in between, rather than having specific step sizes where all you see is a bunch of lines at every step. With continuous data, you can expect that everything is filled in, that values can, and may well, fall anywhere in between. So that's the important thing to note about discrete and continuous.

All right, so the next type of data is categorical. Categorical data doesn't really have a mathematical meaning, and you may also know it as qualitative data. Categorical data describes characteristics. A good example of this would be gender. Here there is no real mathematical meaning to gender. Of course, in your data you can say male is zero and female is one, but you can't really compare the two numbers, even though you assign numbers to them. You may just do this so that you can split the data up later on, or so that your computer can understand it, but it doesn't really make sense to compare them. You can say male is not equal to female, 
but you can't really say one is greater than the other, or that one is approximately equal to the other. Those things don't really make sense, because they're not well defined. What would that mean? And you can't really add them up either: you can't say 'male plus female'; that doesn't give you a third category or something. So categories are things you can't really apply math to, but they are nice ways to split up or group your data, and they provide qualitative pieces of information that are still important. It's just that you can't really go about plotting them on a line or something like that. Those are the important things to note with categorical data. Other examples would be ethnicity or nationality; all of these are examples of categorical data. Like we said, you can assign numbers to them, but that's really just for your code, so that it's easy to split them up. You still can't really compare them. How are you going to compare nationalities? There is really no definition for comparing one category to another.

All right, and so the third type of data that you can encounter is something called ordinal data. Ordinal data is a mixture of numerical and categorical data, and a good example of this would be hotel ratings. You have star ratings: 0, 1, 2, 3, 4, or 5 stars, or maybe even 6 stars, or whatever the hotels go up to these days. But it's still not as straightforward to compare. I'm sure you've seen two different types of three-star hotels. One of them had the bare minimum; the beds were okay, but it wasn't really anything special. And then you had those three-star hotels that you could have sworn were at least four-star. So star ratings do make sense. We can say a four-star hotel is probably better than a three-star hotel, because there are standards for 
these things, and they have been checked. If you go to a four-star hotel, you know roughly what to expect. But still, it's not completely defined. Coming back to the three-star example, if you just say, 'Hey, we're going to a three-star hotel,' it's very hard to know exactly what to expect, because there are different kinds of three-star hotels. There are three-star hotels that have developed further and have, say, a swimming pool or something like that, and then there are those three-star hotels that are really more like hostels and have just made it past the two-star mark. So there it's much harder to define, or to know, what to expect.

If you take averages of these star ratings, though, then you do get a much better idea of what's going on. If you have consumer reviews, and you say, 'From 500 reviews, our hotel has an average rating of 3.8,' then you know that the three-star hotel you're looking at is pretty much a four-star hotel. It feels like a four-star hotel, even though it may not have all of the qualifying characteristics; that's the feel you get from it. Whereas another three-star hotel may have a rating of 2.9 or something, and there you know this hotel is more towards the lower end of three stars; some people may not even consider it to be three stars. Of course, this rating may be a little bit biased, because the reviewers went to a different three-star hotel first, and then they went to this one expecting something completely different, so they said, 'This can't be three stars, this is two stars.' But that's because of the way the ranking system is defined underneath. And so when we have averages of these ordinal numbers, they start to make a little bit more sense.

All right, so let's go over a small exercise and see if we can identify what type of data we're 
dealing with. The first thing we'll look at is a survey response on happiness. You have people filling out a survey, and one of the questions is, 'How would you rate your happiness?' with the options bad, neutral, good, or excellent. What type of data would this be? This would be ordinal data, because it's still in the form of categories, and you're asking for a subjective opinion, but the order does make sense, so you can still compare them. You can say excellent is greater than good, good is greater than neutral, and neutral is greater than bad. But what exactly does it mean to be good or excellent? Where do different people draw the line? There's still a little bit of vagueness involved, but generally it does make sense and you can compare the values, and if you have a lot of surveys and you average them, the values you get are probably going to be pretty representative.

All right, so let's look at the next thing, which is the height of a child. What type of data is that? We can say it's probably numerical, and, well, it actually most definitely is numerical. But let's go a little bit deeper: is the height of a child discrete or continuous? Even though when you measure height you get something like five foot, five foot three, or 160 centimeters, it's not a discrete value, because to get to that height you have to have reached every single height before it. At the moment you're measuring, you're rounding off to how much your measuring tape can measure, so your measuring tape is limiting the precision. But if you had a super precise measuring instrument, you could measure not just five foot three; you could really go into detail with the inches and the decimal places and 
everything going on there. So the height of a child would be a numerical data type, and it would be continuous.

All right, now let's talk about the weight of an adult. Do you expect the weight of an adult to be discrete or continuous? We can probably agree that it's numerical, because a weight value is pretty much defined to be a number. The right answer here is continuous again, because to reach a certain weight, a person would have had to reach every single weight in between before. So again, weight is something that we can consider to be continuous.

All right, and so finally, let's look at the number of coins in your wallet. Already by the name, 'number of coins,' we can probably agree that this is numerical data. But would the number of coins in your wallet be discrete or continuous? The answer is discrete, because it doesn't really matter what denomination your coins are. They could be 50-cent pieces, 25-cent pieces, tens, fives, ones, or anything else, like a two, but the number of coins that you have is going to sum up to a whole number. You can have one coin, you can have two, you can have three, but you can't have infinite fractions of a coin. You can't have, say, the square root of two coins; that doesn't really make sense. So you have a defined step size: you have one coin, then if you get a second coin you have two, and if you get a third coin you have three. You're going in step sizes of one. So for the number of coins in your wallet, we'd have discrete numerical data.

In this tutorial, we're going to talk about the different types of averages. We're going to see the three different types of averages, which are the mean, the median, and the mode. All right, let's get started. We'll start off with the mean. Now, 
the mean is the typical average that you know, and really, what the mean is, is you sum all of your values up and then divide by the total number of values that you have. The great pros of the mean are that it's very easy to understand and it makes sense: we take everything we have, add it all up, and then divide by how many values we have, and that should give us a good representation of the average. It also takes into account all of the data: since we're adding everything up and then dividing by how much data we have, we're taking every single data point into consideration. Now, there are some problems with this. One of the problems is that the mean may not always be the best description, and we'll see why when we look at examples of when we should use the median and the mode. The mean is also very heavily affected by outliers. Since we're taking everything into consideration, if we have big outliers, that's really going to change what our mean looks like. If we just have normal values between, say, one and five, and all of a sudden we have 10,000 in there, that's really going to affect our mean. So the mean is heavily influenced by outliers, and the bigger the outlier, the more the mean is influenced by it.

All right, so let's see some examples of the mean. We'll go through a worked example first. We can see our data set here, which is just a bunch of numbers. To calculate the mean, we take every single one of these numbers and add them up, and we can see the total result that we get here. Then we take this total, count the number of data points that we have, and divide one by the other, which gives us our mean, as we can see here. So that's an example calculation of the mean, but let's see some example applications. When would we use it? A good application would be if you look at the time it 
takes you to walk to the supermarket. Sometimes you walk a little bit faster and it takes you 20 minutes to get there, sometimes you walk a little bit slower and it takes you 25, but on average it takes you somewhere around 22 or 22 and a half minutes. So if you say, I'm gonna go to the supermarket, you know roughly how much time it's gonna take you to get there.

Another good example of the mean would be the exam score for a class. To get a good understanding of how people do in an exam or in a class, you can look at last year's mean exam score. Since exam scores sit in a fairly small range, the mean is a good choice: you can get anything between 0 and 100, but realistically no one's going to get a zero, so your range is even smaller and you're less affected by outliers. You can also get a feel for how hard a class is gonna be just by comparing means. If you look at two classes that both have a large number of students and one has a higher mean than the other, then you can probably say, hey, it's easier to get a good grade here. It's a simple overview without diving too deep into it.

All right, another good example of the mean would be how much chocolate you need when you get a sweet craving. You're not gonna say, oh, I require one chocolate bar or two chocolate bars. You're gonna say, on average I need maybe three-quarters of a chocolate bar. Sometimes I may want a little more, because when I start eating chocolate I crave it even more, and sometimes the taste doesn't sit right with me and so I have a little less. So if you have this craving, either you say, oh, I'm gonna try to be strong, or you say, well, I know this feeling, and I
know that if I eat about three-quarters of a bar of chocolate, I'm gonna feel good and my craving is gonna be satisfied. So you kind of know what to expect. These are some examples of when we would use the mean.

All right, so let's look at the next thing, which is the median. The median represents the middle value in your data set. If you have an even number of data points, you don't really have a middle value, and in that case the median is the mean of the two middle values: the two middle values added together and then divided by two.

The pros of using the median: it can sometimes be more representative than the mean, and we'll see some examples of this. The median also evenly splits your data. With the mean, an outlier can drag everything so far to the right that all of your data is to the left of the mean and only the outlier is to the right. That would be an extreme case, but it can happen, whereas the median is always located directly in the center of your data. The median also doesn't care much about outliers. Outliers by definition aren't very common, so whether you have them at the beginning or at the end, they're gonna be very few in number, and therefore they barely move the median.

A con, though, is that the median doesn't give you much information on the rest of the data. Sure, you know what's at the center, but you don't know how everything else behaves. So let's see some examples. We'll do a worked example first, where we see our data set here, and we can count how many
values we have as we go from left to right. We've got 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 data points, so we've got an odd number. Our median value, our center value, is going to be the seventh data point, because it's 6 from the beginning and also 6 from the end. That's why our median value here is 26: it's located directly in the center.

Now, what is the median useful for? The median is often used for household incomes in a country, because if you were to use the mean, the billionaires would give you a false description of what a typical household income really is. Say the median household income is $40,000. If you were to use the mean instead, all of the billionaires and millionaires in the country would pull that number up, and you would say the average household income is something like 60k. That's a bad representation, because it doesn't give you a realistic look at what the typical household has. The typical household really is centered at around 40k. Sure, there are people below and there are a few people above, but that's what's in the middle, whereas the mean would give you an inflated household income that isn't representative of the rest of the country.

Another good example of the median would be the distance that people cover to get to work. If you look at this in kilometers, some people walk to work and it's one kilometer at most, and then you can expect people to
travel (most people, anyway) around three kilometers to work. Sure, there are some who travel much further because they want to live outside of the city, and there are some who travel very short distances because they have a house right next to the office, or their house is the office, depending on where they work. But then you can look at the middle: what distance do people typically need to cover to get to work? So that would be another good use of the median.

Another good median value is what you usually spend when you buy a new item of clothing. Sometimes you may go to an expensive clothing store and get a jacket that costs north of a couple hundred euros or dollars, whatever currency you want to use, and sometimes you can go to a secondhand store and get one for very cheap. But usually a jacket costs you maybe a hundred dollars or so. So if you go out, you can expect to pay about $100 without really taking into account which store you're going into, because most of the stores you're gonna visit will have about that price for a jacket. So that would be another good use for the median.

All right, let's look at the third type of average, which is the mode. The mode is the most common value in your data. It's not uniquely defined if several values tie for most common, but if there's only one most frequent value, then that's your mode, and we'll see an example of this in a second.

A pro of using the mode is that it's not only applicable to numerical data. If you look at categories, for example, you can say, hey, we've got five people from the US, two from Canada, and one from France, and you know that the mode is gonna be the US because there are five people from the US. So the mode is a great average
that's not only applicable to numerical data. You can also apply it to categories, or to ordinal data if you wanted. So you can say the most common country that we have, the kind of country that we would expect to see here, is the US. Sure, there are other countries, but the most common one is gonna be the US in this case. The other pro is that the mode lets us see what's most common, what pops up the most. So the mode works well in cases where values recur a lot, which is the case for discrete numbers, for example.

A con of the mode is that, again, it doesn't give you a good understanding of the rest of the data, similar to what we had for the median. It's also not really applicable if almost all of your values are different, because then there isn't really gonna be a mode. You don't want to have thousands of data points where the most frequent value occurs, like, three times. You want to use the mode for situations where values recur often, like we saw in the country example.

Let's see a worked example, and also some other uses for the mode. For the worked example, we again take our data set and count how many times each number appears. If we go through the numbers, we'll see that 26 occurs the most. We've got 22 and 25 that both occur twice, but 26 occurs three times, so 26 is gonna be our mode; it's our most frequent value.

Now, the mode is gonna be useful for things like the peak of a histogram. If you don't know what a histogram is, don't worry, we'll cover that in a later lecture when we go into data visualization, but the
peak of a histogram shows you the mode of the data, the most frequently occurring values.

Another good use of the mode would be employee income at a company. At a company you can again have the boss, who throws off the mean, and you can have higher-level employees too, who shift the median. But if one third of your employees earn minimum wage, or say 40% of them do, the mode is gonna be the best average. (Hopefully not at your company, because that wouldn't be a very good system to have.) If 40% of the employees at the company you're looking at earn minimum wage, then the mode will easily show you that the typical case is to earn minimum wage, because that's the single most common salary. Sure, the boss or the CEO may shift the mean up heavily, and because of the higher-ups the median may well sit too far to the right, so that you don't really capture all these employees who earn the same amount. But that's exactly the description you want, and it's what you get from the mode.

The outcome of an election is another place where you use the mode. Sometimes you may only have two candidates, sometimes three, but say you have five different candidates: the person with the most votes is gonna win the election, so there again you use the mode.

In this lecture we're gonna look at the spread of data. We're going to start off by looking at the terms range and domain, then we're gonna move on to understanding what variance and standard deviation mean, and finally we'll look at covariance as well as correlation.

All right, so let's start off with the range. The range is basically the difference
between the maximum and the minimum value in our data set, so that's fairly simple to think about. Let's go through this with a worked example. Let's set up a company in a town, the only company in the town. The owner of the company earns a salary of 200k a year, and the employees all have different salaries, but the lowest-paid employees, maybe the part-time workers, earn something like 15k a year. So we've got data ranging from 15k up to 200k. Our range is the difference between the maximum and the minimum value in our data, so we take 200k, subtract 15k from it, and get a range of 185k in salary. That's how much our salary can vary: if we start at 15k, it can go all the way up to 200k, so that's a 185k range of salary that people in this company can have.

All right, and the domain is going to be the values that our data points can take on, or the region that our data points lie in. If we look at this example again, our domain starts at 15k and goes up to 200k. So the domain defines starting and ending points, a section that our data lives in. What the domain tells us is that all salaries between 15k and 200k are possible, but within this company it's not possible to have salaries outside of this domain. We can't have a salary of 14k, because that's outside of our domain, and we also can't have a salary of 205k, because again that's outside of our domain.

All right, so let's move on and look at the variance and standard deviation, and we'll talk about the variance first.
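
As a quick recap of the averages and the range covered so far, here is a minimal Python sketch using only the standard library. The numbers are assumptions: the lecture's actual data set is not reproduced in the transcript, so `values` is a hypothetical list built only to match its description (13 data points, 22 and 25 appearing twice, 26 appearing three times), and `salaries_k` is a made-up set of salaries between the 15k and 200k mentioned above.

```python
from statistics import mean, median, mode

# Hypothetical data set, chosen only to match the lecture's description:
# 13 values, with 22 and 25 appearing twice and 26 appearing three times.
values = [18, 20, 22, 22, 25, 25, 26, 26, 26, 29, 31, 34, 40]

avg = mean(values)     # sum of all values divided by the number of values
mid = median(values)   # middle value: the 7th of the 13 sorted values
most = mode(values)    # the single most frequent value

# One large outlier moves the mean a lot but leaves the median where it was.
with_outlier = values + [10_000]
avg_outlier = mean(with_outlier)
mid_outlier = median(with_outlier)

# Hypothetical salaries in thousands; only the 15 and the 200 come from
# the lecture's company example.
salaries_k = [15, 32, 48, 75, 200]
low, high = min(salaries_k), max(salaries_k)
salary_range = high - low  # maximum minus minimum


def in_domain(salary_k):
    """Return True if a salary lies inside the domain [low, high]."""
    return low <= salary_k <= high


print(f"mean={avg:.2f}, median={mid}, mode={most}")
print(f"with outlier: mean={avg_outlier:.2f}, median={mid_outlier}")
print(f"range={salary_range}k, 14k in domain: {in_domain(14)}")
```

With this made-up data the median and the mode both come out as 26, as in the lecture's example, while the mean is sensitive to the single outlier in a way the median is not.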
And what the variance tells us is how much our data differs from the mean value. It looks at each value and at how different that value is from the mean, and from that it calculates the variance. We don't really need to know the formula; it's more important right now just to understand the concept. What the variance really tells us is how much our data fluctuates. If we have a high variance, that means a lot of our values differ greatly from the mean. If we have a low variance, that means a lot of our values are very close to the mean.

Now, if we turn to the standard deviation, the standard deviation is literally just the square root of the variance, so if you understand one, then you also understand the other. We can combine this with the range of our data to get a better feel for our data.

Let's use an example where we have two different countries, A and B, and they have the same mean height for women, which in this case we'll say is 165 centimeters, or 5 feet 4. We'll say that the range of heights is identical too: let's say it's about 30 centimeters, so heights can go anywhere from 150 up to 180 centimeters, or we can even widen that and say from as low as 140 up to two meters. Let's just keep the range the same for both, and they both have the same mean height. Now, if country A has a standard deviation of five centimeters, which is approximately two inches, and country B has a standard deviation of 10 centimeters, which is approximately four inches, then what you can expect, knowing these values, is that in country A the people you're gonna see are gonna be much more similar in height. So our standard deviation is
lower, which means our values differ less from the mean, and so a lot of the women you're gonna see are going to be very close to 165 centimeters, or 5 feet 4, plus or minus 2 inches. So what you can expect when you go to this country is that a lot of the women are gonna be about that height. Whereas country B has a much larger standard deviation, so you can't really expect everyone to be about 5'4, because height fluctuates a lot more. If you go to that country, you can expect to see a lot more women of different heights, both taller and shorter than 5'4. That's how we can use the variance and the standard deviation to give us a little more perspective on our data and allow us to infer some things about it.

All right, so let's talk about covariance and correlation. Covariance already has "variance" in the name, but covariance is measured between two different variables. Let's say we've got me drinking coffee in the morning and my general tiredness. I take these two variables and collect data points: this is how much coffee I drank this morning, and this is how tired I feel this morning. What the covariance does is look at how much one of these values changes when the other one changes. What does that mean? Well, if I drink more coffee, the covariance looks at how much my tiredness changes along with it. So that's what you do with covariance: when one changes, how much does the other thing I'm looking at change with it?

Correlation is very similar to covariance: we normalize the covariance by dividing by the standard deviation of each variable. What that means is we get the covariance for my drinking coffee versus feeling tired, and then
we divide by the standard deviation of me drinking coffee and the standard deviation of me feeling tired. So really what we're doing with the correlation is bringing the covariance down to relative terms that are easier to compare. That's the abstract idea. The important thing to keep in mind is that we're looking at how much one variable changes and how much the other one changes along with it.

All right, so there are different correlation values that we can have, and they can range anywhere between negative 1 and 1, so their domain is between negative 1 and 1. A correlation of 1 means a perfect positive correlation: when one variable goes up, the other goes up. For my coffee example, that would be if the more coffee I have in the morning, the happier I feel. Of course there's going to be a limit, but let's say I only drink up to two cups of coffee, I can drink anything in between, and the more I have, the happier I am about it. That would be a positive correlation: coffee and happiness go up together.

Then, when we get closer to zero, the zero point means no correlation. Anything between zero and one is a positive correlation that's less than perfect, and the closer you get to zero, the weaker it is. We'll see some examples on the next slide. An example for the zero case would be that it doesn't matter how much coffee I drink in the morning, it's not gonna affect the weather; they're unrelated. I could drink one cup of coffee on a sunny day and one cup of coffee on a rainy day, and it's not gonna change the
weather; it's not gonna affect the weather. So they're pretty much uncorrelated.

Then we can also go down into the negative range. The closer we get to negative one, the stronger the negative correlation, and a correlation of exactly negative one means a perfect negative correlation. Here we can take our example of coffee versus tiredness: the more coffee I have, the less tired I'm gonna be, so coffee goes up and tiredness goes down. That's how we can understand correlation, and it comes from the covariance, so it was important to understand the covariance. We usually use the correlation, though, because having divided by the standard deviation of each variable, it's on a relative scale that's much easier to interpret.

Now, there is one thing that's very important to remember, and that's that correlation does not imply causation. Just because two things are correlated does not mean that one causes the other. A good example would be if I live in a climate where it's usually cloudy in the morning and sunny in the afternoon. Every morning when it's cloudy I drink coffee, and then it becomes sunny in the afternoon. Even though me drinking coffee and it becoming sunny may be correlated, me drinking coffee does not cause it to be sunny. The correlation only appears because both happen every day; it does not mean that me drinking coffee results in the weather getting better. A causation would be me drinking coffee and feeling less tired, or me drinking coffee and feeling happy about it because I like the taste. So that's an important thing to keep in mind: just because things are correlated does not mean that one causes the other.

All right, so let's see these things on a graph. Here we have the examples that we've talked about again, but we can see what the data would look like for different types of
correlations. We can see the perfect correlation of one on the left side, where one value goes up and the other goes up with it, and we get this really nice straight line. Then the closer we get to zero, the less correlation there is between them and the more scatter we have in the data. You'll notice that for the case of perfect correlation, which is the one, or the case of perfect anti-correlation, which is the minus one (where we had the example of more coffee, less tired), we have a very nice thin line and our data doesn't jump around a lot. But the closer we get to zero, the less we can see one variable following the other and the more we see our data spread out. So that's what correlation looks like in terms of graphs.

In this tutorial we're gonna go through quantiles and percentiles. All right, so let's get started. What are quantiles? Quantiles allow us to split our data into regions that, if we're dealing with probability, all have the same probability of occurring, or, if we're just dealing with amounts of data, all contain equal amounts of data. So that's what we do with quantiles: we split everything up so that every piece holds the same amount of data.

An example of a quantile is something known as a quartile, which is when we split our data into four equal regions, hence the name quartile. So quantile is the general name for this splitting procedure, and quartile means we're doing quantiles with four equal regions. This is something you'd often see on university admissions pages, where they say the top 25 percent of our applicants have a test score of at least 90 percent, and then they would say the bottom 25 percent of applicants, or
admitted students or something like that, have a test score of at most, I don't know, 70 or 75 percent, and then the median test score is 85 percent. So that's how you would go about quartiles: you have the lower 25 percent, then 25 to 50, then 50 to 75, and then the top 25 percent, from 75 to 100. You've got these four equal regions, with your minimum value at the very bottom, your maximum at the very top, and your median in the middle. Because you're splitting the data into four equal regions, the value that separates the second quartile, from 25 to 50, from the third quartile, from 50 to 75, is the median.

All right, and now percentiles, which is a name you've probably heard before. Percentiles are again an example of a quantile, but instead of splitting into four regions as with quartiles, we split into 100 equal segments, hence the "percent" at the beginning of the name. Percent means out of 100, and that's the same reasoning this name comes from. An example of this is often used in test scores. If you've ever taken something like the SATs, then you get a test score but you also get a percentile, and the reason that's done is to judge not you versus the test but you versus everyone else. If it's a difficult test, then getting a test score of 60% while being in the 95th percentile means your score is actually a lot better than it looks. And so what you can say with percentiles, for example, is that the percentile that
you're in tells you what percentage of the other people you did better than. For example, if you reach the 99th percentile, that means you did better than 99% of the people who took the test, and the 95th percentile means you did better than 95% of them. That's why percentiles are often used for tests, and they're often used for normalization, because they take into consideration factors like whether it's a difficult test, or an easier test where more people are scoring higher. They don't judge you directly against the test; they normalize you against everyone else who took it. You take the test, you get a score, and then the percentile checks where that score lies relative to everyone else. So percentiles give you a good normalization and allow you to do great comparisons, because they eliminate some of these factors like test difficulty. Of course, there can always be luck involved, and that may not get filtered out on an individual basis, but it does if you do this for a lot of students. That's also why in these big standardized tests you get a percentile along with your score: if your score is lower but the test was really hard, you can still see that you did really well, because other people found this test even harder than you did.

In this tutorial we're going to talk about the importance of data visualization. First we're gonna look at the role that the computer plays for us and what the computer is actually made for. Then we're gonna look at what role the human should play in data science, then we're going to look at presenting data, and finally we'll talk about interpreting data. All right, so let's get started and talk about the role that the computer
plays. Now, a computer is much, much faster at calculating than a human, because that's what it's made for: crunching numbers and doing fast calculations. If you think about how fast computers are, they're in the gigahertz range, and giga means billions, so they do billions of things every second. That makes them really good at repetitive things. We can give them logical tasks in terms of programming, we give them a structure, and they just do it, over and over again. They're not gonna mess up, they won't get tired of it, and they're really fast at it. So that's the role the computer should play for you: it should be a means to get the heavy number crunching done. There's really no need for you to work out all this complicated math by hand, because your computer can do it much better and much faster than you, and it's also less prone to error if you code it correctly. That's the only part where you come in: it's only gonna mess up if you mess up, but generally our computer does exactly what we tell it to do, and it's really fast at it.

Now, what role should a human play in data science? Well, humans have naturally developed to identify patterns, and we did this first for survival: if we're walking around somewhere and a big predator is hiding, we can identify the pattern of the predator and pick it out even though it's trying to camouflage itself. So humans by nature have become very, very good at identifying patterns. You can also see this if you look at the clouds and see animal shapes or other things. Those patterns aren't actually there, but humans have become so good at identifying patterns that we see things in many,
many places. So that's what humans are really good at: we're able to look at things and pick out patterns. Another thing humans are really good at is being creative, and alongside our creativity we can also use memory, bring in outside knowledge, and use a general understanding. These are all things that computers can't do. So computers are a means of getting stuff to us, but once it's there, it's our job to use our pattern recognition abilities. Of course, you can train machine learning algorithms for specific patterns or specific cases later on and make them really good at that. But generally, if you don't know exactly what's gonna come, your first step as a data scientist would be to try to identify these patterns: use your creativity, use your memory, use all of these different things that make you human, things that a computer just doesn't have access to, and apply all of that to the data.

Okay, so considering all this, the best way to do it is through data visualization. You can't just show spreadsheets with a bunch of numbers; that won't really help you, because it's really hard to pick out patterns by looking at numbers. The best way is to plot the values. If we have these visuals in front of us, we can see things go up and down, we can see them fluctuating, we can see them make very thin lines. We can just look at a graph and see things. Of course, we need a little bit of practice to understand what a graph is trying to tell us, but once we understand graphs in general, we can look at new graphs and just see things. We can start to see patterns, and they may not always be real, but that doesn't mean we can't pick them out, and then later on you would
also do some testing to see if those patterns are real and if they make sense. But generally data visualization is very good for this, because it allows you to invoke all of your human characteristics, the things we talked about on the last slide that the computer can't do.

In one sense data visualization is for you, so that you can see these things, try to pick them out, and use them later on. But it's also for when you're trying to show these things to other people. Maybe you have to do a presentation or a summary, and then you want to make sure that your data visualizations are good, because the people who are going to be looking at them are much less trained in looking at and analyzing data than you are. If you try to convey a message to them and just show them a big spreadsheet with numbers and point out, here, look, these numbers pop up, they're gonna be like, what are you talking about? So that's why it's really important to have really good data visualization skills: one part is to enable you to do your job, and the other part is to show it to other people and help you convey information to them.

Of course, we talked about statistical values, and they're very important: they can give us a good idea about the data and what's going on inside of it. But visualizing data takes it to the next level, and statistical values alone aren't enough. They can help us, support us, and give us ideas, but if we really want to understand what's going on, sometimes we just have to take a look. It's also important to make sure you choose the right visualizations, because otherwise the result may just look extremely weird. But just this
skill of being able to present data, both for yourself and for other people, is very, very important for a data scientist.

Then we go over to interpreting data. We've kind of touched on this in the last section already, but data visualization really allows you to see the data and apply some reasoning to the system. When you look at data, either you see something, which is great, because it means you can test whether it's actually there, or you don't see anything, and that also tells you something: there isn't anything obvious going on. There may be something underlying that's more complicated, but nothing that's obvious to the viewer. All of this allows you to analyze your data much more easily and to prepare what you're going to do after that. So data visualization really gives you a deep understanding of what's going on with your data. When we interpret the data and look at these visualizations, maybe we see dips and maybe we see some hills somewhere, and we can try to understand all of this by bringing in our outside knowledge. So again, this is what the human is really good at: we can bring in the context of things. Maybe people are going out to lunch here, and that's why activity decreases, or maybe everyone is coming to work in the morning, and that's why activity increases compared to, say, 6 a.m.
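
To make the lunch-dip idea a bit more concrete, here is a minimal sketch, assuming matplotlib is installed. The hourly activity numbers are made up purely for illustration, and `local_dips` is a hypothetical helper, not something from the tutorial. The code only marks where the dips are; deciding that a dip at noon means "people went to lunch" is still the human part.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical hourly activity counts (made up for illustration only).
hours = list(range(6, 19))  # 6 a.m. to 6 p.m.
activity = [5, 12, 30, 48, 52, 50, 28, 31, 49, 51, 47, 35, 20]


def local_dips(values):
    """Return the indices whose value is lower than both neighbours."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    ]


dip_hours = [hours[i] for i in local_dips(activity)]

fig, ax = plt.subplots()
ax.plot(hours, activity, marker="o")
for h in dip_hours:
    # Mark each dip so it is easy to spot and then interpret with context.
    ax.axvline(h, linestyle="--", color="gray")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Activity (made-up units)")
ax.set_title("Activity over a working day")
fig.savefig("activity_by_hour.png")
```

With these particular numbers the only local dip is at hour 12, while the climb between 6 and 10 is the "everyone is coming to work" part of the curve.
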
So we can bring in all of this context and all of this understanding to try to interpret the data and better understand what's going on. Then, of course, we're hopefully going to see some trends or patterns. Like I said, these may not always be there. We're actually so good at pattern recognition that we sometimes see patterns that aren't really there. A good example of this is looking at the clouds in the sky: you may see animal shapes, but they're not really there; that's just our minds identifying patterns. So that's pretty much why data visualization is so important to a data scientist. This whole human aspect is key in data science and in data analytics: being able to understand what's in front of you, to bring in outside knowledge, to contextualize, to be creative. That's really key to being a good data scientist. A computer can help you with all of this: it can do the number crunching, it can set up the visualizations, and it can plot whatever you want. But ultimately it's up to you to choose the right visualizations, to look at the data, and to communicate the visualization as well. All of those things are up to you, and that's why the human is so important in data science.

In this tutorial we're going to look at one-variable graphs. We're going to see some of the types of graphs we can make, following on from our last tutorial, where we looked at the importance of data visualization. So now we're going to go into data visualization and look at the types of graphs that you may want to choose from. The one-variable graphs we're going to look at are histograms, bar plots, and pie charts. Let's get started with histograms. Now, we
can see an example of a histogram on the right. What's really cool about histograms is that they show us the distribution of the data across all the values in our data. A histogram shows us what happens the least and what happens the most. Histograms let us see where our data is concentrated and how it's distributed, and through this they show a kind of general behavior. Really, what a histogram does is look at each value and how often that value has occurred. What we see here, for example, is that around 0 we have the most occurrences of whatever value we're looking at, and as we move to the left and to the right, the values start to drop off, so they become less frequent. That's what the histogram shows us: a kind of frequency, how often these things occur.

There are different types of histograms that you can encounter. Generally, a histogram is just a plot of frequency versus value, but there are different ways a histogram can look. One of them is the one we've just seen, which is a normal distribution. It's also called a Gaussian-like histogram because it follows the Gaussian, or normal, distribution. But we can also have something like an exponentially decaying histogram, where we start off very high and the frequency decreases as we get further away from that initial value. You can compare that to the normal distribution: the normal distribution looks more like a bell, going up and then curving down slowly, whereas the exponential drops off very fast and then slows down later on. So they do have different behaviors. And of course we can also get not just one peak, like we see in this first case with the Gaussian-like distribution, but also things like two
peaks, or even three peaks or more, and we can have very wide, extended peaks. So histograms are a means of showing us how the data is distributed, which things occur most frequently, and where our data is concentrated. But that doesn't mean they have to have a specific shape. There are many different shapes that our histograms can take on, and the shape we get tells us something about our data.

All right, so the next one-variable plot we'll look at is the bar plot. Bar plots may look a little bit similar to histograms at first, but they are very different, because bar plots allow us to compare across different groups. That's what we see on the x-axis down there: different groups. We use the same variable, but we compare that variable over different groups. If we look at the example on the right, we see different countries, and what we show is the average income tax. We see that country B, for example, has the highest average income tax, whereas country D has the lowest. So we're still only looking at the income tax variable, but we're able to compare it over different groups, or different categories if you will. Other examples would be control groups and test groups. If you're doing a medical study or a psychology study, you always want to have your control group, and then you can have different types of test groups. You can plot each of these groups in a bar plot, looking at the same variable and how it changes over the different groups. Another example would be comparing male versus female heights: you've got one group that's male, the other group that's female,
and you can just plot their average height. And then there's the income tax of different countries, which is what we've seen on the right over here.

All right, so the last one-variable graph we're going to look at is the pie chart. What pie charts allow us to do is section up our data and split it into percentages, and because of this we can see what our data is made up of. The whole pie corresponds to a hundred percent, and we cut it into different slices. Through that slicing, and hopefully also color coding like we've done here, and most definitely labeling so that you know which slice corresponds to which value, we're able to see what categories our data is made up of. We can see what is most prominent, and we can also see what is least prominent. Here we can also see distributions, not as well as in a histogram, but in terms of dominance: how many groups are there, is the data spread evenly, or is it heavily concentrated in one part of the pie? That's what we're able to do with pie charts: we get this nice group overview of one variable.

One example would be the ethnicity distribution in a university. You can have a pie chart where each slice represents a different ethnicity, and the size of the slice depends on what percentage of the total university profile that group makes up. So you can see the dominance of some ethnicities as well as minorities, and just by counting the slices you can see how many different ethnicity groups there are. Another example would be splitting up star reviews for a product. Rather than looking at the average star review, you can also just use a pie
chart and see how many of the reviews are 5 stars, how many are 4 stars, 3, 2, and 1. There you again get this nice, different overview of how the review system is working.

Now we're going to talk about two-variable graphs. The graphs we're going to look at are scatter plots, line graphs, 2D (two-dimensional) histograms, and box-and-whisker plots. All right, let's start off with scatter plots. For a scatter plot, what we're doing is really scattering all of our data points onto a graph: for pretty much every data point we have, we put a little dot on the graph. Scatter plots are great because they allow us to see the spread of data between two variables. We're always plotting one variable on the x-axis and another variable on the y-axis, and this allows us to see how the data is distributed for these two variables. Through that we can see denser areas and sparser areas, and we can also look at correlations. Maybe you remember that in the lecture we talked about correlations; through scatter plots we were able to see where there were correlations and where there weren't. That's what scatter plots are really nice for. We can also use scatter plots to show little clusters, like we see here. Not everything needs to be connected by a line or a curve; maybe something is more like a circle, and scatter plots can show us that too. They can show us these groupings. We see one cluster here, but maybe you have bigger plots with, say, ten little groupings for different things. So scatter plots are really great for that, because they just show us where the data points are located for these two variables, and then we can see for ourselves what they look like. Does one variable affect the other? Are there maybe certain
groupings that we can see? Where are the dense areas, where is it sparse, where are things concentrated? Is everything spread all over the place, or is it very narrow and only in a specific region? Scatter plots allow us to see all of these things very easily.

Some examples where we could use scatter plots: if we look at the graph on the right, we can look at something like car price versus the number of cars sold. Each of these data points pretty much represents a car: the x-axis tells us the price that car has been sold at, and the y-axis tells us the number of cars that have been sold at this price. What we see here very easily is that the higher the car is priced, the less it gets sold. You can think about why: maybe people don't want to buy such an expensive car; maybe they found a cheaper version of it; maybe it's just a branding thing that makes it more expensive and there's something of just as good quality that's cheaper; or maybe people just don't have enough money, which is probably a big factor. That's why sales drop off. It may look a little bit different in terms of profits, but the higher the car is priced, the less we see it being sold. So that's one example of a scatter plot.

Something else we can look at is income versus years of education. On the x-axis we would look at how many years someone has been educated, and on the y-axis their current income, and that would be one point on the graph. We can do that for many different people and then see how education affects current income across different people. So that's another case where we can use a scatter plot. We can also go back to one of the examples that we used very early on, where we talked about people
traveling to work, and we can plot the distance traveled versus the time it takes to travel to work. Then we can see that maybe some people travel faster; it could be that some people travel the same distance but one takes longer than the other, because one goes by car, another goes by bike, and another takes public transport. All of that we can see in these scatter plots, taking into account these different situations and seeing how it all looks for our data in general. So scatter plots are really great as a first go-to for identifying trends, identifying regions, and giving a good overview of your data.

The next thing we'll look at is line plots. Line plots are in some sense similar to scatter plots: we have the same basis of the x and the y axis, but the points are connected. It's very important to know when to choose line plots and when to choose scatter plots. Line plots can carry a lot of advantages, because this connectedness makes it very easy for us to see trends. We can see where the lines go instead of trying to connect the points in our head; that's exactly what a line plot does, it connects the dots for us. It's great if we want to see an evolution of something: maybe an evolution over time, over space, or across people. If our data points are connected, so that whatever happened before is connected to what happens next, it's great to use a line plot, because line plots show us how things evolve. But with scatter plots we just plot points individually. If we go back to our cars sold versus
car price example: if we look at an expensive car that's been bought, say, five times, and a cheaper car that's been bought a hundred times, there isn't really a logical connection to make between the two. If we were to use line plots where we should use scatter plots, what we'd see is just a bunch of lines all over the place. That's why it's important to know when to use line plots and when to use scatter plots. If you use a scatter plot instead of a line plot, it's going to be a bit more confusing, because you have to try to connect the dots yourself in your head. But if you use a line plot instead of a scatter plot, it's going to look really weird, because there are just lines all over the place and you can't really see anything.

An example where we could use line plots is the typical distance versus time: you look at what time it is and how far someone has traveled. A general curve of distance versus time is very common. You can also look at the profit of a company versus the number of employees: the more people they employ, how does that change their profits? Of course they have to pay more employees, but the employees can also do more work, and hopefully that more than cancels out what you pay them and increases the company's profits. And what we can see on the right here is creativity and how it changes with stress. You can see that the more stressed out you are, the less creative you are. Here it's also good to use a line plot because you gradually advance in stress, so each point in stress is related to the next, and the higher you go in stress, the lower you go in creativity. There's this relation where we can see an evolution: the more you get stressed out, the less creative you become.
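
The scatter-versus-line choice can be sketched side by side, assuming matplotlib is installed. All of the numbers below are made up for illustration and are not the data behind the graphs in the tutorial. The car records are unordered and independent, so they get dots; the stress levels increase gradually, so connecting them makes sense.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data for illustration only.
car_price = [30, 12, 55, 20, 80, 15, 40, 65]   # price in thousands, unordered
cars_sold = [60, 110, 25, 90, 5, 100, 45, 15]  # number sold at that price

stress = [0, 1, 2, 3, 4, 5, 6, 7]              # ordered, gradually increasing
creativity = [9.0, 8.5, 7.8, 6.9, 5.5, 4.2, 3.0, 2.1]

fig, (ax_scatter, ax_line) = plt.subplots(1, 2, figsize=(10, 4))

# Independent records: one dot per car, no connecting line.
ax_scatter.scatter(car_price, cars_sold)
ax_scatter.set_xlabel("Car price (thousands)")
ax_scatter.set_ylabel("Cars sold")
ax_scatter.set_title("Scatter plot: independent points")

# Ordered, related points: connect them so the evolution is easy to follow.
ax_line.plot(stress, creativity)
ax_line.set_xlabel("Stress level")
ax_line.set_ylabel("Creativity")
ax_line.set_title("Line plot: connected points")

fig.tight_layout()
fig.savefig("scatter_vs_line.png")
```

Swapping `scatter` for `plot` on the car data would connect the points in the order they appear in the list, which is one way to produce the "lines all over the place" effect described above.
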
And so line plots are really nice here, because there's not this chaotic movement everywhere; it's very easy to see this line and very easy to follow.

Okay, so the next graph we can talk about is the two-dimensional histogram. We've seen one-dimensional histograms in the last tutorial, where we looked at the spread of data, the peaks, and how things were distributed to the right and to the left. But we can also do a two-dimensional histogram. A two-dimensional histogram is pretty much a one-dimensional histogram for every single value of the other variable we're looking at. What this allows us to see is how the distributions of the two variables relate to one another. We can see here, for example, in the red region, that those specific values happen a lot, so that combination of values happens a lot. We're able to pinpoint these frequent occurrences again, and we're also able to look at drop-offs, but now we can pinpoint them to two specific values rather than just one, which is what we did with the one-dimensional histogram. These things are much harder to see in scatter plots, because in a scatter plot, if we have a value occurring a hundred times, it would just be the same dot, and the dot wouldn't get bigger. Of course, you can make the dot bigger yourself if you want to, or change its color or something like that, but really, if you do a scatter plot and the same thing happens a hundred times, it's just going to look like one dot. With two-dimensional histograms, we can see that it's not just happening once; we can actually see the frequency of those two variables together.

An example of a two-dimensional histogram would be ticket prices versus tickets sold. If you look at the lower left corner, we can see this red peak,
so that's cheaper ticket prices, but the tickets are also sold often. We know that tickets at that price are sold quite often. These could be new, rising bands, or your standard bands, where maybe you want to take someone on a date and you don't want to spend much money on a ticket, but a concert is still a nice idea. That's a good ticket price that sells a lot of tickets, because it gives you the pleasure of the event without making it too expensive. Then if you move towards higher ticket prices and towards more tickets sold, which would be these big bands, we can again see how many tickets have been sold. If you want to see lots of tickets sold for a high price, the red peaks there are going to give us the more famous artists. So that's one kind of application, but of course there are many better ones. It's just that once you're in the moment, you would realize, "Oh, this is when a two-dimensional histogram would be a great thing for me to use." A lot of these graphs are great to know, and once you're in the moment it's much easier to pick out which graph would be the best representation.

Finally, the last graph we're going to look at is the box-and-whisker plot. What box-and-whisker plots allow us to do is see the spread within our data. It's not like a bar plot, which just shows us one value; we can actually see the statistical spread. We can see median values, which is what we see here, we can see quartiles, and the little dots on the outside show us outliers. So box-and-whisker plots allow us to see the statistical information, but visually, and that makes
comparing across different groups, which is what we're doing here, much easier. A good example of that would be ticket prices for football games for different teams. Different teams use different stadiums and have different popularities, and ticket prices for some teams may be much more expensive than for others. We can compare these ticket prices using box-and-whisker plots. We can see the higher end, which is going to be the more luxurious seats, and at the bottom the less luxurious seats, probably the ones where you stand. Then you have the middle values for the standard seats, depending on where you are in the stadium: close to the field, or further away but still sitting. All of these things we can see here, and that's what gives us a spread. We can compare that spread across different teams, and we can also see which teams are more expensive and where the prices vary the most. Maybe some teams have a super lounge and also have standing places that are much cheaper, so you would see a much larger spread; or maybe some teams only have seats, and you'd see a much smaller spread. All of these things we're able to compare over different groups using box-and-whisker plots.

In this tutorial we're going to talk about graphs with three or more variables. The graphs we're going to look at are heat maps and multi-variable bar plots, as well as how we can add more variables to some of the lower-dimensional graphs that we've talked about earlier. All right, let's start with heat maps. What heat maps allow us to do is plot two variables against each other on the x and the y axes, and they allow us to
show an intensity, a size, or something like that in the z direction, or towards us. An example of this, which is what I've tried to illustrate on the right, is a customer moving through a store. We can track the path of the customer in the x and y directions of the store, so you get this bird's-eye view and see where they move to, and the darker spots tell us the positions where they spent more time. We can see that they spent a little bit of time at the beginning; they moved in and then they stopped for a while, which is what we see with the dark spot. Maybe they found the candy aisle and there was a specific piece of candy that they wanted. Then they moved on and started to go around the corner a little bit, and maybe they reached the fruit and vegetable section there and picked out several things. Then they started to head towards the checkout counter at the very end, moving at a more constant pace; sometimes they stopped to look a little bit, but they mostly just continued moving. So the three variables we've shown here are their x position in the store, their y position in the store, and, through the color, the time they spent at each position. That's what we can use heat maps for.

Another example of a heat map would be if you take a flashlight and move it over a screen: what you're showing is the amount of time that you've shone the flashlight onto a specific region. Heat maps are very often used for tracking positions, for things like tracking customers through stores or tracking people's locations in general and where they like to spend their time, and the intensity you see in the color is usually the amount of time they spent there.

All right, so we can also
do multi-variable bar plots. A multi-variable bar plot is very similar to a single bar plot, where we just plotted one value over different groups, but rather than plotting just one, we cram several together. An example of this would be plotting goals scored, shots taken on goal, and shots on target for different teams. We can see that maybe there are teams that score less, but that's because they also shoot less and therefore also have fewer shots on target. Or maybe there are some teams that do score a lot, and that's because they shoot a bunch, but they just don't hit the target that often. Or maybe there are really good teams that score a lot and also shoot a lot on target. All of these things we're then able to compare over different groups. That's what we can use multi-variable bar plots for: if there are several variables that would give us a better understanding of the system than looking at the variables one at a time, we can plot them on the same bar plot. Then we can see how they change within a group, and we can also see how they change over different groups.

Okay, something else we can do is add extra dimensions to the lower-dimensional graphs we've had. We're limited to three dimensions, because that's the number of spatial dimensions we live in. If we take a scatter plot, for example, where we started off with just the x and the y axis and points located on them, we can add a third axis: we take the x and the y and add a z, and that gives us an extra depth dimension, which is exactly what we see here. Rather than plotting on a two-dimensional field, on a plane, we're actually plotting in a volume. So we can see this kind of scattered ball that
we've plotted here, which is located at the center of our plot. This can be really cool because it allows us to see depth. The problem is that we only ever have snapshots, so really we're looking at two-dimensional snapshots. To get the best understanding, we need to rotate our plots as we make them, so that we can add in our depth perception. Right now it may look three-dimensional, but really it's just a two-dimensional snapshot. To understand whether our scatter plot is located more towards us and to the left, or whether it's really high and close to us, or really low and far away, we need to be able to rotate it so that we can see it from different angles, which then gives us this depth perception.

We can do the same thing with 3D line graphs. Here we see an example of maybe the position of a skier as they're skiing down a hill, traced through time. We see that they're going down the hill in this nice zigzag motion, as you should, and we can track their position over time. So here we've added an extra dimension to the line graph: rather than just taking a time and one position, we've added a second or actually even a third position. We've got the x, the y, and the z position, and we trace it over time, and that gives us this whole line here. So that's how we can take the lower-dimensional plots we've looked at before and add extra dimensions to them if we want, as long as it's still easy to see and it still makes sense what we're looking at. We're really just able to slap on another direction there and compare another variable.

In this tutorial we're going to touch on the
third major section that is really great for data scientists, or that should be an essential for data scientists, which is the ability to program. So why do we program? There are different reasons why we want to be able to program. The first is the ease of automation, the second is the ability to customize, and finally there are many great external libraries for us to use that make our job so much easier.

So let's get started and talk about the ease of automation. What do I mean by that? Being able to program allows us to prototype really fast and to automate things, and it gives us the extra benefit that if we have something in our mind, we can take that and put it into the computer by programming it. We're able to automate everything very fast, and we don't have to do repetitive tasks like copy-pasting stuff into or from Excel. If we want to repeat something, or quickly change a small thing, we don't have to do a lot of work; we can just change it in our code, click play, and let the computer take care of it for us rather than doing everything manually. It's also very easy to create reports automatically. All you have to do is set up your program to deal with the data you're going to give it, and then it can automatically create reports every week. The reports will be different because you give it different data; the layout should still look the same, but the values can be different. You don't have to do that all yourself; the program does it for you. But you've built the program and you're giving it the data, so you're still doing all of the analysis. It's just that you get to
skip the part of copy-pasting, looking across, taking over the values, and doing all the formatting of the same report over and over again. All of that is taken care of for you. All you have to do is put in the right data, write out everything that you want to do, and then click play and let the computer handle it, because remember, that's what the computer is good at: doing these repetitive tasks.

We also want to be able to program because it allows us to customize. Once we go into data analysis and see things, we get ideas that we want to expand on, or different directions we want to take our analysis in. Being able to program allows us to take all that, put it into code, and choose that direction. We can very easily dive much deeper into our analysis and discover things fast, because it's up to us where we want to go. This ability to customize with programming is very important because we're not reliant on anything else. We're not reliant on some software that maybe breaks down, or that we don't know how to use perfectly so we have to read the manual or a help section. No, we know how to program, and we just type exactly what we want to do, where we want to take it, and what we want to see. We can customize very fast with that, and we can also prototype very fast. If a visualization isn't working, turning a scatter plot into a line plot is very easy; you just change one word. All of these things are very easy to do with programming because we have all that power at our fingertips. We can change everything that we're looking at and everything that's being calculated. Maybe we want to calculate an extra thing and take out something else because it's irrelevant. All of these things we're able to customize, and
all of that we can do because we're able to program. So really what we're doing is making the data ours: we're taking full control of the data, of where we want to go with our analysis, and of what we want to see and show.

All right, so let's first talk about libraries, and then I'll give you two great Python libraries that you should feel comfortable with, or at least consider using, for data analysis. First of all, what are libraries? Libraries are pieces of code that have been pre-written by others and that you can just bring in and use. A very good example is a math library, which has the square root function, taking powers, the exponential, the sine, the cosine: all of these things that you know and want to use but don't want to program yourself. It avoids the middle step of having to program the equation to calculate a sine. We don't want to get distracted from our target; we want to do exactly what we set out to do without having to program completely unrelated stuff. That's what libraries are great for. They're developed by the community for everyone to use, everyone is helping each other, and these libraries bring a lot of power with them.

One of these libraries is called pandas. Pandas is pretty much like Excel, except that we can program with it, which makes it so much better because we can do things so fast and we get all this customization and automation. If you give Excel too much to run, it can start to crash, because it also has to handle all these other visual things like the UI, and it's not as structured. Whereas in programming, your
computer just goes through everything step by step. It doesn't have to take care of all these visual things; it just does the calculations underneath. But we can still do all sorts of data management with pandas: we can shift our data around, drop columns, split things up by row, pick out certain rows, and even do statistical calculations on our data. We can say, hey, calculate the mean for this, and we don't have to write our own formula for the mean, the standard deviation, or the correlation between different columns. All of that can be done with pandas with just a couple of keywords. So it's really easy to do data analysis with it, because the functions are already there: we know exactly what we want to do, but we don't have to write the code for all of it. If we want to look at correlations, we just say, hey pandas, do correlations, rather than coding that whole algorithm ourselves. That makes it really easy and fast to get results, because you can skip the middle step of writing all of those algorithms yourself and just use them. You have your start, you have your idea, you know exactly what you want to do, and you can do exactly that to get to your goal.

The other library that is very cool is matplotlib, which is what I use a lot for data visualization. It allows me to create graphs and visualize my data, and it allows a lot of customization, so I can really move everything around: I can move my spines, I can turn things on and off. All of these things are very easy to do with matplotlib. So these are two basic Python
libraries that you should probably get to know, or you can look at some of my other courses. One of them, pandas, deals with the data analysis part, and matplotlib helps you with the data visualization part.

So that's it: a super basic breakdown of the three main components of the otherwise vague term data science. If any of this has piqued your interest, then you may have a data science future ahead of you, and I encourage you to continue to pursue your interest. If you want to learn more from me, I've got a blog on my website, codingwithmax.com, that dives more into different topics related to data science. You can also get access to some of the resources, such as cheat sheets and workbooks, that I've compiled for you there. If you're serious about learning data science, you can also check out my courses on data science, which are designed to teach you all you need to know even if you have no prior experience. Of course, if you have any questions that aren't answered on my website, you can always feel free to reach out to me personally.

Hey everyone, and welcome to my mini course on the essentials of data science. This mini course provides a super basic look into data science: what it is and the three main components that make it up. Data science is a very mainstream term; it's thrown around a lot, but its actual definition is quite vague. This mini course is designed to help those of you who are curious about data science develop a better and more specific understanding of the topic. There are definitely more advanced techniques within data science, such as machine learning, but even these can be traced back to the three essential components that we'll cover.

Before we get straight into it, I thought I'd quickly introduce myself. My name is Max, and I work as a data scientist. After getting my degree in physics, I found myself more and more drawn into the world of data science, so instead of diving into the realm of physics
research, I taught myself all the tools and techniques a data scientist needs, and shortly after landed my dream data science job. I've since also started teaching data science to others, and have been fortunate enough to teach what is currently over 9,000 students the skills I've gathered and learned over the past five years of my data science journey. So let's jump right into it.

So what is data science? You can summarize it in different ways, but the main part of it is transforming data into information. This is a really big step, because a lot of people talk about data and big data, but data by itself isn't really that useful until you can turn it into information. If you just have a bunch of numbers appearing somewhere, and there's so much of it, no one can make sense of that. That's where you need a data scientist to take all of this vagueness and noise and extract information from it.

How you get this information is through analyzing your data. A big part of that is cleaning things up and doing some processing, and then you analyze once things are clean. Through this analysis you see trends, patterns, and, hopefully, all types of correlations, and all of these build up into this turning-data-into-information component.

Ultimately, you also need to contextualize everything that you have. Your computer can't do that for you. Your computer can crunch the numbers, but it's your responsibility to make sense of what's in front of you. Even if you see something, you don't just blindly trust it; you need to understand: where am I at,
where am I coming from, where is this data coming from? You need to be able to contextualize these things, and then of course be able to apply as well as understand them. Having data is great, but turning it into information that you can use and directly apply, that's where the real power lies, and that's also the role of a data scientist. So that's pretty much what data science is.

So what does a data scientist do? We've already talked about this a little bit, but let's go over it again with more concrete examples. A data scientist would, for example, get and process raw data and convert it into something a little cleaner. Imagine a data stream coming in: you have a measuring device that is constantly measuring all sorts of data, and because nothing is really constant, everything will be fluctuating up and down. A data scientist would take all of this data and clean it up a little bit, maybe reducing the fluctuation that isn't supposed to be there, that's just background noise, and then put it into a format that you can easily plot.

That brings us to the next point. Once the data is cleaner, you can start doing some calculations on it, figuring out the core statistical components, like the average values: what am I really dealing with? You're getting a first look at, and a first understanding of, what it actually is that you're tackling. Once you have this understanding, you can start to do some visualizations, which help you as a data scientist see some trends or patterns already. But visualizations are also really key because they let you show your findings to other people, and they are a great means of communication. So they help both you as a data
scientist as well as helping others when you try to convey this information to them.

And then finally, you have to suggest some applications of the information. It's not really enough to just look at it and say, yeah, I see it goes up and down. What does that mean? How does this transfer into something useful? That's also one of the key roles of a data scientist: transferring information into knowledge. So you've got this data-into-information step, but you also need to transfer this information into knowledge, and those are two really powerful things that are worth a lot. That's pretty much what a data scientist focuses on. You can go further and take this data and do machine learning with it, if you really understand what's going on or if you have some hypotheses about what could happen. But ultimately, turning data into information and then into knowledge, that's your role.

All right, so let's go into the essential components of data science. The first essential component, and we touched on this already, is statistics. We're going to cover this later on, but let's give a quick rundown. In statistics, you need to understand the different data types that you can encounter. We'll go into more detail later, but it's not just that you get a bunch of numbers. Data can come in many different forms depending on the field that you're in, so you need to be aware that data may not always be a direct number.

Then, of course, you need to understand some key statistical terms, like the different types of averages, and also understand fluctuations in data. The reason this is important is that these key statistical terms give you an overview
of how this data is behaving, and depending on how the data is behaving you may want to approach it differently. If you know that your data is very clean, with very little fluctuation, then when you visualize things or fit some curves to it, you can probably trust what's going on. But if there's a lot of fluctuation in your data, visualizing it is going to be much more difficult, because you just see jumps everywhere and you're not really sure which of this is actually true and which is caused by some interference somewhere, or by someone having messed with your system. All of these things will be hinted at through statistical terms, so it's good to be comfortable with them and to be able to get some meaning out of them.

And then finally, it's important in statistics to be able to split up, group, or segment data points. When you have a big data set, you want to be able to split it up into smaller pieces, compare different regions, look at some things in more detail, and maybe isolate a couple of components, because these things are probably going to be important and the rest you don't care about that much. So it's about being able to pinpoint, isolate, and work with the data a little bit. These are the statistical components that we're going to look into.

All right, so the next big thing, and we've already talked about this too, is data visualization. We'll see why data visualization is a really key skill for data scientists, and we'll also be covering different types of graphs that you can use and how you can compare different numbers of variables. For example, you can have one-variable graphs, where you only look at one thing and want to see how it changes. You have your typical two-variable graphs, which you probably
know, where you have an x- and a y-axis and you can see how two variables relate to each other. Or you can have three-variable or even higher-variable graphs, where you plot maybe three different things, or even more as long as it makes sense, next to each other so that you can compare multiple things at the same time.

All right, and now we come to the other big thing that you're probably going to need as a data scientist, which is the ability to program. Not every data scientist can do this, but in my opinion it is really essential to your role, because knowing how to program is going to make your life so much easier. If you know how to program, you can take your ideas and your thoughts and put them into action on the computer. You can automate everything, you can customize things, you can explore, you can prototype, you can test. You're not reliant on some application: you don't have to master it, and if it doesn't work or one feature isn't there, you don't have to contact customer support, find out that maybe it's not even possible, and then wait for an update or a bug fix. With programming, you're so much more reliant on yourself, and you can really do whatever it is you want to do. You're not reliant on other people or on the tools that other people have built for you, so you can just go and do what you want without major roadblocks.

Then we'll also look at some essential packages in Python. In programming, you never want to reinvent the wheel; you always want to start off where the last person left off. You would need to teach yourself to write simple programs, but you wouldn't need to write highly complex mathematical or data analysis packages. Those are already out there. All you need to do is be able to download them and
implement them in your code, and they're going to work. They've been tested a lot, and there's a huge community working on improving them. All of this is for the community, and the whole community works together to improve it. No one's really directly trying to make a lot of money off of it, so they're not going to charge you all of these service fees. Everyone's just trying to improve their package, because if it improves, everyone benefits. So we'll talk about some libraries that you can use, especially in Python, to help you along your way with data analysis and to become a successful data scientist.

In this chapter we're going to talk about statistical data types. We're going to look at the three different types of data: numerical, categorical, and ordinal. These are the types of data that we talked about before, when we said you can't just expect your data to be numerical. So we'll see numerical data, but we'll also see the two other types of data that you may encounter in your career as a data scientist.

All right, so let's talk about numerical data first. Numerical data is also known as quantitative data, and it's pretty much things that you can measure. It's numerical stuff that you can do math with and compare: saying this plus this makes sense, or A is greater than B. These are all examples of numerical data.

Numerical data can be split up into two different segments. One of them is discrete. Discrete means the values only take on distinct numbers. An example would be a measurement of IQ, or, if you do coin tosses, the number of times that you toss heads. You can have 15 heads or 12 heads out of 20 coin tosses, you can have 500 heads out of a thousand coin
tosses, or 500 out of 600. All of these are distinct numbers. They don't have to be whole numbers specifically, but they do have to be distinct. That's the very important part: there's a step size that you're dealing with. And of course you can still compare them. Flipping eight heads out of twenty is better than flipping seven heads out of twenty if you want to flip heads, or flipping eight out of twenty is worse than flipping seven out of twenty if you're going for as many tails as you can. All of these comparisons make sense. So that's the discrete part of numerical data.

Then we have the continuous part. Continuous means that values can take on any number, and they're not limited by decimal place. A value that can be 1.1, where the next value would be 1.2, is not continuous; that's still discrete, because you have a step size of 0.1. Continuous means literally every number from start to finish can be taken on. This doesn't mean every possible number in the universe from negative infinity to plus infinity, and all imaginary numbers and everything that comes with them. That's not required for continuous data. It could be that just every number between zero and one can be taken on.

For example, let's say you have a bottle of water that can hold one liter. If the bottle starts off empty and you fill it all the way up to the top, the amount of water in it has to take on every single number between zero and one. You can't fill up water in small increments and say, hey, I'm going to put in 0.2 liters every single time, because the water doesn't just teleport from A to B. When you're pouring in water, it's more like the stream we see here: the water level rises and rises, and so the amount of
water that we have in our cup needs to take on every value between zero and one. So that's an example of continuous data. But you see that we can be limited to between zero and one; we don't have to start at zero and go all the way up to infinity. It's just that, within the range we're looking at, every single number can occur.

Another good example would be the speed of a car. Say you're standing still at a stoplight, then you want to accelerate, and the speed limit is, say, 50 miles an hour. To get to 50 miles an hour from your starting position, your car has to take on every single speed in between. Of course, you won't see that on your speedometer. It would say something like zero miles an hour, one mile an hour, or maybe it goes 0.1, 0.2, 0.3, so it may look discrete to you. But that's not how your car moves. Your car doesn't go in step sizes of speed; it accelerates and takes on every value from zero up to 50 miles an hour, and during this transition you take on every single one of those speed values. So that's what continuous data looks like.

It's important to understand the difference between discrete and continuous because you may want to approach them differently. Of course, our computers can't deal with infinitely many decimal places. We have to cut it off somewhere, so continuous data is usually going to be rounded off at some point. But it's still important to know that you're dealing with continuous data rather than discrete, so that you know there can still be other values in between, rather than having specific step sizes and
all you see is just a bunch of lines at every step size. With continuous data, you can expect that everything is filled in, that values can, and may well, fall anywhere in between. So that's the important thing to note between discrete and continuous.

All right, so the next type of data is categorical. Categorical data doesn't really have a mathematical meaning, and you may also know it as qualitative data. Categorical data describes characteristics. A good example would be gender. There is no real mathematical meaning to gender. Of course, in your data you can say male is zero and female is one, but you can't really compare the two numbers even though you've assigned numbers to them. You may just do this so that you can split the data up later on in a way your computer can understand, but it doesn't make sense to compare them. You can say male is not equal to female, but you can't say one is greater than the other, or one is approximately equal to the other. Those things aren't well defined. You can't add them up either: male plus female doesn't give you a third category. So you can't really apply math to categories, but they are nice ways to split up or group your data, and they provide qualitative pieces of information that are still important. You just can't easily plot them on a number line or something like that. Those are important things to note with categorical data.

Other examples would be ethnicity or nationality. All of these are examples of categorical data. Like we said, you can assign numbers to them, but that's really just for your
code, so that it's easy to split them up. You still can't really compare them. How are you going to compare nationalities? There is really no definition for comparing one category to another.

All right, and so the third type of data that you can encounter is called ordinal data. Ordinal data is a mixture of numerical and categorical data, and a good example would be hotel ratings. You have star ratings of 0, 1, 2, 3, 4, or 5 stars, or maybe even 6 stars, or whatever hotels go up to these days. But it's still not as straightforward to compare. I'm sure you've seen two different types of three-star hotels. One of them had the bare minimum: the beds were okay, but it wasn't anything special. And then you had the three-star hotel that you could have sworn was at least four-star. Star ratings do make sense. We can say a four-star hotel is probably better than a three-star hotel, because there are standards for these things and they have been checked. If you go to a four-star hotel, you know roughly what to expect. But it's still not completely defined. Coming back to the three-star example, if you just say, hey, we're going to a three-star hotel, it's very hard to know exactly what to expect, because there are different kinds of three-star hotels. There are three-star hotels that have developed further and maybe have a swimming pool, and then there are three-star hotels that are really more like hostels and have just made it past the two-star mark. So it's much harder to know what to expect.

If you take averages of these star ratings, though, you do get a much better idea of what's going on. Say you have consumer reviews, and from 500 reviews a hotel has an average rating of, like,
3.8. Then you know that the three-star hotel you're looking at is pretty much a four-star hotel. It feels like a four-star hotel even though it may not have all of those qualifying characteristics; that's the feel you get from it. Whereas another three-star hotel may have a rating of something like 2.9, and there you know that this hotel is more towards the lower end of three stars, and some people may not even consider it to be three stars. Of course, this rating may be a little bit biased: maybe the reviewers went to a different three-star hotel first, came to this one expecting something completely different, and said, this can't be three stars, this is two stars. That comes from the way the ranking system is defined underneath. So when we have averages of these ordinal numbers, they start to make a bit more sense.

All right, so let's go over a small exercise and see if we can identify what type of data we're dealing with. The first thing we'll look at is a survey response on happiness. You have people filling out a survey, and one of the questions is, how would you rate your happiness: bad, neutral, good, or excellent? What type of data would this be? This would be ordinal data, because it's still in the form of categories and you're asking for a subjective opinion, but the order does make sense. You can still compare them: excellent is greater than good, good is greater than neutral, neutral is greater than bad. But what exactly does it mean to be good or excellent? Where do different people draw the line? There's still a bit of vagueness involved, but generally it makes sense and you can compare the responses. And if you have a lot of surveys and you average them, the values you get are probably going to be very representative, or
at least pretty representative.

All right, the next thing is the height of a child. What type of data is that? It most definitely is numerical. But let's go a little deeper: is the height of a child discrete or continuous? Even though when you measure height you get something like five foot three or 160 centimeters, it's not a discrete value, because to get to that height you have to have passed through every single height before it. At the moment of measuring, you're rounding off to what your measuring tape can resolve, so the measuring tape is limiting the precision. If you had a super precise measuring instrument, you could measure not just five foot three but go into detail with the inches and the decimal places. So the height of a child is numerical, and it is continuous.

All right, now let's talk about the weight of an adult. Do you expect the weight of an adult to be discrete or continuous? We can probably agree that it's numerical, because a weight is pretty much defined to be a number. The right answer here is continuous again, because to reach a certain weight, the person has to have passed through every single weight in between. So weight is something that we can consider to be continuous.

And finally, let's look at the number of coins in your wallet. By the name alone, it says number of coins, so we can probably agree that this is numerical data. But the number of coins in your wallet, would that be discrete or continuous? Well, the answer
would be discrete. It doesn't matter what denomination your coins are; they could be 50-cent pieces, 25-cent pieces, tens, fives, ones, or something like a two. The number of coins that you have is going to be a whole number. You can have one coin, you can have two, you can have three, but you can't have arbitrary fractions of a coin. You can't have, say, the square root of two coins; that doesn't make sense. So you have a defined step size: you have one coin, then with a second coin you have two, then with a third coin you have three. You're going in step sizes of one. So for the number of coins in your wallet, we have discrete numerical data.

In this tutorial we're going to talk about the different types of averages. We're going to see the three different types of averages: the mean, the median, and the mode. All right, let's get started.

We'll start off with the mean. The mean is the typical average that you know: you sum all of your values up and then divide by the total number of values that you have. The great pros of the mean are that it's very easy to understand and it makes sense. We take everything we have, add it all up, and divide by how many values we have, and that should give us a good representation of the average. It also takes all of the data into account: since we're adding everything up and dividing by how much data we have, we're taking every single data point into consideration.

There are some problems with this, though. One is that the mean may not always be the best description, and we'll see why when we look at examples of when we should use the median and the mode. The mean is also very heavily affected by outliers. Since we're taking everything into consideration, if we have big
outliers, that's really going to change what our mean looks like. If we just have normal values between, say, one and five, and all of a sudden we have 10,000 in there, that's really going to affect our mean. So the mean is heavily influenced by outliers, and the bigger the outlier, the more the mean is influenced by it.

All right, so let's see some examples of the mean. We'll go through a worked example first. We can see our data set here, which is just a bunch of numbers. To calculate the mean, we take every single one of these numbers and add them up, and we can see the total that we get here. Then we count the number of data points that we have and divide the total by that count, which gives us our mean, as we can see here. So that's an example calculation of the mean.

But let's see some example applications of the mean. When would we use it? A good application would be the time it takes you to walk to the supermarket. Sometimes you walk a little faster and it takes you 20 minutes to get there; sometimes you walk a little slower and it takes you 25. But on average it takes you somewhere around 22, or maybe 22 and a half, minutes. So when you say, I'm going to go to the supermarket, you know about how much time it's going to take you to get there.

Another good example of the mean would be the exam score for a class. To get a good understanding of how people do in an exam or a class, you can look at the mean exam score from last year. Since exam scores fall in a limited range, the mean is good to use here: you can get anything between 0 and 100, but realistically speaking probably no one is going to get a zero, so your range is even smaller and you're less affected by outliers. You get a sense of how hard a class is going to be just by being able to compare their means.
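
The mean calculation and the outlier effect described above can be sketched in a few lines of Python. This is only an illustration: the exam scores and the variable names are made up, and the outlier list simply mirrors the "values between one and five, plus 10,000" example from the talk. It uses Python's built-in `statistics` module rather than pandas to keep the sketch self-contained.

```python
from statistics import mean

# Hypothetical exam scores, invented purely for illustration.
scores = [62, 71, 75, 80, 88]

# The mean by hand: add every value up, then divide by the number of values.
manual_mean = sum(scores) / len(scores)

# The library function performs the same calculation.
library_mean = mean(scores)

# Values between one and five, and the same values with one big outlier.
typical = [1, 2, 3, 4, 5]
with_outlier = typical + [10_000]

typical_mean = mean(typical)       # sits in the middle of the values
outlier_mean = mean(with_outlier)  # dragged far above every typical value

print(manual_mean, library_mean)
print(typical_mean, outlier_mean)
```

Comparing `typical_mean` with `outlier_mean` shows how a single extreme value can pull the mean well away from where most of the data sits.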
So if you look at one class and its mean is higher than the other's, but they have a large number of students or something, then you can probably say hey, it's easier to get a good grade here, or something like that. These are simpler overviews, without diving too deep into it. All right, another good example of the mean would be how much chocolate you require when you get this kind of sweet craving. You're not gonna say oh, I require one chocolate bar or two chocolate bars, you're gonna say oh, on average I require maybe three-quarters of a chocolate bar. Sometimes I may want a little bit more because I feel like it, and when I start eating chocolate I crave it even more, and sometimes I have it at first and the taste just doesn't sit right with me right now, and so I have a little bit less. But these are kind of the amounts, so if you have this craving, either you say oh, I'm gonna try to be strong, or you're like hmm, I know this feeling, and I know if I eat about three-quarters of a bar of chocolate I'm gonna feel good, my craving is gonna be satisfied. So you kind of know what to expect. So these are some examples of when we would use the mean. All right, so let's look at the next thing, which is going to be the median. Now, the median represents the middle value in your data set. If you have an even number of data points, you don't really have a middle value, and so in that case the median is gonna be the mean of the two middle values, so it's going to be the two middle values added together and then divided by two. So the pros of using a median value are that the median can sometimes be more accurate than the mean, and we'll see some examples of this. The median also evenly splits your data, so you're not really affected the way the mean is, in the sense that if you have an outlier in the mean and it drags everything
to the right, it could be that your outlier drags things so far to the right that all of your data is to the left of the mean and only the outlier is to the right. That would be an extreme case, but it can happen, whereas the median is always located directly in the center of your data. The median also doesn't care about outliers, so if you have huge outliers at the beginning and at the end it doesn't really care, because outliers by definition aren't very common. Whether you have them at the beginning or at the end, they're gonna be very few in number, which is what makes them outliers, and therefore the median doesn't really care about outliers that much. A con, though, is that the median doesn't really give you much information on the rest of the data. Sure, you know what's at the center, but you don't know how everything else behaves, you only know where the center of your data is. So let's see some examples. We'll do a worked example first, where we see our data set here and we can count how many values we have as we go from left to right. We've got 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 data points, so we've got an odd number, and so our median value, our center value, is going to be the seventh data point, because it's 6 from the beginning and it's also 6 from the end. It's equally spaced from the beginning and from the end, and that's why we see our median value here is 26, it's located directly in the center. Now what is the median useful for? Well, the median is often used if you look at household incomes for a country, because if you were to use the mean, then the billionaires would give you a false description of what an average household income really is. Normally you have an average value and you can say oh, the average household income for a family would be, say, $40,000 or something like that, and that would be the median value, but
if you were to use the mean instead, then all of the billionaires and all the millionaires in the country would change that household income, and then you would say oh, the average household income per family would look like 60k. That's a bad representation, because it doesn't actually give you a realistic look at what the average household has. The average household really is centered at like 40k, and sure, there are people below and there are a few people above, but that's what's in the middle, whereas if you were to use the mean for your average you would get this inflated household income, which wouldn't be representative of the rest of the country. Another good example of the median would be the distance that people cover to get to work. If you look at this in terms of kilometers, then you can say oh, some people walk to work and it's one kilometer at most, something like that, and then most people travel around three kilometers to work. Sure, there are some that travel much further because they want to live outside of the city, and there are some that travel very, very short distances because they have a house right next to the office, or their house is the office, depending on where you're working. But then you can look at where in the middle people are: what distance do they need to cover to get to work? So that would be another good use of the median. Another good median value is what you usually spend when you buy a new item of clothing. Sure, sometimes you may go to that expensive clothing store and get a jacket that costs, I don't know, north of a couple hundred euros or dollars, whatever system you want to use, and sometimes you can go to a secondhand store and get it for very cheap, but usually if you go into
stores, a jacket maybe costs you like a hundred dollars or something like that. So if you go out, you can expect to pay about $100, not really taking into account which store you're going into, because most of the stores that you're gonna visit are gonna have that price for the jacket. So that would be another good use for the median. All right, let's look at the third type of average that we can use, which is the mode. Now, the mode looks at the most common value in your data, and it's not really defined if there are several most common values, but if there's only one most frequent value, then that's what your mode would be, and we'll see an example of this in a second. The pros of using the mode are that it's not only applicable to numerical data. If you look at categories, for example, you can say hey, we've got five people from the US, two from Canada, and one from France, and you know that the mode is gonna be the US, because there are five people from the US. So the mode is a great average that's not only applicable to numerical data, you can also apply it to categories or to ordinal numbers if you want, so that you can say the most common country that we have, or the average kind of country that we would expect here, is the US. Sure, there are other countries, but the average, or the most common one, is gonna be the US
in this case. And then of course the other pro is that it allows us to see what's most common, what pops up the most. So that's a great use of the mode: cases where recurring values happen a lot, which is the case for discrete numbers, for example. With discrete numbers, values recur often, and so it's good to use the mode. A con of the mode is that, again, it doesn't really give you a good understanding of the rest of the data, similar to what we had for the median. But also, it's not really applicable if you just have a bunch of different values; then there isn't really gonna be a mode. If there's not enough of each value, it's not really good to use the mode. You don't want to have thousands of data points where the most frequent value occurs like three times, that's not good. You want to use the mode for situations where values recur often, like we saw in the country example. But let's actually see a worked example, and also some other examples for the mode. For the worked example we again take our data set and count how many times different numbers appear, and if we go through the numbers we'll see that 26 occurs the most, and so that's gonna be our mode here. We've got 22 and 25 that both occur twice, but 26 occurs three times, and so 26 is gonna be our mode, it's gonna be our most frequent value. Now, the mode is gonna be useful for things like the peak of a histogram. If you don't know what a histogram is, don't worry, we'll cover that in a later lecture when we go into data visualization, but the peak of a histogram is gonna show you the mode of the data, the most frequent value. Another good use of the mode would be if you look at employee income at a company, because at a company you can again have the boss, who throws off the mean, and you can have higher-level employees too, who kind of shift the median, but if one third of your
employees earn minimum wage, that's gonna be the best average. Or say 40% of your employees earn minimum wage, probably not your employees, because that wouldn't be a very good system to have, but if 40% of the employees at the company that you're looking at are earning minimum wage, that's not a really good thing to have, and if you look at the mode you'll easily see that the average in this case would be to earn minimum wage, because that's what most people earn. Sure, the boss or the CEO may shift the mean up heavily, and because you have higher-ups, the median value may well be too far to the right, so that you really don't consider these employees that all earn the same amount. But you really want to get that description, which is what you get here from the mode. And then the outcome of an election is also something you use the mode for. Sure, sometimes you may only have two values, sometimes you may have three, but if you have, say, five different candidates, then the person with the most votes is gonna win the election, and so there again you'll use the mode.

In this lecture we're gonna look at the spread of data. We're going to start off by looking at the terms range and domain, then we're gonna move on to understanding what variance and standard deviation mean, and then finally we'll look at covariance as well as correlation. All right, so let's start off with range and domain, beginning with the range. The range is basically the difference between the maximum and the minimum value in our data set, so that's kind of simple to think about. Let's go through this with a worked example. Let's set up a company in a town, and this is the only company in the town. The owner of the company earns a salary of 200k a year, and then the employees all have different salaries,
but the lowest-paid employees, or maybe the part-time workers, earn something like 15k a year. So we've got data ranging from 15k up to 200k, and our range is the difference between the maximum and the minimum value in our data. We take 200k and subtract 15k from it, and we've got a range of 185k in salary. That's how much our salary can change: if we start at 15k it can go all the way up to 200k, so that's a 185k range of salaries that people in this company can have. All right, and the domain is going to be the values that our data points can take on, or the region that our data points lie in. If we look at this example again, our domain is gonna start at 15k and go up to 200k. What the domain defines is starting and ending points, or a section in our data, and so in this case the domain would start at 15k and end at 200k. What the domain tells us is that all salaries between 15k and 200k are possible, but within this company it's not possible to have salaries outside of this domain. So if our domain again is 15k to 200k, then we can't have a salary of 14k, because that's outside of our domain, and we also can't have a salary of 205k, because again that's outside of our domain. Pretty much all salaries within 15k to 200k are possible, and anything outside of the domain is not possible. All right, so let's move on and look at the variance and standard deviation, and we'll talk about the variance first. What the variance tells us is pretty much how much our data differs from the mean value. It looks at each value and at how different that value is from the mean, and then it does some calculation and gives us the variance. We don't really need to know the formula; it's more important right now just to understand the concept
of variance. What the variance really tells us is how much our data can fluctuate. If we have a high variance, that means a lot of our values differ greatly from the mean value, and that will make our variance bigger. If we have a low variance, that means a lot of our values are very close to the mean value, and so that will make our variance lower. And now if we turn to the standard deviation, the standard deviation is literally just the square root of the variance, so if you understand one then you also understand the other. Now we can combine this with the range of our data to get a better feel for our data. Let's use an example where we have two different countries, countries A and B, and they have the same mean height for women, which in this case we'll say is 165 centimeters or 5 feet 4, and we'll say that the range of heights for them is identical. Let's say the range could be like 30 centimeters, so you can go anywhere from, say, 150 all the way up to 180, or we can even increase that and say anywhere from as low as 140 up to two meters or something like that, but let's just keep the range for these the same, and they both have the same mean height. Now if country A has a standard deviation of five centimeters, which is approximately two inches, and country B has a standard deviation of 10 centimeters, which is approximately four inches, then what you can expect, knowing these values, is that if you go into country A, the people that you're gonna see are gonna be much more similar in height. Our standard deviation is lower, which means our values differ less from the mean, and so a lot of the women that you're gonna see are going to be very close to 165 centimeters, or 5 feet 4 plus or minus 2 inches. So what you can expect when you go to this country is that a lot of the women are gonna be about
that height, whereas if you go to country B they have a much larger standard deviation, and so you can't really expect everyone to be about 5'4, because it fluctuates a lot more. If you go to that country you can expect to see a lot more women of different heights, both taller and shorter than 5'4. All right, and so that's how we can use the variance and the standard deviation to give us a little bit more perspective on our data and allow us to infer some things about it. All right, so let's talk about covariance and correlation. Covariance already has the word variance in it, but covariance is measured between two different variables. Let's say we've got me drinking coffee in the morning and my general tiredness. I use these two variables and get data points: this is how much coffee I drank in the morning, and this is how tired I feel this morning, or something like that. What the covariance does is look at how much one of these values changes when I change the other one. What does that mean? Well, if I drink more coffee, the covariance would look at how much my tiredness changes. So that's what you do with covariance: you say, I change one, how much does that affect the other thing that I look at? And correlation is very similar to covariance, we just normalize the covariance by dividing by the standard deviation of each variable. What that means is we get the covariance for my drinking coffee versus feeling tired, and then we divide by the standard deviation of me drinking coffee and the standard deviation of me feeling tired. So really what we're doing with the correlation is bringing it down to relative terms that fit our data better. That's the abstract idea; the important thing to keep in mind is that we're looking at one
and we're seeing how much that changes, and we're seeing how much that change affects the other one. All right, so there are different correlation values that we can have, and they can range anywhere between negative 1 and 1, so their domain is between negative 1 and 1. A correlation of 1 means a perfect positive correlation, so that when one variable goes up the other goes up. For my coffee example, that would be if I have coffee in the morning then I also feel happier, so the more coffee I have the happier I feel. Of course there's going to be a limit, but let's say I only drink up to two cups of coffee and I can drink anything in between, and the more I have the happier I am about it. That would be a positive correlation: the more I have of coffee the more I have of happiness, and so they go up together. Then when we get closer to zero, the zero point is gonna mean no correlation. Anything between zero and one is going to be a slightly positive correlation, not a super strong one, and we'll actually see some examples on the next slide. The closer you get to zero, the more it means no correlation. An example for the zero case would be that it doesn't matter how much coffee I drink in the morning, it's not gonna affect the weather; they're unrelated, one does not affect the other. I could drink one cup of coffee during a sunny day and one cup of coffee during a rainy day and it's not gonna change the weather, so they're pretty much uncorrelated. Then we can also go down into the negative range, and the closer we get to negative one, or if we reach exactly negative one, that correlation of negative one means a perfectly negative correlation. Here we can take our example of coffee versus tiredness, and so the more
coffee I have the less tired I'm gonna be, so coffee goes up and tiredness goes down. That's how we can understand this correlation, and it comes from the covariance, so it was important to understand the covariance. We usually use the correlation, because the correlation, since we divided by the standard deviation of each variable, is a much better fit to our data. Now there is one thing that's very important to remember, and that's that correlation does not imply causation. Just because two things are correlated does not mean that one causes the other. A good example of this would be if I live in a climate where it's usually cloudy in the morning and I know it to be sunny in the afternoon, and every morning when it's cloudy I drink coffee, and then it becomes sunny in the afternoon. Even though me drinking coffee and it becoming sunny may be correlated, me drinking coffee does not cause it to be sunny. It's just because it happens every day, and by chance there's this correlation that appears, but that does not mean that me drinking coffee results in the weather getting better. A causation would be me drinking coffee and me feeling less tired, or me drinking coffee and me feeling happy about it because I like the taste; those would be causations. So that's an important thing to keep in mind: just because things are correlated does not mean that one causes the other. All right, so let's see these things on a graph. Here we have the examples again that we've talked about, but we can see what the data would look like for different types of correlations. We can see the perfect correlation of one on the left side, so one goes up, the other goes up, and we pretty much get this really nice straight line. Then the closer we get to zero, the less related they are, or the less correlation there is between them, and the more kind of
variance we have in the data. We'll notice that in the case of perfect correlation, which is the one, or the case of perfect anti-correlation, which is the minus one, where again we had the example of more coffee, less tired, we have a very nice thin line and our data doesn't jump around a lot. But the closer we get to zero, the less we can see one moving with the other and the more we can see our data spread out. So that's what correlation would look like graphically.

In this tutorial we're gonna go through quantiles and percentiles. All right, so let's get started. What are quantiles? Quantiles allow us to split our data into regions that, if we're dealing with probability, all have the same probability of occurring, or, if we're just dealing with sizes of data, into equal regions. So that's what we can do with quantiles: splitting everything up so that every time we split it we have equal amounts of data. All right, and an example of a quantile would be something known as a quartile, which is when we split our data into four equal regions, hence the name quartile. So quantile is the general name for doing this splitting procedure, and if we say quartile that means we're doing quantiles for four equal regions. This is something that you'd probably often see on university admissions pages or something like that, where they say the top 25 percent of our applicants have a test score of at least 90 percent or something, and then they would say the bottom 25 percent of our applicants, or admitted students, have a test score that is, I don't know, 70 percent or 75 percent or something like that, and then the median test score is 85 percent. So that's how you would go about quartiles: you have the lower 25%, the middle 25 to 50, then you've got the 50 to 75, and then you've got the top 25
percent, so the 75% to 100, and so you've got these four equal regions, which also include your minimum value at the very bottom and your maximum at the very top, and in the middle you've got your median value. That's the value directly in the middle, because you're splitting the data up into four equal regions, and so the value that separates the second quartile, which would be the 25 to 50, from the third quartile, which would be from 50 to 75, would be the median value. All right, and so if we go into percentiles, that may be a name that you've heard before. Percentiles are again an example of a quantile, but instead of a quartile, for a percentile we mean splitting the data into 100 equal segments, hence the percent at the beginning of the name. You may have noticed percent means out of 100, if you are familiar with percent, and that's the same reasoning this comes from. So we've got percentiles, which means splitting into 100 equal segments, and an example of this is often used in test scores. If you've ever taken something like the SATs, then you get a test score but you also get a percentile, and the reason they've done that is to judge not you versus the test but you versus everyone else. If it's a difficult test, then something like getting a test score of 60% but being in the 95th percentile means your score is actually a lot better. What you can say with percentiles, for example, is that the percentile that you're in means you're better than that many other people. For example, if you reach the 99th percentile, that means you're better than 99% of the people that took the test, and the 95th percentile would be you're better than 95% of the people that took the test, or something like that. So that's why percentiles are often used for tests, and they're
often used for normalization, because they allow you to take into consideration these factors, like is it a difficult test, is it an easier test, maybe more people are scoring higher. They don't really judge you directly versus the test, they normalize you against everyone else that took the test. You take the test, you get a score, and then the percentile checks where that score lies relative to everyone else. So these percentiles give you a good normalization, and they allow you to do great comparisons, because they allow you to eliminate some of these factors of test difficulty. Of course there can always be luck involved, and that may not get filtered out on an individual basis, but it does if you do this for a lot of students, and that's also why in these big standardized tests you get a percentile along with your score. If your score is lower but the test was really hard, you can still see that you did really well, because people found this test really hard and it was even harder for them than it was for you.

In this tutorial we're going to talk about the importance of data visualization. All right, so first we're gonna look at the role that the computer plays for us and what the computer is actually made for, then we're gonna look at what role the human should play in terms of data science, then we're going to look at presenting data, and finally we'll talk about interpreting data. All right, so let's get started and talk about the role that the computer plays. Now, a computer is much, much faster at calculating than a human, because that's what it's made for: it's made for crunching numbers, it's made for doing fast calculations. If you think about how fast computers are, they're in the gigahertz range, and giga means billions, so they just do billions of things every second. And so they're really good for doing
repetitive things, because they can do them so fast. We can give them these logical tasks in terms of programming, we give them a structure, and they just do it, and they can do it over and over and over again. They're not gonna mess up, they can just repeat the same thing, they won't get tired of it, and they're really good and really fast at doing these things. So that's the role that the computer should play for you: it should be a means to get this hard number crunching and all of these things done. There's really no need for you to work out all this complicated math, because your computer can do it much better and much faster than you, and it's also less prone to error if you code it correctly. That's the only part where you come in, and it's only gonna mess up if you mess up, but generally a computer does exactly what we tell it to do, and it's really good and really fast at it. Now what role should a human play in terms of data science? Well, humans have naturally developed to identify patterns, and we've done this first for survival, so that if we're walking around somewhere and we see, I don't know, a big predator hiding, we can identify the pattern of the predator and pick it out even though it's trying to camouflage itself. So humans by nature have become very, very good at identifying patterns, and you can also see this if you look at the clouds and you see animal shapes in the clouds, or other things. Those patterns aren't actually there, but humans have become so good at identifying patterns that we can see things in many, many places. So that's what humans are really, really good at: we're able to look at things and pick out patterns. Now another thing that humans are really good at is that we are very creative, and alongside our creativity we can also use memory and bring in outside knowledge, and we can also use a general understanding. So these are all things that
computers can't do. Computers are a means of getting stuff to us, but once it's actually there it's our job to use our pattern recognition abilities. Of course you can train machine learning algorithms for specific patterns or specific cases later on and make them really good at that, but generally, if you don't know exactly what's gonna come, then your first step as a data scientist would be to try to identify these patterns: use your creativity, use your memory, bring in all of these different things that make you human and use all of that on the data, all of these things that a computer just doesn't have any access to. Okay, so considering all this, the best way to do it would be in terms of data visualization. You can't just show spreadsheets with a bunch of numbers, that won't really help you, because when you're looking at numbers it's really hard to pick out patterns. The best way to do it would be to plot values, and then if we have these visuals in front of us we can really identify things: we can see things go up and down, we can see them fluctuating, we can see them make very thin lines. We can just look at a graph and see things. Of course we need a little bit of practice to understand what that graph is trying to tell us, but once we understand graphs in general, we can look at new graphs and just see things, so we can start to see patterns. They may not always be true, but that doesn't mean we can't pick them out, and later on you would also do some testing to see if those patterns are true, if they make sense. But generally data visualization is very good for this, because it allows you to invoke all of your human characteristics, the things that make us human, the things that we talked about in the last slide, all the things that the computer can't do.
And sometimes you deal with just these numbers yourself, so data visualization is for you in one sense, so that you can see these things and try to pick them out and use them later on. But it's also for when you're trying to show these things to other people. Maybe you have to do a presentation or a kind of summary, and then you want to make sure that your data visualizations are good, because the people that are going to be looking at them are much, much less trained in looking at data and analyzing data than you are. If you try to convey a message to them and just show them a big spreadsheet with numbers and point out, here, look, these numbers pop up, they're gonna be like, what are you talking about? So that's why it's really important to have really good data visualization skills: one part of it is to enable you to do your job, but the other part is to show it to other people and to help you convey information to them. Of course, we talked about statistical values, and statistical values are very important, and they can give us a good idea about the data and what's going on inside of the data, but visualizing data is taking it to the next level, and statistical values aren't enough there. They can help us, they can support us, they can give us ideas, but if we really want to understand what's going on, sometimes we just have to take a look at what's going on. Of course it's also important to make sure you choose the right visualizations, because otherwise things may just look extremely weird. But this skill of being able to present data, both for yourself and for other people, is very, very important for a data scientist. And then we go over to interpreting data. We've touched on this in the last section already, but really data visualization just allows you to see the data and allows you to apply some reasoning to the system,
And so, if you look at data, either you see something, which is great, because that means you can try to test it and see if it's actually there, or you don't see anything, and that also tells you something: you aren't really able to pick out a pattern, so there isn't anything obvious going on. There may be something underlying that's more complicated, but nothing that's obvious to the user. All of these things allow you to analyze your data much more easily and prepare what you're going to do after that. So data visualization really gives you a deep, deep understanding of what's going on with your data.

And then, when we interpret this data and look at these visualizations, maybe you see dips and maybe you see some hills somewhere, and we can try to understand all of this by bringing in our outside knowledge. So again, this is what the human is really good at: we can bring in the context of things. Maybe people are going out to lunch here, and that's why activity decreases, or maybe everyone is coming to work in the morning, and that's why activity increases compared to, you know, 6 a.m.
So all of these things, all of this context, all of this understanding, we can bring in to try to interpret the data and better understand what's going on. And then, of course, we're hopefully going to see some trends or patterns. Like I said, these may not always be there: we're actually so good at pattern recognition that we can sometimes see patterns that aren't really there. A good example of this would be looking at the clouds in the sky: you can maybe see animal shapes, but that's really not there; that's just our minds identifying all of these patterns.

So yeah, that's pretty much why data visualization is so important to a data scientist. It's because this whole human aspect is just key in data science and key in data analytics: to be able to understand what's in front of you, to bring in this outside knowledge, to be able to contextualize, this creativity, that's really key to a good data scientist. A computer can help you with all of this. The computer can help you do the number crunching, it can help you set up the visualizations, and it can plot whatever you want it to, but ultimately it's up to you to choose the right visualizations, to look at the data, and to communicate the visualization as well. All of those things are up to you, and that's why the human is so, so important in data science.

In this tutorial, we're going to look at one-variable graphs. We're actually going to see some of the types of graphs that we talked about in our last tutorial, where we just looked at the importance of data visualization. So now we're going to go into data visualization and look at the types of graphs that you may want to choose from. The graphs that we're going to look at in terms of one-variable graphs are histograms, bar plots, and pie charts. So let's get started with histograms. Now, we
can see an example of a histogram on the right, but what's really cool about histograms is that they show us the distribution of the data across all the values in our data. They show us what happens the least and what happens the most. Histograms let us see where our data is concentrated, and they also let us see how it's distributed, and through this they show a kind of general behavior. Really, what a histogram does is look at each value and how often that value has occurred. What we see here, for example, is that around 0 we have the most occurrences of whatever value we're looking at, and as we move to the left and to the right, these values start to drop off, so they become less frequent. That's what the histogram shows us: a kind of frequency, how often these things occur.

There are different types of histograms that you can encounter. I mean, generally a histogram is just this plotting of frequency versus your value, but there are different ways that a histogram can look. One of them is the one that we've just seen, which is a normal distribution; it's called a Gaussian-like histogram because it follows this Gaussian or normal distribution. But we can also have an exponentially decaying histogram: we start off very high, and the frequency drops as we get further away from that initial value. You can compare that to the normal distribution. The normal distribution looks more like a bell; it goes up and then curves down slowly, whereas the exponential drops off very fast at first and then slows down later on. So they do have different behaviors. And then, of course, we can also get not just one peak, like we see in this first case with the Gaussian-like distribution, but things like two
peaks, or even three peaks or more, or very large, extended peaks. So histograms are a means of showing us how the data is distributed, what occurs most frequently, and where our data is concentrated, but that doesn't mean they have to have a specific shape. There are many different shapes that histograms can take on, and the shape that you get also tells us something very different about our data.

All right, so the next one-variable plot that we'll look at is the bar plot. Bar plots may look a little bit similar to histograms at first, but they are very different in some sense, because bar plots allow us to compare across different groups. That's what we see on the x-axis down there: different groups. We use the same variable, but we can compare that variable over different groups. If we look at an example, what we see on the right here is different countries, and what we show is the average income tax. We see that country B, for example, has the highest average income tax, whereas country D has the lowest. So we're still only looking at the income tax variable, but we're able to compare it over different groups, over different categories if you will.

Other examples would be control groups and test groups. If you're doing some medical study, or maybe a psychology study or something like that, you always want to have your control group, and then you can have different types of test groups. You can plot each of these groups in a bar plot and look at how the same variable changes over the different groups. Another example would be comparing male versus female heights: you've got one group that's male and the other group that's female,
and you can just plot their average height. And then there's the income tax of different countries, which is what we've seen on the right over here.

All right, so the last one-variable graph that we're going to look at is the pie chart. What pie charts allow us to do is section up our data and split it into percentages, and because of this we can see what our data is made up of. The whole pie corresponds to a hundred percent, and then we cut it into different slices. Through that slicing, hopefully also with color coding like we've done here, and most definitely with labeling so that you know what slice corresponds to what value, we're able to see what categories our data is made up of. We can see what is most prominent, and we can also see what is least prominent. Here, again, we can also see distributions, not as well as in the histogram, but in terms of dominance and in terms of how many groups there are: is the data spread evenly, or is it heavily concentrated in one part of the pie? That's what we're able to do with pie charts; we get this nice group overview of one variable.

An example of this would be the ethnicity distribution in a university. You can have a pie chart where each slice of the pie represents a different ethnicity, and how big the slice is depends on what percentage of the total university profile that group makes up. So you can see the dominance of some ethnicities as well as minorities, and just by how many slices there are you can see how many different ethnic groups there are. Another example would be splitting up star reviews for a product. Rather than looking at the average star review, you can also just use a pie
chart and see how many of the reviews are 5 stars and how many of them are 4, 3, 2, and 1 stars. There, again, you get this nice, different overview of how the review system works.

Now we're going to talk about two-variable graphs. The graphs that we're going to look at are scatter plots, line graphs, 2D or two-dimensional histograms, and box-and-whisker plots.

All right, so let's start off with scatter plots. For a scatter plot, what we're doing is really scattering all of our data points onto a graph: for pretty much every data point that we have, we put a little dot on the graph. Scatter plots are great because they allow us to see the spread of data between two variables. We're always plotting one variable on the x-axis and another variable on the y-axis, and that allows us to see how the data is distributed for these two variables. Through that we can see more dense areas and some sparse areas, and we can also look at correlations. Maybe you remember that in the lecture where we talked about correlations, we were able to see through scatter plots where there were or weren't any correlations. That's what scatter plots are really, really nice for.

Of course, we can also use scatter plots to show little clusters, like we see here. Not everything needs to be connected by a line or a curve; maybe something is more like a circle, and scatter plots can show us that too. They can show us these groupings. We see one cluster here, but maybe you have bigger plots with, like, ten little groupings for different things. So scatter plots are really great for that, because they just show us where the data points are located for these two variables, and then we can see for ourselves what they look like: does one variable affect the other, are there maybe certain
groupings that we can see, where are the dense areas, where is it sparse, where are things concentrated, is everything spread out all over the place, or is it very, very narrow and only in a specific region? Scatter plots allow us to see all of these things very easily.

Some examples of where we could use scatter plots: if we look at the graph on the right, we can look at something like car price versus the number of cars sold. Each of these data points pretty much represents a car that's been sold; the x-axis tells us the price that the car has been sold at, and the y-axis tells us the number of cars that have been sold at this price. What we see here very easily, for example, is that the higher a car is priced, the less it gets sold. You can think about why that is: the higher the price, maybe people don't want to buy such an expensive car; maybe they found a cheaper version of it, so maybe it's just a branding thing that makes it more expensive; maybe there's something of just as good quality that's cheaper; or maybe people just don't have enough money, and that's probably a big factor too. People just don't have enough money to buy these expensive cars, and so sales drop off. It may look a little bit different in terms of profits, but the higher the car is priced, the less we see it being sold. So that's one example of a scatter plot.

Something else that we can look at is income versus years of education. There we would look at how many years someone has been educated on the x-axis and then at their current income, and that would just be one point on the graph. We can do that for many, many different people, and then we can see how different amounts of education affect people's current income. So that's another thing we can do a scatter plot for. We can also go back to one of the examples that we used very early on, where we talked about people
traveling to work, and we can just plot the distance traveled versus the time it takes to travel to work. Then we can see that maybe some people travel faster; it could be that some people travel the same distance but one takes longer than the other because one goes by car, another goes by bike, and another takes public transport. All of that we can see in these scatter plots, and we can take these different situations into account and see how it all looks for the general population of our data. So scatter plots are really, really great as a kind of first go-to for identifying trends, identifying regions, and just getting a good overview of your data.

Now, the next thing that we'll look at is line plots. Line plots are in some sense similar to scatter plots: we have the same basis of the x and the y axis, but the points are connected. It's very important to know when to choose line plots and when to choose scatter plots. Line plots can carry a lot of advantages, because this connectedness makes it very easy for us to see trends: we can see where the lines go rather than trying to connect the points in our head, like connect-the-dots. That's exactly what a line plot does: it connects the dots for us. It's great if we want to see the evolution of something, so maybe an evolution over time, an evolution over space, an evolution with people, something like that. If our data points are connected, so if we know that whatever happened before is connected to what happens next, it's great to use a line plot, because line plots show us how things evolve. But if we were to do that with scatter-plot data, where we just kind of plot points as they come, well, if we go back to our cars sold,
car price example: just because we look at an expensive car that's been bought, say, five times, and a cheaper car that's been bought a hundred times, there isn't really a logical connection to make between the two. If we were to use line plots where we should use scatter plots, really what we'd see is just a bunch of lines all over the place. That's why it's important to know when to use line plots and when to use scatter plots. If you use a scatter plot instead of a line plot, it's going to be a bit more confusing, because you have to try to connect the dots yourself in your head; but if you use a line plot instead of a scatter plot, it's going to look really weird, because there are just lines all over the place and you can't really see anything.

An example of where we could use line plots is the typical distance versus time: you look at what time it is and how far someone has traveled. A general curve of distance versus time is very, very common. You can also look at the profit of a company versus the number of employees: the more employees they employ, how does that change their profits? Of course they have to pay more employees, but maybe the employees can also do more work, and hopefully that more than cancels out what you pay them and increases the company's profits. And then, what we can see on the right here is your creativity and how it changes with stress. You can see that the more stressed out you are, the less creative you are. Here it's also good to use a line plot, because you gradually advance in stress, so each point in stress is related to the next: the higher you go up in stress, the lower you go down in creativity. There's this relation where we can see an evolution: the more you get stressed out, the less creative you become.
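The choice between a scatter plot and a line plot can be sketched in a few lines of matplotlib. This is only an illustration: the numbers in `car_price`, `cars_sold`, `stress`, and `creativity` are invented here and are not the data behind the graphs shown in the tutorial.

```python
import matplotlib.pyplot as plt

# Hypothetical numbers, invented purely to illustrate the two plot types.
# Independent observations: each pair is one car (price, units sold).
car_price = [32, 12, 55, 18, 41, 25, 70, 15]  # in thousands
cars_sold = [40, 95, 12, 80, 25, 60, 5, 90]

# Ordered observations: creativity measured at steadily increasing stress.
stress = [0, 1, 2, 3, 4, 5, 6, 7]
creativity = [9.5, 9.0, 8.2, 7.0, 5.5, 4.0, 2.8, 2.0]

fig, (ax_scatter, ax_line) = plt.subplots(1, 2, figsize=(9, 3.5))

# Unconnected points: there is no meaningful path from one car to the next.
ax_scatter.scatter(car_price, cars_sold)
ax_scatter.set_xlabel("car price (thousands)")
ax_scatter.set_ylabel("number of cars sold")
ax_scatter.set_title("scatter plot")

# Connected points: neighbouring stress levels are related, so a line
# shows how creativity evolves as stress increases.
ax_line.plot(stress, creativity, marker="o")
ax_line.set_xlabel("stress")
ax_line.set_ylabel("creativity")
ax_line.set_title("line plot")

fig.tight_layout()
# In an interactive session, call plt.show() to display the figure.
```

Replacing `ax_scatter.scatter(...)` with `ax_scatter.plot(...)` on the car data would connect the points in the order they appear in the list, which produces exactly the "lines all over the place" effect described above.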
And so line plots are really nice here, because there's not this chaotic movement everywhere; it's very easy to see this line and very easy to follow it.

Okay, so the next graph that we can talk about is the two-dimensional histogram. We've seen one-dimensional histograms in the last tutorial, where we looked at the spread of data, at the peaks, and at how things were distributed to the right and to the left. But we can also do a two-dimensional histogram. A two-dimensional histogram is like a one-dimensional histogram, but it's pretty much a histogram for every single point of the other variable that we're looking at. So really, what these things allow us to see is how the distributions of the two variables are relative to one another. We can see here, for example, in the red region, that those specific values happen a lot, so that combination of values happens a lot. We're able to pinpoint these frequencies again, and we're also able to look at drop-offs, but now we're able to pinpoint them to two specific values rather than just one, which is what we did with the one-dimensional histogram.

These things are much harder to see in scatter plots, because in a scatter plot, if we have a value occurring a hundred times, it would just be the same dot, and the dot wouldn't get bigger. Of course, you can make the dot bigger yourself if you want to, or you could change the color or something like that, but really, if you do a scatter plot and the same thing happens a hundred times, it's just going to look like one dot. With two-dimensional histograms, we can see that it's not just happening once; we can actually see the frequency of those two variables together. An example of a two-dimensional histogram would be ticket prices versus tickets sold. If you look at the lower left corner, we can see this red peak.
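A two-dimensional histogram like the one just described can be sketched with matplotlib's `hist2d`. The ticket data below is synthetic and generated only for this illustration (two made-up clusters of events); it is not the data behind the figure in the tutorial, and the default colormap is used rather than the red shading mentioned above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)

# Synthetic, hypothetical data: many events with cheap tickets and moderate
# sales, plus a smaller group of expensive events with high sales.
n_cheap, n_pricey = 800, 200
ticket_price = np.concatenate(
    [rng.normal(20, 5, n_cheap), rng.normal(90, 10, n_pricey)]
)
tickets_sold = np.concatenate(
    [rng.normal(300, 50, n_cheap), rng.normal(700, 80, n_pricey)]
)

fig, ax = plt.subplots()

# hist2d splits both axes into bins and counts how many (price, sold)
# pairs fall into each cell; the color of a cell encodes that count.
counts, x_edges, y_edges, image = ax.hist2d(ticket_price, tickets_sold, bins=20)
fig.colorbar(image, ax=ax, label="number of events")
ax.set_xlabel("ticket price")
ax.set_ylabel("tickets sold")

# Locate the most frequent combination of the two variables.
i, j = np.unravel_index(counts.argmax(), counts.shape)
print("busiest cell starts at price", x_edges[i], "and sales", y_edges[j])
# In an interactive session, call plt.show() to display the figure.
```

In a plain scatter plot of the same data, the dense cluster would collapse into an overlapping blob of dots; the per-cell counts are what make the frequency visible.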
So that's cheaper ticket prices, but the tickets are also sold often, so we know that tickets at that price are sold quite often. These could be new, rising bands, or your kind of standard bands where maybe you want to take someone on a date and you don't want to spend much money on a ticket, but a concert is still a nice idea. So that's a good ticket price that sells a lot of tickets, because it gives you the pleasure of the event without making it too expensive. Then, if you move towards higher ticket prices and towards more tickets sold, you can see that for high ticket prices, which would be these big bands, we can again see how many tickets have been sold. So if you want to see lots of tickets sold for a high price, then the red peaks are going to give us all of these more famous artists. That's one kind of application, but of course there are many, many better ones. It's just that, with these things, once you're in the moment, you would realize, "oh, this is when a two-dimensional histogram would be a great thing for me to use." So a lot of these graphs are great to know, and once you're in the moment it's much easier to pick out which graph would be the best representation.

Finally, the last graph that we're going to look at is the box-and-whisker plot. What box-and-whisker plots allow us to do is see the spread within our data. It's not like a bar plot, which just shows us one value; we can actually see the statistical spread. We can see median values, which is what we see here, we can see quartiles, and the little dots on the outside show us outliers. So box-and-whisker plots allow us to see the statistical information, but they allow us to see it visually, and that makes
comparing across different groups, which is what we're doing here, much easier. A good example of that would be ticket prices for football games for different teams. Different teams of course use different stadiums, they have different popularities, and ticket prices for some teams may be much more expensive than for others. We can compare these ticket prices using box-and-whisker plots. Then we can see the higher end of the prices, which are going to be the more luxurious seats; then we go to the bottom, and those are going to be the less luxurious seats, probably the ones where you stand; and then you have middle values for the standard seats, depending on where you are in the stadium, whether you're close to the field or further away but still sitting. All of these things we can see here, and that's what gives us a spread. We can compare that spread across different teams, and we can also see which teams are more expensive and where the prices vary the most. Maybe some teams have a super lounge and also have standing places that are just much cheaper, so you would see a much larger spread; or maybe some teams only have seats, so you'd see a much smaller spread. All of these things we're able to compare over different groups using box-and-whisker plots.

In this tutorial, we're going to talk about three- and higher-variable graphs. The graphs that we're going to look at are heat maps, and then we'll also look at multivariable bar plots, as well as how we can add more variables to some of the lower-dimensional graphs that we've talked about earlier.

All right, so let's start with heat maps. What heat maps allow us to do is plot two variables against each other in the x and the y, and they allow us to
show an intensity or a size or something like that in the z direction, or towards us. An example of this, which is what I've tried to illustrate on the right, is a customer moving through a store. We can track the path of the customer in the x and y directions of the store, so you get this bird's-eye view and see where they move to, and the darker spots tell us the positions where they spend more time. We can see that they spent a little bit of time at the beginning: they moved in and then they stopped once, which is what we see with the dark spot. Maybe they found the candy aisle, or there was a specific piece of candy that they wanted. Then they moved on, started to go around the corner a little bit, and maybe they reached the fruit and vegetable section there and picked out several things. Then they started to head towards the checkout counter, which happens at the very end, and they were moving at a more constant pace; sometimes they stopped to look a little bit, but they just kind of continued moving on. So the three variables that we've shown here are their x position in the store, their y position in the store, and, through the color, the time that they spent at each position. That's what we can use heat maps for.

Another example of a heat map would be if you take a flashlight and move it over a screen: really what you're showing is the amount of time that you've shone the flashlight onto a specific region. But usually a heat map, as the name implies, allows you to track positions, and so it's very often used for things like tracking customers through stores, or just tracking people's locations in general and where they like to spend their time. The intensity that you see in terms of the color is usually the amount of time that they spent there.

All right, so we can also
do multivariable bar plots. A multivariable bar plot is very similar to a single bar plot, where we just plotted one value over different groups, but rather than plotting just one, we kind of cram them together and plot several. An example of this would be plotting, for different teams, the goals scored, the shots taken on goal, and the shots on target. We can see that maybe there are teams that score less, but that's because they also shoot less, and therefore they also shoot less on target; or maybe there are some teams that score a lot, and that's because they shoot a bunch, but they just don't hit the target that often; or maybe there are really good teams that score a lot and also shoot a lot on target. All of these things we're then able to compare over different groups. So that's what we can use multivariable bar plots for: if there are several variables that would give us a better understanding of the system than looking at the variables one at a time, and it would be really cool if we could compare all of them, then we can just plot them on the same bar plot. Then we can see how they change within a group, and we can also see how they change over different groups.

Okay, something else that we can do is add extra dimensions to the lower-dimensional graphs that we've had. We're kind of limited to three dimensions, because that's the number of spatial dimensions that we live in. But if we take a scatter plot, for example, where we started off with just the x and the y axes and points located on them, we can actually add a third axis. We take the x and the y and add a z, and that gives us an extra depth dimension, which is exactly what we see here. So rather than plotting on a two-dimensional field, on a plane, we can actually plot in a volume, and we can see this kind of scattered ball that
we've plotted here, which is located at the center of our plot. This can be really cool because it allows us to see depth too. The problem with this is that we only ever have snapshots, so really we're looking at two-dimensional snapshots. To get the best understanding, we need to rotate our plots as we make them, so that we can also add in our depth perception. Right now, looking at it, it may look three-dimensional, but really it's just a two-dimensional snapshot. To understand whether our scatter plot is located more towards us and more towards the left, or maybe really high and close to us, or maybe really low and far away, we need to be able to rotate it so that we can see it from different angles, which then gives us this depth perception.

We can do the same thing with 3D line graphs. Here we see an example of maybe the position of a skier as they're skiing down a hill. We can trace that through time, and we see that they're going down the hill in this nice zigzag motion, as you should, and we can just track their position over time. So here we've added extra dimensions to the line graph: rather than just taking a time and one position at that time, we've added a second position, or actually even a third. We've got the x, the y, and the z position, and we just trace them over time, and that gives us this whole line here. That's how we can take these lower-dimensional plots that we've looked at before and add extra dimensions to them if we want. As long as it's still easy to see and it makes sense what we're looking at, we're really just able to slap on another direction there and compare another variable.

In this tutorial, we're going to touch on the
third major section that is really great for data scientists, or that should be an essential for data scientists, which is the ability to program. So why do we program? Well, there are different reasons why we want to be able to program. The first one is the ease of automation, the second is the ability to customize, and finally, there are many great external libraries for us to use that just make our job so much easier.

All right, so let's get started and talk about the ease of automation. What do I mean by that? Well, being able to program really allows you to prototype fast and to automate things, and it also gives us the extra benefit that if we have something in our mind, we can just take that and put it into the computer by programming it. So we're able to automate everything very fast, and we don't have to do these repetitive tasks, like copy-pasting stuff into or from Excel. If we want to repeat something, or quickly change just a small thing, we don't have to do a lot of work; we can just change that in our code, click play, and let the computer take care of all of that for us rather than having to do everything manually. So it's very easy for us to automate things.

It's also very easy to automatically create reports. All you have to do is set up your program to deal with the data that you're going to give it, and then it can automatically create reports every week. The reports can be different because you give it different data: the layout should still look the same, but the values can be different. So it will just automatically create all these reports for you, and you don't have to do that all yourself. The program does it for you, but you've built the program and you're giving it this different data, so you're still doing all of the analysis; it's just that you get to
skip the part of copy-pasting, looking across and taking over the values, and doing all the formatting for the same report over and over and over again. All of that is taken care of for you. All you have to do is put in the right data, write out everything that you want to do, and then click play and let the computer handle all of that for you, because remember, that's what the computer is good at: doing these repetitive tasks.

Okay, we also want to be able to program because it really allows us to customize. Once we go into data analysis and start seeing things, we get these ideas that we want to expand on, or different directions that we want to take our analysis in. Being able to program just allows us to take all that, put it into code, and choose that direction. We can very easily dive much deeper into our analysis and discover things fast, because it's up to us where we want to go. So this ability to customize with programming is very, very important, because we're not reliant on anything else. We're not reliant on some software that maybe breaks down, or that we don't know how to use perfectly so that we have to read the manual and a help section. No, we know how to program, and we just type down exactly what we want to do, exactly where we want to take it, and exactly what we want to see. We can customize very, very fast with that, and we can also prototype very, very fast. Maybe a visualization is not working: to turn a scatter plot into a line plot is very easy, you just change one word. All of these things are very easy to do with programming, because we have all that power at our fingertips and we can change everything that we're looking at and everything that's being calculated. Maybe we want to calculate an extra thing and take out something else because it's irrelevant. All of these things we're able to customize, and
all of that we can do because we're able to program. So really what we're doing is making the data ours. We're taking full control of the data, of where we want to go with our analysis, of what we want to see, and of what we want to show.

Now let's talk about libraries, and I'll also give you two great Python libraries that you should get comfortable with, or at least consider using, for data analysis. First of all, what are libraries? Libraries are pieces of code that have been pre-written by others that you can just bring in and use. A very good example is a math library. It has the square root function, raising to a power, the exponential, the sine, the cosine: all of the things that you know and want to use but don't want to program yourself. It lets you skip the middle step of having to program the equation to calculate a sine. Those are things we don't want to do. We don't want to get distracted from our target; we want to do exactly what we set out to do without having to program completely unrelated stuff. That's what libraries are great for. They're developed by the community for everyone to use, everyone is helping each other, and these libraries bring a lot of power with them.

One of these libraries is called pandas. Pandas is pretty much like Excel, except that we can program with it, which makes it so much better: we can do things fast, and we can do all of this customization and automation. With Excel, if you give it too much to run, it can start to crash, because it also has to handle all of the visual things, the UI, and it isn't as structured. Whereas in programming, the program, you know, your
computer just goes through everything step by step. It doesn't have to take care of all of these visual things; it just does the calculations underneath. But we can still do all sorts of data management with pandas. We can shift our data around, we can drop columns, we can split things up by row, and we can pick out certain rows. We can even do statistical calculations on our data. We can say, hey, calculate the mean for this, and we don't have to write our own formula for the mean, for the standard deviation, or for the correlation between different columns. All of that can be done with pandas with just a couple of keywords.

So it's really easy to do data analysis with it, because all of those functions are there. We know exactly what we want to do, but we don't have to write the code for all of it. If we want to look at correlations, we just say, hey pandas, do correlations, rather than coding that whole algorithm ourselves. That makes it really easy and really fast to get results and to get where you're heading, because you don't have to go through any of these intermediate steps. You can skip the middleman of writing all of those algorithms yourself and just use them. You have your start, you have your idea, you know exactly what you want to do, and you can do exactly that to get to your goal.

The other library that is very cool is matplotlib, which is what I use a lot for data visualization. It lets me create graphs and visualize my data, and it allows a lot of customization, so I can really move everything around. I can move my spines, I can turn things on and off, and all of these things are very easy to do with matplotlib. So these are kind of two basic Python
libraries that you should probably get to know, or you can look at some of my other courses. Pandas deals with the data analysis part, and matplotlib helps you with the data visualization part.

So that's it. That's a super basic breakdown of the three main components of the otherwise vague term "data science". If any of this has piqued your interest, then you may have a data science future ahead of you, and I encourage you to keep pursuing that interest. If you want to learn more from me, I've got a blog on my website, codingwithmax.com, that dives deeper into different topics related to data science. You can also get access to some of the resources there, such as cheat sheets and workbooks that I've compiled for you. If you're serious about learning data science, you can also check out my courses, which are designed to teach you all you need to know about data science even if you have no prior experience. And of course, if you have any questions that aren't answered on my website, you can always feel free to reach out to me personally.
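
As a rough sketch of the pandas operations mentioned in this section (dropping columns, picking out rows, and calculating the mean, standard deviation, and correlation), something like the following could work. The data and the column names (`week`, `ad_spend`, `sales`, `notes`) are made up for illustration and do not come from the course.

```python
import pandas as pd

# Hypothetical weekly data; values and column names are invented for this sketch.
df = pd.DataFrame({
    "week": [1, 2, 3, 4, 5],
    "ad_spend": [100.0, 150.0, 200.0, 250.0, 300.0],
    "sales": [20.0, 31.0, 39.0, 52.0, 60.0],
    "notes": ["a", "b", "c", "d", "e"],
})

# Drop a column we don't need for the analysis.
numeric = df.drop(columns=["notes"])

# Pick out certain rows: here, weeks where ad spend was at least 200.
high_spend = numeric[numeric["ad_spend"] >= 200]

# Built-in statistics, so no hand-written formulas are needed.
mean_sales = numeric["sales"].mean()
std_sales = numeric["sales"].std()  # sample standard deviation
corr = numeric[["ad_spend", "sales"]].corr()  # pairwise Pearson correlation

print(high_spend)
print(f"mean sales: {mean_sales:.2f}, std: {std_sales:.2f}")
print(corr)
```

Running the same script on next week's data would produce the same layout with new values, which is the kind of report automation described earlier.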

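The matplotlib customization described in this section (switching a scatter plot to a line plot by changing one word, moving spines, turning things on and off) might look roughly like this. It is only a sketch: the data, labels, and title are invented, and the `Agg` backend is assumed so that the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so no display is required
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [20, 31, 39, 52, 60]

fig, ax = plt.subplots()

# Scatter plot; replacing `scatter` with `plot` on this line gives a line plot instead.
ax.scatter(x, y)

# Turn the top and right spines off, and move the left spine to x = 0.
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_position(("data", 0))

ax.set_xlabel("week")
ax.set_ylabel("sales")
ax.set_title("Sales per week")

# Render to an in-memory PNG; pass a file name to savefig instead to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```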