#26 AI for Good Specialization [Course 1, Week 2, Lesson 2]

Exploring Air Quality Data: A Step-by-Step Guide

The first step in exploring air quality data is to examine the distribution of values in each column. Histograms provide a visual representation of each distribution, and counting the missing entries in each column shows how much data is absent. Together, these give us a sense of the characteristics of our data set.

When examining the data, it becomes clear that there is a significant amount of missing data: between 15,000 and 30,000 missing values per column in a data set of roughly 160,000 rows. In other words, between 10 and 20% of the time, a sensor could not produce a reading for one of the pollutants it was meant to detect. Replacing these missing values with estimates will be one of the challenges in our analysis.
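The missing-value check described above can be sketched with pandas. The course notebook loads the real RMCAB data into a DataFrame called `raw_data`; here a small synthetic frame with assumed column names stands in so the snippet runs on its own.

```python
# Count missing values per column, as the notebook's missing-data cell does.
# The tiny DataFrame below is a stand-in for the real raw_data.
import numpy as np
import pandas as pd

raw_data = pd.DataFrame({
    "PM2.5": [12.0, np.nan, 35.0, np.nan, 8.0],
    "PM10":  [20.0, 41.0, np.nan, 55.0, 18.0],
})

missing_per_column = raw_data.isna().sum()   # absolute counts of NaNs
missing_fraction = raw_data.isna().mean()    # share of rows missing, per column
print(missing_per_column)
print(missing_fraction)
```

On the real data set, `missing_fraction` is where the 10–20% figure comes from.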

One way to visualize the distribution of PM 2.5 values for a particular station is to create a histogram. This allows us to see the range of values and get a sense of the shape of the distribution. For example, we can see that PM 2.5 at the USM station covers a range from around zero to approximately 70, with a peak at around 5. The distribution is uneven and skewed towards lower values, which is good news for air quality.
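A minimal version of this per-station histogram might look like the sketch below. The course's `utils` module wraps this in an interactive selector; the station name, column names, and right-skewed synthetic data here are assumptions standing in for the real `raw_data`.

```python
# Plot the PM2.5 histogram for one station (synthetic stand-in data).
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
raw_data = pd.DataFrame({
    "Station": ["USM"] * 200,
    "PM2.5": rng.gamma(shape=2.0, scale=5.0, size=200),  # right-skewed, like the lecture describes
})

station_values = raw_data.loc[raw_data["Station"] == "USM", "PM2.5"].dropna()
counts, edges, _ = plt.hist(station_values, bins=30)
plt.xlabel("PM2.5 (µg/m³)")
plt.ylabel("count")
plt.title("PM2.5 distribution at USM")
plt.savefig("pm25_usm_hist.png")
```

Swapping the station filter or the column name reproduces the "choose a different station or pollutant" behaviour of the notebook's selectors.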

The histogram provides a clear visual representation of the data and allows us to get a sense of the characteristics of our data set. We can use selectors to choose different stations or pollutants and investigate the distribution of each pollutant at each station. This is an important step in getting familiar with the characteristics of our data set and considering whether AI might add value to the project.

In addition to histograms, we can also create box plots to visualize the distribution of data. A box plot shows the median, the interquartile range (the shaded box capturing the middle 50% of the data), and any outliers. By examining these plots, we gain a different perspective on the data and can compare it across stations. For example, when looking at PM 2.5 values for all stations, we can see that there are only a few instances where the value exceeds 50, and even fewer where it exceeds 100.
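The quantities a box plot draws (quartiles and median per station) can be computed directly with pandas, as sketched below. The station names and values are synthetic stand-ins for the RMCAB data, and the column names are assumptions.

```python
# Per-station quartiles -- the numbers a box plot visualizes.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
raw_data = pd.DataFrame({
    "Station": np.repeat(["USM", "Kennedy", "Suba"], 100),
    "PM2.5": rng.gamma(2.0, 5.0, size=300),
})

box_stats = raw_data.groupby("Station")["PM2.5"].quantile([0.25, 0.5, 0.75]).unstack()
box_stats.columns = ["Q1", "median", "Q3"]
print(box_stats)

# matplotlib can draw the same comparison in one call:
# raw_data.boxplot(column="PM2.5", by="Station")
```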

The box plots provide valuable insights into the characteristics of our data set and allow us to compare the distributions of different pollutants across different stations. This is an important step in getting familiar with the data and considering how AI might be used to analyze and understand it.

One challenge with this data set is that high pollution readings are rare: there are few data points where PM 2.5 exceeds 50, and very few above 100. While that is good news for Bogota's air quality, it means we have little training data at the high end, so our machine learning model may be less accurate when predicting missing values or interpolating between stations precisely when pollution levels are high and predictions matter most. We will need to address this imbalance in our analysis and consider how it might impact the accuracy of our predictions.
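One hedged way to quantify that imbalance is to measure how small a share of readings falls in the high-pollution range. The thresholds of 50 and 100 follow the lecture's discussion; the data below is a synthetic right-skewed stand-in.

```python
# How much training data exists above the high-pollution thresholds?
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
pm25 = pd.Series(rng.gamma(2.0, 8.0, size=10_000))  # skewed toward low values

share_above_50 = (pm25 > 50).mean()
share_above_100 = (pm25 > 100).mean()
print(f"> 50: {share_above_50:.2%} of readings, > 100: {share_above_100:.2%}")
```

A very small share above the thresholds is exactly the situation where a model trained on this data has the least evidence to learn from.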

To gain a deeper understanding of the data, we can also create scatter plots for any two pollutants and compare their distributions. These plots show the relationship between the values of two different pollutants and allow us to get an intuition for how strongly they might correlate. By examining individual scatter plots and histograms, we can gain a better understanding of the characteristics of our data set and consider how AI might be used to analyze and understand it.
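The scatter-plot comparison above can be backed by a correlation coefficient. The sketch below generates correlated synthetic PM 2.5/PM10 values as a stand-in for the notebook's `raw_data`; the column names and the strength of the relationship are assumptions.

```python
# Quantify the relationship between two pollutants with Pearson correlation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
pm25 = rng.gamma(2.0, 5.0, size=500)
pm10 = pm25 * 1.8 + rng.normal(0, 3, size=500)  # PM10 tracks PM2.5 with noise
raw_data = pd.DataFrame({"PM2.5": pm25, "PM10": pm10})

r = raw_data["PM2.5"].corr(raw_data["PM10"])  # Pearson r
print(f"Pearson r between PM2.5 and PM10: {r:.2f}")
# raw_data.plot.scatter(x="PM2.5", y="PM10") would show the same relationship visually
```

A value of `r` near +1 matches the lecture's observation of a strong positive PM 2.5/PM10 correlation; negative values would flag the inverse relationships the lecture suggests looking for.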

In conclusion, exploring air quality data requires careful examination of the distribution of values across different columns, stations, and pollutants. By using visualizations such as histograms and box plots, we can get a sense of the characteristics of our data set and gain insights into how different pollutants correlate with each other. As we move forward in our analysis, we will need to address challenges such as missing values and consider how AI might be used to improve the accuracy of our predictions.

Transcript:

In this video I'll walk you through the first part of the data exploration notebook for the Bogota air quality project. Here you'll be looking at some summary statistics and visualizations about the data to get a better sense of the characteristics of the data set itself, plus any challenges that you might face in running analyses over the set.

Okay, so here we are inside the notebook. First off, here at the top of the notebook is a link to the website hosted by the city of Bogota, and you can also view this website in English. Here you can see some more details about the air quality monitoring network in Bogota, the RMCAB, and this is where the data set comes from that you'll be working on in these labs. You can explore this site to see what their mapping application looks like today and get some sense of how they're using this data. You can also download any of the data if you like by going to Reports up here and then Station Report.

Jumping back into the notebook, the first thing to do will be to run this cell to import all the necessary packages, just like we saw at the start of the previous notebook. By importing this one called utils here, I'm importing all the functionality that is defined in this file called utils.py that's in the same folder as your notebook. You can see what's in the utils.py file by clicking on the Jupyter icon up here and opening the file. The way we've set up these labs is that for each notebook we've defined various functions here in the utils file that you will use, and if you're already a Python programmer you might find some of this stuff useful in your own projects, both the visualization tools as well as the modeling functions, so please have a look.

Heading back to the notebook, here we'll just run this first cell to import the packages. Then I'm going to read in the data set with this cell, and what I'm doing is reading it into a variable called raw_data. What I'm also doing here is printing out how many rows are contained in the data set, as well as the contents of the first five rows of data, to spot check that everything was read in successfully. You can see that this is the same data set that you were looking at in a spreadsheet previously, where you have various pollutants in different columns as well as a station identifier and a timestamp listing the date and time.

In the following cell, this command prints out how many data points are missing from each column, and each of these numbers will give you a sense of how much missing data there is in each column. You should be able to see that with 15 to 30,000 missing values per column in a data set of roughly 160,000 rows, you're missing between 10 and 20% of the data across various columns, so this will be a significant issue in your analysis. Recall this means that it is likely that, between 10 and 20% of the time, various sensors tended not to be able to produce a reading for one of the pollutants they were trying to detect. Replacing missing values with an estimate will be one of the challenges that you'll tackle in the next notebook.

When you run this next cell, you'll see the distribution of PM 2.5 values for a particular station plotted here in a histogram. Viewing the data like this can give you a sense of the distribution of values. For example, I can see here that the PM 2.5 at the USM station covers a range of around zero to probably around 70 there, with a peak at around, what's that, about five. So we have a very uneven distribution which is skewing towards a lower value, and this is obviously a good thing; we want a lower value of PM 2.5 most of the time. You can use these selectors here to choose a different station or different pollutant and investigate the distribution of each pollutant at each station. It's important to get a sense of what the distribution of your data looks like across different features, in this case different pollutants and different sensor stations. In this way you're able to familiarize yourself with the characteristics of your data set as you consider whether AI might add value to the project that you're working on.

When you run this next cell, you generate another type of graph known as a box plot, for a particular pollutant, PM 2.5 in this case, and now for all the stations at once. You'll have the station name on the x-axis along the bottom here, and the PM 2.5 values and ranges on the y-axis here. The way that these box plots work is that the horizontal line in the middle of each box shows the median for the distribution, and the extent of the shaded box shows the range that captures 50 percent of the data around the median, or to put it another way, the second and third quartiles of the data. Individual data points are also plotted here along the vertical axis. This plot is essentially showing the same information that you were looking at in the histograms, but now in a slightly different way. If you imagine you're looking down from on top of the histograms, then you can see the range of values covered along the vertical axis and get a different perspective on what the distributions look like, including the full range of the data and any outliers.

So when I look at the data here, there are a couple of things that give me pause, all around the data being very sparse at the higher values. This is good for the city of Bogota and the citizens there: the cases where PM 2.5 is above 50 are pretty rare, and above 100, very rare. And so while that's good for air quality, what it does mean is that we don't have that many data points to train on at the points where the air pollution is the worst. That tells me that I don't expect my machine learning model to be as accurate in predicting missing values, or predicting the values in between stations, when the pollution is potentially at its most dangerous. We'll come back to that later in the lab, where we'll talk about how inaccurate predictions when the pollution is bad, whether that's a false positive or a false negative, could potentially be more harmful for a real-world application.

The next cell in the notebook allows you to create scatter plots for any two pollutants and compare the distributions to one another. You can see that you have one pollutant along the x-axis and another along the y-axis, and each point shows the value of each of those two pollutants for a sensor measurement made at that particular time and station. First we can look at PM 2.5 plotted against PM10, and not surprisingly there is a strong positive correlation between the two. I recommend that you click around and have a look at different pollutants across different stations, and see which of those correlate; get an intuition for how strongly they might correlate, and also whether some might show a negative rather than a positive correlation. Looking at individual scatter plots and histograms and other visuals like these is a really common step in what's called exploratory data analysis, where you're trying to get a sense of what the characteristics of a data set are. So please join me in the next video to continue your exploratory data analysis, and we'll wrap up this lab on exploring air quality data.