Easy Web Scraping in Python using Pandas for Data Science

Using pandas to Scrape Data from Websites: A Step-by-Step Guide

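To read data from websites using pandas, we can use the `read_html` function. Given a URL, it reads the page's HTML and returns the tables it finds as a list of DataFrames; passing `header=0` makes the first row of each table the column header. We can then clean up the result using pandas alone, without writing any urllib or BeautifulSoup code ourselves. The example here is the NBA player stats table for the 2018 to 2019 season on Basketball-Reference.com.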
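The video builds the page URL from a template and a year before handing it to `read_html`, so that several seasons can be scraped in a loop. A minimal sketch of that idea follows. The exact URL template and the helper names `build_url` and `read_tables` are assumptions made for illustration; the source only says that the URL contains a pair of braces that `format` fills with the year.

```python
import pandas as pd

# Assumed URL template for the player stats page; "{}" is the slot for the year.
URL_TEMPLATE = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"


def build_url(year):
    """Fill the year into the URL template with str.format (hypothetical helper)."""
    return URL_TEMPLATE.format(year)


def read_tables(url):
    """Return the tables found on the page as a list of DataFrames.

    header=0 tells pandas to use the first table row as the column names.
    Calling this performs a network request and needs an HTML parser
    such as lxml to be installed.
    """
    return pd.read_html(url, header=0)


# One URL per season, for example to scrape 2015 through 2019 in a loop.
urls = [build_url(year) for year in range(2015, 2020)]

if __name__ == "__main__":
    for url in urls:
        print(url)
    # DF = read_tables(build_url(2019))  # uncomment to actually fetch the page
```

Keeping the network call inside a function means the URL logic can be checked offline before any page is requested.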

Let's call the list returned by `read_html` `DF`. To determine how many tables are on the webpage, we can apply the `len` function to that list. In this case there is only one table, so we select it with `DF[0]`.

If there were multiple tables on the webpage, we would need to determine which one we want. For example, to select the second table we would use `DF[1]`.
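The table selection step can be tried offline on a small HTML snippet. The sketch below is only an illustration: the two-table HTML string is made up, and the names `DF_first` and `DF_second` are hypothetical. Wrapping the string in `StringIO` is how recent pandas versions expect literal HTML to be passed, and an HTML parser such as lxml must be installed.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for a downloaded page that contains two tables.
HTML = """
<html><body>
<table>
  <tr><th>Player</th><th>Age</th></tr>
  <tr><td>A</td><td>25</td></tr>
  <tr><td>B</td><td>31</td></tr>
</table>
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>X</td><td>50</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per table it finds.
DF = pd.read_html(StringIO(HTML), header=0)

# The length of the list is the number of tables on the page.
print(len(DF))

# Index 0 is the first table, index 1 the second.
DF_first = DF[0]
DF_second = DF[1]
print(DF_first.head())
```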

Now that we've selected the first table, let's assign it to a variable called `DF_2019` with `DF_2019 = DF[0]`. The name `DF_2019` now refers to the pandas DataFrame that holds our extracted data.

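Before we proceed with cleaning and analyzing our data, let's take a look at what the table looks like. We can use the `head()` method to view the first few rows. Notice that there are some missing values, and that some players appear more than once because they were part of different teams in the same year.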

Now, let's do some data cleaning. On the source page the table header repeats after every 20 players. Because `header=0` only turns the first header row into column names, each repeat shows up in the DataFrame as an ordinary data row. To find these rows we can look at the `Age` column: it should hold only numerical values, so any row where it contains the string 'Age' is a repeated header. Selecting them with `DF_2019[DF_2019.Age == 'Age']` shows that there are 26 such rows.

Next, we pass the index of those rows to the `.drop()` method, which removes them from the DataFrame, and assign the result back to `DF_2019`.
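The header removal can be sketched on a toy table. The data below is invented and much smaller than the real page; only the technique (select rows where `Age` equals the string 'Age', then drop them by index) follows the video. The names `repeated_headers` and `before_shape` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the scraped table: the third row is a repeated header row.
DF_2019 = pd.DataFrame(
    {
        "Player": ["A", "B", "Player", "C"],
        "Age": ["25", "31", "Age", "22"],
        "PTS": ["5.3", "12.1", "PTS", "20.4"],
    }
)

# Rows whose Age cell holds the literal string "Age" are repeated headers.
repeated_headers = DF_2019[DF_2019.Age == "Age"]
print(len(repeated_headers))

# Remember the shape before dropping so the two can be compared.
before_shape = DF_2019.shape

# Drop the repeated header rows by their index labels.
DF_2019 = DF_2019.drop(repeated_headers.index)
print(before_shape, DF_2019.shape)
```

The difference in row count between the two shapes should equal the number of repeated headers that were found.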

To verify that the repeated headers have been removed, compare the `.shape` of the DataFrame before and after the drop: it goes from 734 rows and 30 columns to 708 rows and 30 columns, so exactly 26 rows are gone.

Now that we've cleaned up our data, let's take a look at how many players there are for each team. We can call the `value_counts()` method on the team column to count how often each team occurs. Keep in mind that players who were part of several teams in the same year appear in more than one row.
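A small sketch of the per-team count is shown below. The column name `Tm` and the rows are assumptions for illustration; check the column names of the scraped table before using it.

```python
import pandas as pd

# Invented rows; player "A" appears twice because of a mid-season team change.
DF_2019 = pd.DataFrame(
    {
        "Player": ["A", "A", "B", "C", "D"],
        "Tm": ["BOS", "LAL", "BOS", "LAL", "MIA"],  # "Tm" is an assumed column name
    }
)

# value_counts returns a Series indexed by team, sorted by descending count.
team_counts = DF_2019["Tm"].value_counts()
print(team_counts)
```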

Next, let's display a simple histogram using the `distplot` function from seaborn (deprecated in recent seaborn releases in favor of `histplot`). The variable we plot is `PTS`, which stands for points. We set the `kde` parameter to `False` so that the y-axis shows the actual count of players in each bar instead of an estimated probability density.
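The difference between a count histogram and a density histogram can be checked numerically. The sketch below uses NumPy's `histogram` rather than seaborn, with invented point values, purely to illustrate what the y-axis means in each case.

```python
import numpy as np

# Invented point values standing in for the PTS column.
pts = np.array([0.5, 1.2, 3.4, 7.8, 8.1, 12.0, 15.5, 21.3, 27.9, 36.1])

# Bin edges 0, 5, ..., 40, giving eight bins of width 5.
bins = np.arange(0, 45, 5)

# Raw frequency: how many values fall into each bin.
counts, edges = np.histogram(pts, bins=bins)

# Density: bar heights rescaled so that the total bar area equals 1.
density, _ = np.histogram(pts, bins=bins, density=True)

print(counts)
print(density)
```

The counts add up to the number of players, while the density values multiplied by the bin width add up to 1.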

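To customize our histogram, we pass a dictionary to the `hist_kws` argument: setting the edge color to black and the line width to 2 gives each bar a visible outline, which is otherwise transparent. The fill color is changed with the `color` option, which accepts a hex code (a hash sign followed by six alphanumeric characters). Outside a notebook, the figure is displayed with matplotlib's `plt.show()`.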
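A comparable styled histogram can be sketched with matplotlib alone. This swaps in `Axes.hist` for the seaborn `distplot` call used in the video, so it is not the author's code; the data, the hex color, and the bin count are invented, and the `pd.to_numeric` step assumes that the scraped values arrive as strings.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the cleaned table; scraped values are assumed to be strings.
DF_2019 = pd.DataFrame({"PTS": ["0.5", "1.2", "3.4", "7.8", "12.0", "21.3", "36.1"]})

# Convert the strings to numbers before plotting.
pts = pd.to_numeric(DF_2019["PTS"])

fig, ax = plt.subplots()

# Bars show raw counts; edgecolor and linewidth outline each bar,
# and color sets the fill using a hex code.
counts, edges, _ = ax.hist(
    pts,
    bins=8,
    color="#00BFC4",
    edgecolor="black",
    linewidth=2,
)
ax.set_xlabel("PTS")
ax.set_ylabel("Number of players")

# fig.savefig("pts_hist.png")  # or plt.show() in an interactive session
plt.close(fig)
```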

Finally, let's take a look at what the histogram tells us. In this example, most players have between 0 and 20 points, while very few have more than 20, and fewer than 10 players reach 35 points or more. High scorers are rare, and the distribution is strongly skewed toward low point totals.

In conclusion, using pandas to scrape data from websites is a simple and effective way to extract tables from webpages and perform data cleaning and analysis tasks. By following these steps, you can apply this technique to your own projects and grow your data science portfolio.

Video Transcript

Have you ever wanted to scrape data from websites in Python? There are packages available such as urllib and BeautifulSoup, but wouldn't it be better if there were a much simpler way? So in this video I'm going to show you how you can use the pandas `read_html` function to scrape data from websites. So without further ado, let's get started.

So the first thing that you want to do is fire up your web browser and head over to the GitHub of the Data Professor. Then you want to click on the code repository, click on the Python repository, and then scroll down and find "pandas read_html for web scraping". Click on that, then right-click on the Raw link, choose "Save link as", and save it to your computer. A second way is to fire up Google Colab, click on Open notebook, click on the GitHub tab, type in "data professor", scroll down, and pick the pandas `read_html` notebook. Because I have that, I will open up my local version. For this tutorial I will be clearing out the output, so if you want to follow along, please do so. I'll click on Edit on the menu bar and then Clear all outputs, so only the input cells will be shown and we can do this together. But if you don't have access to an interactive version, you can follow along using the GitHub.

The first thing that we want to do is check out the website that we are going to scrape our data from. Say that we want data from the year 2019. The data will be coming from Basketball-Reference.com, and this is the NBA player stats for the 2018 to 2019 season. Let's have a look at what's in the table. You will be seeing all of the players, and notice that the header is shown right here with various fields such as the rank, the player name, the position, the age, the team, the number of games played, et cetera. Notice that the header will repeat itself every 20 players: you see that there's 1 through 20 here, then the header is repeated, then you have 21 through 40, and then the same thing repeats every 20 players. So we're going to have to delete the subsequent headers and keep only the first header, and we're going to do that inside the Jupyter notebook.

So head back to the Jupyter notebook. We have two methods of doing this: we can either use the URL directly, as in method 2, or we can do it programmatically, as in method 1, where we break it up into building blocks. The building blocks are the URL component and the year. The URL component will be this line here, and the year will be replaced by the opening and closing braces. We're going to use the `format` function to do that, which will replace the opening and closing braces with the year, because the argument is `year` right here. So we're combining it using the string which we define here, which is the URL, and this URL string contains the opening and closing braces. Then we use the `format` function with the year as the argument, and the year is 2019. Let's run this, and here we get the URL. The benefit of this first approach is that you can do it programmatically if you want to web scrape several years, for example 2015 to 2019: you make a list of the years and then you make a for loop. Let's run that, and you get five URLs with the years changing here. You could try this out; that's your homework, and let me know how it goes.

All right, so we have already defined the URL, and now let's use pandas to read in the web page. The first thing is to load in the pandas package, so `import pandas as pd`. Then we're going to define a DataFrame variable, and to it we will assign the contents from the function `pd.read_html`. As arguments there are two parts: the first is the URL, which is the string that contains the URL of the website, and the second is `header=0`, so that we define the first row as the header. Let's run this, and we see that the data is loaded into this DataFrame. So you see that it easily loads the content from the HTML web page without using any urllib. Then we're going to do the beautification of this DataFrame directly using pandas, so no additional libraries are needed, just plain pandas. You just read in the HTML and then you beautify it using pandas.

Let's continue. Let's see how many tables are in the web page. Using the `len` function, we see that there is only one table, so this is really straightforward. In cases where web pages contain multiple tables, you will have to determine which table you want. Here we're going to select the first table, so we type in `DF[0]`. If you want to select a second table, if there is one, then you type in `DF[1]`, and do the same for any particular table that you want selected in cases where there are multiple tables on the web page. Because this web page has only one table, we will use 0, which represents the first table. It looks pretty neat; it looks really nice here. Notice that there are some missing values, and notice that some players have multiple occurrences because they have been part of different teams in the same year. So we're going to assign the first table to the `DF_2019` variable.

Now let's do some data cleaning. I mentioned before that the table header repeats every 20 players, so let's remove the second and subsequent header rows, because there will be more than 10. What we need to do is write `DF_2019` and, in the brackets, use as argument `DF_2019.Age`, so we're selecting the column Age. In that column we're going to look for the string "Age", because the Age column otherwise has only numerical values, and whenever we see the word "Age" we will remove the entire row, since that entire row is a table header. Let's do that. You see that all of the subsequent table headers are selected for this entire DataFrame. Let's have a look at the length, how many headers there are: there are a total of 26 headers. Then we're going to use the `DF_2019.drop` function; the `.drop` function will allow us to drop all of these rows from the DataFrame. Now let's have a look at the dimensions of the table again, and now we have (708, 30). Let's look at the before, `DF_2019.shape`: before we have 734 rows and 30 columns, and here we have 708 rows and 30 columns, so 26 are indeed removed.

Now let's do a quick exploratory data analysis. In this tutorial we're going to use seaborn. Here we're going to display a simple histogram using the `distplot` function, and the variable we will be using is the points, `PTS`. `kde` will be `False` because we want to retain the original frequency; otherwise, if `kde` is `True`, it will be the probability that is shown here. I can show you: if it's `True`, then it won't be the frequency, it'll be the probability here. So we'll set it to `False` because we want the actual frequency, the count for each bar of how many players have that many points. So between zero and one, so one point, how many players are there? About 40-something players have one point, and the number of players having 35 points or more will be less than 10. So we see that the majority have points between 0 and 20, and very few have points greater than 20.

Let's say that we want to change the bar line color, because right now it's transparent. We're going to use a dictionary here inside `hist_kws`: we define the edge color of the bars to be black, and we make the line width 2, right here. And let's say that we want to change the fill color to another color: we use the `color` option, and you can use the hex code. The hex code is the hash sign followed by six alphanumeric characters.

And so congratulations, you have successfully used the pandas `read_html` function to scrape data from websites. To read in the web page content, you simply use the `read_html` function, and the subsequent process is using pandas to pre-process the data by removing the redundant table headers, looking for missing values, et cetera. As always, the best way to learn data science is to do data science, so please feel free to modify this notebook to scrape data from other websites and upload it to your GitHub so that you can grow your data science portfolio. Enjoy the journey. Thank you for watching; please like, subscribe, and share, and I'll see you in the next one. But in the meantime, please check out these videos.