Python Tutorial - Web Scraping With Python

Hi Everyone,

My name is Thomas Laetsch, and I'm currently a data scientist at the Center for Data Science at New York University. In this course, Web Scraping with Python, you will learn some of the fundamental techniques of computational web scraping: you will learn to create software that automates data extraction from online sources. Before moving on to specifics, let me show why these techniques can be a valuable addition to your data science know-how.

These skills apply in many contexts. Businesses employ people with web scraping experience because there is a lot they can gain from it. For example, they can scrape competitor sites to gather prices for similar products or services and adjust their own price set points accordingly. They can scrape online reviews to gauge public opinion about their products, their services, and the company in general. They can also scrape social media sites or other public forums for contact or other information about clients or potential clients, so they can direct resources towards that group more effectively.

Of course, web scraping is not just limited to businesses; it's also a fun activity that you can engage in on your own. For instance, you could search for your favorite memes on popular websites, scour through classified ads looking for specific items, or look for trending topics on social media sites. You might even be interested in cooking blogs and find recipes to try out. The possibilities are endless, and I'm excited to share these techniques with you.

One of the projects I've worked on here at the Center for Data Science, together with a sociologist, is collecting the data for the website AmericanViolence.org. As was famously implied by former FBI Director James Comey, crime data has not been easy to collect and analyze across city agencies in the United States. It turns out, though, that many such agencies publish this data online. My collaborators and I collect, process, and format these data into a single repository, starting with murder data for some of the largest cities in the U.S.

Many of the techniques you will learn in this course are the same ones I used to collect data for AmericanViolence.org, which now helps track trends in murders across the United States. In this course, we'll focus on the acquisition of online data, using Python and the web crawling framework Scrapy.

To better see what this course covers, we can roughly break the web scraping pipeline into three pieces: setup, acquisition, and processing. The first piece is the setup: defining the goal or task and identifying the online sources that you believe will help you achieve the desired result. The second is the acquisition of the online data, which includes accessing the data, parsing it, and extracting it into meaningful and useful data structures. The third is processing, where you run the downloaded data through whatever analyses are needed to achieve the goal.
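To make the acquisition piece a bit more concrete, here is a minimal sketch of the parse-and-extract steps, followed by a one-line processing step. It is only an illustration, not the approach used later in the course: it relies on Python's standard-library `html.parser` rather than Scrapy, and the page content, the class names `product`, `name`, and `price`, and the helper `extract_products` are all made up for this example. The "access" step would normally be an HTTP request; here an inline string stands in for the downloaded page so the sketch runs offline.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded product listing.
# In a real scraper this string would come from an HTTP response.
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">$9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">$24.50</span></div>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price">
    elements found inside <div class="product"> blocks."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field the parser is currently inside, if any
        self.products = []         # one dict per product block seen so far

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "product":
            self.products.append({})  # start a new record
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self.current_field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a field we care about.
        if self.current_field and self.products:
            self.products[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def extract_products(html):
    """Parse the HTML and return a list of {'name': str, 'price': float} dicts."""
    parser = ProductParser()
    parser.feed(html)
    return [
        {"name": p["name"], "price": float(p["price"].lstrip("$"))}
        for p in parser.products
        if "name" in p and "price" in p  # skip incomplete records
    ]


# Acquisition: parse the page and extract a useful data structure.
products = extract_products(SAMPLE_HTML)

# Processing: a trivial analysis on the extracted data.
cheapest = min(products, key=lambda p: p["price"])
print(cheapest["name"])
```

A framework such as Scrapy replaces the hand-written parser with selectors and also handles the requests themselves, but the three stages stay the same.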

We chose Scrapy because it lets us jump in quickly and scale easily to large scraping projects. However, even if you aren't sold on Scrapy or Python, the techniques and intuition you build in this course will be valuable in any computational web scraping environment.

I hope you're as thrilled as I am to take part in this course and gain the skills to start scraping the web for whatever interests you!

Hi everyone, my name is Thomas Laetsch. I'm currently a data scientist working in the Center for Data Science at New York University. In this course, Web Scraping with Python, you will learn some of the fundamental techniques in computational web scraping; that is, you will learn to create software to automate data extraction from online sources. Before moving to specifics and technicalities, let me convince you that these techniques can be a valuable addition to your data science know-how, and that this course will be the perfect place to start or strengthen the foundational pieces of this skill set.

You might ask yourself why businesses might employ those with experience web scraping. What can businesses gain from web scraping? Well, they can scrape competitor sites to gather prices for similar products or services, to compare and adjust their own price set points. They can scrape online reviews of their products or services and gather public opinion around the company in general. They can scrape social media sites or other public forums for contact or other information of clients or potential clients, to meaningfully direct resources towards this group of possible customers. And this is just a short list.

We list here a few fun things you can do scraping the web. You could search for your favorite memes from your favorite sites. You can scour through classified ads looking for your favorite things. You can look for trending topics on social media sites. You could look for recipes you might be interested in on cooking blogs. In fact, there's a whole lot you can do.

Now let me give you an example that I've worked on here at the Center for Data Science. While working with an amazing sociologist, I have been heavily involved in collecting the data for the website AmericanViolence.org. As was famously implied by the former FBI director James Comey, crime data has not been easy to collect and analyze across city agencies in the United States. It turns out, though, that many such agencies publish this data online. The work I've done with my collaborators is to collect, process, and format these data into a single repository, starting with murder data for some of the largest cities in the U.S. So realize this: many of the techniques you will learn in this course are the same that I used to collect data for AmericanViolence.org, which is now helping track trends in murders across the United States.

To better visualize the focus of what you will learn in these lectures and exercises, let's roughly break down the web scraping pipeline into three pieces. The first piece is the setup; that is, defining the goal or task and identifying the online sources which you believe will help you achieve the desired end result. The second is the acquisition of these online data. This includes accessing the data, parsing this information, and extracting these data into meaningful and useful data structures. The third is the processing phase, where you run these downloaded data through whatever analyses or processes are needed to achieve the desired goal.

This course focuses on the acquisition phase. To accomplish this, we will be using Python and the web crawling framework Scrapy. We chose Scrapy since we can jump in quickly and easily scale to large scraping projects. However, even if you aren't sold on using Scrapy or Python, you will still build techniques and intuition that will be valuable in any computational web scraping environment you enjoy. So I hope you're as thrilled as I am to take part in this course and gain the skills to start scraping the web for whatever