Python 008: Web Scraping

This will help you:

Access websites in Python programs and scrape specific information from websites.

Web scraping can be useful for many things, such as summarizing information, trend analysis, data processing, and strange art exhibits. If you're interested, it combines well with the text generation activity.

Requests is an HTTP library for Python: it lets your program access documents on the internet, such as web pages. Beautiful Soup is an HTML parsing library that helps you work with the HTML documents that make up web pages.
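As a rough sketch of how the two libraries fit together: Requests downloads the page, and Beautiful Soup parses the HTML it returns. The function names fetch_html and page_title are made up for this illustration and do not come from the activity files. The sample page is hand-written so the parsing part runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as text (needs a network connection)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on errors such as 404
    return response.text


def page_title(html):
    """Parse HTML and return the text of the <title> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.title
    return title_tag.get_text() if title_tag is not None else None


# A tiny hand-written page (an assumption for this sketch, not a real site).
sample_html = "<html><head><title>Hello, scraper</title></head><body><p>Hi!</p></body></html>"
print(page_title(sample_html))
```

With a network connection, page_title(fetch_html(url)) should do the same for a live page, where url is any address you choose.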

Scraping the New York Times Twitter feed.


Time: 1-2 hours / Level: B3

You should already:

  • Be familiar with Python syntax, and know what dictionaries and attributes are

  • Have requests installed: try running pip install requests in the terminal.

  • Have Beautiful Soup installed: try running pip install beautifulsoup4 in the terminal.

Get the code and resources for this activity by clicking below. It will allow you to download the files from a Google Drive folder. Unzip the folder and save it in a sensible location.

Step 1: Warm-up - A little bit about HTML

  • Open web_scraping.py and read the comments to understand what is happening.

  • Run python web_scraping.py in the terminal and see what happens.

  • Go to a web page and look at the HTML source (Ctrl+U, or right-click and "View page source").

  • Inspect the HTML source (Ctrl+Shift+I, or right-click and "Inspect"). Hover over different lines of the source code and see how different parts of the page become highlighted. Expand sections to get more specific.

To understand how Beautiful Soup works, you need to have a basic understanding of how HTML documents are laid out. This page (scroll down to "What are the tags up to?") explains how tags are used to format a page. This page has a picture under "HTML Page Structure" that helps explain the nested structure. If you look here, you'll see a bunch of different types of tags. Some of them are container tags, designed to contain other text and tags, and some of them are formatting tags, which display text and elements a certain way. Beautiful Soup treats all tags as pretty much the same type of thing. Functions that work with tags just want to know the tag name.
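To make the nesting concrete, here is a small sketch on a hand-written page (the markup and the helper name ancestors are invented for illustration). It shows that a container tag like div and a formatting tag like b are looked up the same way, by tag name.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div id="main">
      <h1>Soup of the day</h1>
      <p>Today we have <b>tomato</b> soup.</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# A container tag (<div>) and a formatting tag (<b>) are found the same way:
# by passing the tag name to find().
container = soup.find("div")
bold = soup.find("b")

# find_all(True) matches every tag nested inside the container.
print(container.name, "contains", [tag.name for tag in container.find_all(True)])
print(bold.name, "is nested inside", bold.parent.name)


def ancestors(tag):
    """Return the names of the tags that enclose `tag`, innermost first."""
    # The soup object itself is named "[document]", so it is left out.
    return [parent.name for parent in tag.parents if parent.name != "[document]"]


print(ancestors(bold))
```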

Step 2: Warm-up - Parsing Demo

  • Read the Quickstart tutorial. It's very short, and if you want to follow some of the examples, three_sisters.py and three_sisters.html can be found in this folder. If you double-click three_sisters.html, it should look like a browser page. If you edit it, you can see your changes.

  • Open example.py and read the comments to understand each step.

    Note: tag.parent, tag.name, and tag.text are some of the attributes of tags. soup is technically a tag too, except its parent is None. You can use the names of sub-tags like attributes, as in soup.title.

  • Run python example.py to see what different attributes and properties get you.

  • Go to the URL used in example.py and inspect the web page (Ctrl+Shift+I, or right-click and "Inspect"). Try to find the parts of the HTML source that the code is searching for.

  • Edit the code in example.py to get different parts of the page. See what works with a different URL.

    The Beautiful Soup documentation has comprehensive explanations of navigating and searching HTML trees. The parts I find most useful are the first couple of sections of "Going down" and "Kinds of filters" (if that feels like TMI, skip to the arguments for the find functions).
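The attributes and find functions mentioned in this step can be tried out on a small hand-written page. This is only a sketch: the markup below is invented for illustration and is not the contents of example.py or three_sisters.html.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Reading list</title></head>
  <body>
    <p class="intro">Three books worth a look:</p>
    <a class="book" href="/books/1">Dune</a>
    <a class="book" href="/books/2">Emma</a>
    <a class="other" href="/about">About</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Sub-tag names work like attributes: soup.title is the first <title> tag.
title = soup.title
print(title.name)         # the tag's name
print(title.text)         # the text inside the tag
print(title.parent.name)  # the name of the tag that encloses it
print(soup.parent)        # the soup object itself has no parent

# find() returns the first match; find_all() returns a list of every match.
first_link = soup.find("a")
book_links = soup.find_all("a", class_="book")

# HTML attributes such as href are read like dictionary entries.
print(first_link["href"])
print([link.text for link in book_links])
```

Note the trailing underscore in class_: class is a reserved word in Python, so Beautiful Soup uses class_ as the keyword argument for filtering by CSS class.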

Step 3: Activity - Scraping a Twitter feed

  • Open twitter_scraper.py. Read through the code and comments.

  • Replace each # TODO: comment with a line of code to complete the program.

  • To test the program, run python twitter_scraper.py and enter a Twitter username.
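The general shape of a feed scraper might look like the sketch below. It is not the contents of twitter_scraper.py: the function names, the sample markup, and the tweet-text class are all hypothetical, so use Inspect on the real page to find the tags and classes that actually hold the text you want.

```python
from bs4 import BeautifulSoup


def profile_url(username):
    """Build the profile URL for a username, dropping a leading @ if present."""
    return "https://twitter.com/" + username.lstrip("@")


def extract_posts(html, css_class):
    """Return the text of every <p> tag that has the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text().strip() for p in soup.find_all("p", class_=css_class)]


# Hypothetical markup standing in for a downloaded feed page.
sample_feed = """
<div class="stream">
  <p class="tweet-text">Breaking: soup is <b>hot</b>.</p>
  <p class="promo">Sign up today!</p>
  <p class="tweet-text">Update: soup is now cold.</p>
</div>
"""

print(profile_url("@nytimes"))
for post in extract_posts(sample_feed, "tweet-text"):
    print("-", post)
```

Keeping the URL building and the parsing in separate functions makes it possible to test the parsing on saved or hand-written HTML before pointing the program at a live page.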

Step 4: Activity - Simple Command-Line Wikipedia

  • Open wiki_scraper.py. Read through the code and comments.

  • Replace each # TODO: comment with a line of code to complete the program.

  • To test the program, run python wiki_scraper.py and enter the name of a Wikipedia page.
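One possible approach, sketched under assumptions: turn the page name into an English Wikipedia URL, then print the first non-empty paragraph of the page. The names wiki_url and first_paragraph are invented here and may not match wiki_scraper.py, and the sample article is hand-written rather than downloaded.

```python
from urllib.parse import quote

from bs4 import BeautifulSoup


def wiki_url(page_name):
    """Turn a page name such as 'Web scraping' into an English Wikipedia URL."""
    # Wikipedia URLs use underscores in place of spaces.
    return "https://en.wikipedia.org/wiki/" + quote(page_name.strip().replace(" ", "_"))


def first_paragraph(html):
    """Return the text of the first non-empty <p> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    for paragraph in soup.find_all("p"):
        text = paragraph.get_text().strip()
        if text:
            return text
    return None


# Hand-written stand-in for an article page (not real Wikipedia markup).
sample_article = """
<h1>Soup</h1>
<p>   </p>
<p><b>Soup</b> is a primarily liquid food.</p>
<p>It is often served warm.</p>
"""

print(wiki_url("Web scraping"))
print(first_paragraph(sample_article))
```

To try it on a live page, download the HTML with requests.get(wiki_url(name), timeout=10).text and pass the result to first_paragraph.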

Step 5: Make it your own

Find a website that's interesting or useful to you and inspect the HTML source to see where you can find useful pieces of information. Write a program that searches for those segments and prints or saves them. Use the documentation listed in Step 2 ("Warm-up - Parsing Demo") to learn how to do advanced HTML searches.
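As one example of searching for segments and saving them, the sketch below collects every link on a page and writes the results as CSV. The helper names and the sample markup are assumptions made for illustration. It writes to an in-memory buffer so it runs anywhere; a real program would write to a file instead.

```python
import csv
import io

from bs4 import BeautifulSoup


def collect_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text().strip(), a["href"]) for a in soup.find_all("a", href=True)]


def write_csv(rows, out_file):
    """Write the rows to an open text file as CSV with a header line."""
    writer = csv.writer(out_file)
    writer.writerow(["text", "href"])
    writer.writerows(rows)


# Hand-written sample page; the last <a> has no href and is skipped.
sample_html = """
<ul>
  <li><a href="/recipes/minestrone">Minestrone</a></li>
  <li><a href="/recipes/pho">Pho</a></li>
  <li><a name="top">No link here</a></li>
</ul>
"""

links = collect_links(sample_html)
buffer = io.StringIO()  # swap for open("links.csv", "w", newline="") to save to disk
write_csv(links, buffer)
print(buffer.getvalue())
```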

Step 6: Going further

Combine this activity with the text generation activity to randomly generate text based on someone's Twitter stream or a webpage. Combine this activity with the SMS integration activity to send text from a webpage to your phone. You could even use texting in place of terminal input to request webpage info by text message.