Python 008: Web Scraping
This will help you:
Access websites in Python programs and scrape specific information from websites.
Web scraping can be useful for many things, such as summarizing information, trend analysis, data processing, and strange art exhibits. If you're interested, it combines well with the text generation activity.

This activity uses two libraries. Requests is an HTTP library for Python: it lets you fetch documents over the internet, i.e. web pages. Beautiful Soup is an HTML parsing library that helps you work with the HTML documents that make up web pages.
Time: 1-2 hours / Level: B3
You should already:
Be familiar with Python syntax, and know what dictionaries and attributes are
Install requests: try running pip install requests in the terminal.
Install Beautiful Soup: try running pip install beautifulsoup4 in the terminal.
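To check that both installs worked before moving on, here's a small sanity-check sketch (stdlib only; the package names are the ones you just installed):

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can find the package."""
    return importlib.util.find_spec(package_name) is not None

for package in ("requests", "bs4"):
    print(package, "OK" if is_installed(package) else "MISSING -- rerun pip install")
```

Note that Beautiful Soup installs as beautifulsoup4 but imports as bs4.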
Get the code and resources for this activity by clicking below; this downloads the files from a Google Drive folder. Unzip the folder and save it in a sensible location.
Step 1: Warm-up - A little bit about HTML
Open web_scraping.py and read the comments to understand what is happening.
Run python web_scraping.py in the terminal and see what happens.
Go to a web page and look at the HTML source (Ctrl+U, or right-click and "View page source").
Inspect the page (Ctrl+Shift+I, or right-click and "Inspect"). Hover over different lines of the source code and see how different parts of the page become highlighted. Expand sections to get more specific.
To understand how Beautiful Soup works, you need to have a basic understanding of how HTML documents are laid out. This page (scroll down to "What are the tags up to?") explains how tags are used to format a page. This page has a picture under "HTML Page Structure" that helps explain the nested structure. If you look here, you'll see a bunch of different types of tags. Some of them are container tags, designed to contain other text and tags, and some of them are formatting tags, which display text and elements a certain way. Beautiful Soup treats all tags as pretty much the same type of thing. Functions that work with tags just want to know the tag name.
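To see this in code, here's a minimal sketch (assuming Beautiful Soup is installed) that parses a tiny hand-written page and lists every tag. The HTML here is made up for illustration:

```python
from bs4 import BeautifulSoup

# A tiny HTML document: container tags (<html>, <body>, <div>) hold
# other tags and text; formatting tags (<b>) change how text displays.
html = """
<html><body>
  <div id="main">
    <p>Some <b>bold</b> text.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# To Beautiful Soup, every tag is the same kind of object, container
# or formatting alike; functions just need the tag's name.
for tag in soup.find_all(True):  # True matches every tag
    print(tag.name)
```

Running this prints the tag names in document order: html, body, div, p, b.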
Step 2: Warm-up - Parsing Demo
Read the Quickstart tutorial. It's very short, and if you want to follow some of the examples, three_sisters.py and three_sisters.html can be found in this folder. If you double-click three_sisters.html, it should open like a page in your browser; if you edit it, you can see your changes.
Open example.py and read the comments to understand each step.
Note: tag.parent, tag.name, and tag.text are some of the attributes of tags. soup is technically a tag too, except its parent is None. You can use the names of sub-tags like attributes, as in soup.title.
Run python example.py to see what the different attributes and properties get you.
Go to the URL and inspect the web page (Ctrl+Shift+I, or right-click and "Inspect"). Try to find the parts of the HTML source that the code is searching for.
Edit the code in example.py to get different parts of the page. See what works with a different URL.
There are comprehensive explanations of navigating HTML trees and searching HTML trees, but the first couple of sections of "Going down" are what I find most useful, along with "Kinds of filters" (if that feels like too much information, skip ahead to the arguments of the find functions).
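The attributes noted above (tag.parent, tag.name, tag.text, and dotted access like soup.title) can be tried on any small document. A quick sketch, using a made-up page rather than the one in example.py:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>My Page</title></head><body><p>Hello!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title        # sub-tags work like attributes
print(title.name)         # "title"
print(title.text)         # "My Page"
print(title.parent.name)  # "head"
print(soup.parent)        # None -- soup is the top of the tree
```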
Step 3: Activity - Scraping a Twitter feed
Open twitter_scraper.py and read through the code and comments.
Replace each # TODO: comment with a line of code to complete the program.
To test the program, run python twitter_scraper.py and enter a Twitter username.
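A common pattern in scrapers like this is filtering tags by CSS class with find_all. The tag and class names below are invented for illustration; the real Twitter markup differs, and the actual selectors to use are in twitter_scraper.py:

```python
from bs4 import BeautifulSoup

# Pretend this is the HTML of a (much simplified) feed page.
# The class names "tweet" and "tweet-text" are made up for this sketch.
feed_html = """
<div class="feed">
  <div class="tweet"><p class="tweet-text">First post!</p></div>
  <div class="tweet"><p class="tweet-text">Scraping is fun.</p></div>
</div>
"""

soup = BeautifulSoup(feed_html, "html.parser")

# class_ (with the underscore) filters tags by their CSS class.
for tweet in soup.find_all("p", class_="tweet-text"):
    print(tweet.text)
```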
Step 4: Activity - Simple Command-Line Wikipedia
Open wiki_scraper.py and read through the code and comments.
Replace each # TODO: comment with a line of code to complete the program.
To test the program, run python wiki_scraper.py and enter the name of a Wikipedia page.
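One way a scraper like this might pull a page summary is to grab the first non-empty paragraph. This is a sketch of that idea, not the code from wiki_scraper.py:

```python
from bs4 import BeautifulSoup

def first_paragraph(html):
    """Return the text of the first non-empty <p> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        if p.text.strip():
            return p.text.strip()
    return None

# Example usage (needs an internet connection; page names use
# underscores instead of spaces in Wikipedia URLs):
#   import requests
#   html = requests.get("https://en.wikipedia.org/wiki/Web_scraping").text
#   print(first_paragraph(html))
```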