Python 008: Web Scraping


This will help you:

Access websites in Python programs and scrape specific information from websites.

Web scraping can be useful for many things, such as summarizing information, trend analysis, data processing, and strange art exhibits. If you're interested, it combines well with the text generation activity. Requests is an HTTP library for Python: it lets your program fetch documents from the internet, such as web pages. Beautiful Soup is an HTML parsing library that helps you work with the HTML documents that make up web pages.
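As a rough sketch of how the two libraries fit together: Requests downloads the page, and Beautiful Soup parses the HTML it returns. The function names fetch_html and page_title and the sample page are made up for this illustration and are not part of the activity files.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as text (needs a network connection)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text


def page_title(html):
    """Parse HTML and return the text of its <title> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


# A tiny hand-written page (an assumption for this sketch), so the example
# runs without internet access.
sample_html = "<html><head><title>Hello, soup!</title></head><body><p>Hi</p></body></html>"
print(page_title(sample_html))
```

With a network connection, something like page_title(fetch_html("https://example.com")) should do the same for a live page; the URL here is only a placeholder.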

Scraping the New York Times Twitter feed.

Time: 1-2 hours / Level: B3

You should already:

Be familiar with Python syntax, and know what dictionaries and attributes are.
Have requests installed: try running pip install requests in the terminal.
Have Beautiful Soup installed: try running pip install beautifulsoup4 in the terminal.

Get the code and resources for this activity by clicking below. It will allow you to download the files from a Google Drive folder. Unzip the folder and save it in a sensible location.

Download Activity Files

Step 1: Warm-up - A little bit about HTML

Open web_scraping.py and read the comments to understand what is happening.
Run python web_scraping.py in the terminal and see what happens.
Go to a web page and look at the HTML source (Ctrl+U, or right-click and "View page source").
Inspect the HTML source (Ctrl+Shift+I in most browsers, or right-click and "Inspect"). Hover over different lines of the source code and see how different parts of the page become highlighted. Expand sections to get more specific.

To understand how Beautiful Soup works, you need to have a basic understanding of how HTML documents are laid out. This page (scroll down to "What are the tags up to?") explains how tags are used to format a page. This page has a picture under "HTML Page Structure" that helps explain the nested structure. If you look here, you'll see a bunch of different types of tags. Some of them are container tags, designed to contain other text and tags, and some of them are formatting tags, which display text and elements a certain way. Beautiful Soup treats all tags as pretty much the same type of thing. Functions that work with tags just want to know the tag name.
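To make the nested structure concrete, here is a small sketch that parses a hand-written page and prints each tag indented by how deeply it is nested. The HTML and the helper name outline are invented for this example. Notice that the code never distinguishes container tags from formatting tags; it only looks at tag names.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A made-up page with a few levels of nesting.
html = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div>
      <p>Some <b>bold</b> text.</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")


def outline(tag, depth=0):
    """Return (depth, tag name) pairs for a tag and every tag nested inside it."""
    rows = [(depth, tag.name)]
    for child in tag.children:
        if isinstance(child, Tag):  # skip the plain text between tags
            rows.extend(outline(child, depth + 1))
    return rows


# Print the tag names, indented two spaces per nesting level.
for depth, name in outline(soup.html):
    print("  " * depth + name)
```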

Step 2: Warm-up - Parsing Demo

Read the Quickstart tutorial. It's very short, and if you want to follow some of the examples, three_sisters.py and three_sisters.html can be found in this folder. If you double-click three_sisters.html, it should look like a browser page. If you edit it, you can see your changes.
Open example.py and read the comments to understand each step.
Note: tag.parent, tag.name, and tag.text are some of the attributes of tags. soup is technically a tag too, except its parent is None. You can use the names of sub-tags like attributes, as in soup.title.
Run python example.py to see what different attributes and properties get you.
Go to the URL and inspect the web page (Ctrl+Shift+I in most browsers, or right-click and "Inspect"). Try to find the parts of the HTML source that the code is searching for.
Edit the code in example.py to get different parts of the page. See what works with a different URL.
There are comprehensive explanations of navigating HTML trees and searching HTML trees, but the parts I find most useful are the first couple of sections of "Going down" and "Kinds of filters" (if that feels like too much information, skip to the arguments for the find functions).
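The attributes and filters mentioned in this step can be tried on a small page without touching the network. This is only a sketch: the HTML below is invented for the illustration and is not the content of example.py or three_sisters.html.

```python
import re
from bs4 import BeautifulSoup

# A made-up page to experiment on.
html = """
<html><head><title>Tiny page</title></head>
<body>
<h1 id="top">Fruits</h1>
<p class="intro">Some <a href="/apple">apples</a> and <a href="/pear">pears</a>.</p>
<p>See also <a href="https://example.com/plum" id="ext">plums</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attributes of tags
title = soup.title           # a sub-tag name used like an attribute
print(title.name)            # title
print(title.text)            # Tiny page
print(title.parent.name)     # head
print(soup.parent)           # None

# HTML attributes of a tag are read like dictionary entries
first_link = soup.a          # the first <a> tag in the document
print(first_link["href"])    # /apple
print(first_link.get("id"))  # None (this link has no id)

# A few kinds of filters for the find functions
links = soup.find_all("a")                                    # a tag name
headings_and_paras = soup.find_all(["h1", "p"])               # a list of names
external = soup.find_all("a", href=re.compile(r"^https://"))  # a regular expression
by_id = soup.find(id="ext")                                   # a keyword argument
intro = soup.find("p", class_="intro")                        # a CSS class

print(len(links), len(headings_and_paras), len(external))
print(by_id.text, "|", intro.get_text())
```

find_all returns a list of every match, while find returns only the first match (or None when nothing matches).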

Step 3: Activity - Scraping a Twitter feed

Open twitter_scraper.py. Read through the code and comments.
Replace each # TODO: comment with a line of code to complete the program.
To test the program, run python twitter_scraper.py and enter a Twitter username.
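The general shape of a feed scraper might look like the sketch below. It is not the solution to twitter_scraper.py: the class names tweet and tweet-text, the sample markup, and the helper names are all hypothetical. The real page uses different markup, and the live site may not serve tweets as static HTML at all, so inspect the page to find the tags that actually hold the text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names are made up for illustration.
sample_feed = """
<div class="timeline">
  <div class="tweet"><p class="tweet-text">Breaking: soup is beautiful.</p></div>
  <div class="tweet"><p class="tweet-text">Requests make the web go round.</p></div>
  <div class="ad"><p>Buy things!</p></div>
</div>
"""


def profile_url(username):
    """Build a profile URL from a username, dropping a leading @ if present."""
    return "https://twitter.com/" + username.strip().lstrip("@")


def tweet_texts(html):
    """Return the text of every <p class="tweet-text"> element, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p", class_="tweet-text")]


print(profile_url("@nytimes"))
for text in tweet_texts(sample_feed):
    print("-", text)
```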

Step 4: Activity - Simple Command-Line Wikipedia

Open wiki_scraper.py. Read through the code and comments.
Replace each # TODO: comment with a line of code to complete the program.
To test the program, run python wiki_scraper.py and enter the name of a Wikipedia page.
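One possible approach, sketched under some assumptions: turn the page name into a URL, then pull out the first paragraph that actually contains text. The helper names and the sample article are invented here and are not taken from wiki_scraper.py; real Wikipedia pages contain much more markup, so expect to refine the search after inspecting one.

```python
from urllib.parse import quote
from bs4 import BeautifulSoup


def wiki_url(page_name):
    """Turn a page name such as 'Web scraping' into an English Wikipedia URL."""
    title = page_name.strip().replace(" ", "_")
    return "https://en.wikipedia.org/wiki/" + quote(title)


def first_paragraph(html):
    """Return the text of the first non-empty <p> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text:
            return text
    return None


# A made-up stand-in for a downloaded article.
sample_article = """
<h1>Web scraping</h1>
<p>   </p>
<p><b>Web scraping</b> is extracting data from websites.</p>
<p>Second paragraph.</p>
"""

print(wiki_url("Web scraping"))
print(first_paragraph(sample_article))
```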

Step 5: Make it your own

Find a website that's interesting or useful to you and inspect the HTML source to see where you can find useful pieces of information. Write a program that searches for those segments and prints or saves them. Use the documentation listed in "Step 2: Warm-up - Parsing Demo" to learn how to do advanced HTML searches.
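As one way to "print or save" what you find, the sketch below collects every link from a page and writes the results as CSV. The sample HTML and the function names are assumptions made for the example; swap in the tags and attributes that matter on your chosen site.

```python
import csv
import io
from bs4 import BeautifulSoup

# A made-up fragment standing in for a downloaded page.
sample_html = """
<ul>
  <li><a href="/news/1">First story</a></li>
  <li><a href="/news/2">Second story</a></li>
  <li><a>No link here</a></li>
</ul>
"""


def extract_links(html):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


def write_csv(rows, file_obj):
    """Write the rows to an open file-like object, with a header line first."""
    writer = csv.writer(file_obj)
    writer.writerow(["text", "href"])
    writer.writerows(rows)


# Write to an in-memory buffer here so the example leaves no files behind.
buffer = io.StringIO()
write_csv(extract_links(sample_html), buffer)
print(buffer.getvalue())
```

To save to disk instead, open a file with open("links.csv", "w", newline="") and pass it to write_csv in place of the buffer.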

Step 6: Going further

Combine this activity with the text generation activity to randomly generate text based on someone's Twitter stream or a webpage. Combine this activity with the SMS integration activity to send text from a webpage to your phone. You could even use texting in place of the terminal entry to request webpage info via text.

