Web Scraping: BeautifulSoup Practice with Wikipedia

Read the BeautifulSoup Docs

What is BeautifulSoup?

BeautifulSoup is a Python library that makes it easier to scrape data from web pages. Most web pages are written in HTML, which stands for HyperText Markup Language. HTML is simply a format for organizing web content using a predefined set of opening and closing tags. Here is an example of an HTML document:

<!DOCTYPE html>
<html>
<head>
    <title>HTML Example Document</title>
</head>
<body>
    <h1>My HTML Doc</h1>
    <p>This is a sample HTML document.</p>
    <h2>Hello, World!</h2>
    <p>This is my greeting.</p>
    <h2>See ya later, World!</h2>
    <p>This is my see ya later.</p>
</body>
</html>

BeautifulSoup is used to navigate these tags, as it provides methods (functions) for parsing both HTML and XML documents. To use BeautifulSoup, you must first install it from your terminal and then import the class at the beginning of your Python script.

Terminal

$ pip install beautifulsoup4

webscrapingscript.py

from bs4 import BeautifulSoup

Import requests

In this script, I want to parse a Wikipedia page. BeautifulSoup will allow me to parse the HTML from the Wikipedia page, but first I must request the resource over the internet. I can do this within my Python script using the requests library. The requests library sends HTTP requests using methods like get, post, put, delete, and more; requests.get is the one commonly used for retrieving a resource. So the first thing I need to do is import the library (installing it first with pip install requests if it isn't already available).

import requests
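
As a quick sanity check, here is a minimal sketch of what requests.get gives back. The URL here is just an example; the Response object exposes the status code, headers, and body:

import requests

# Minimal sketch: fetch a page and inspect the Response object.
response = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
print(response.status_code)               # 200 on success
print(response.headers["Content-Type"])   # e.g. text/html; charset=UTF-8
html_bytes = response.content             # raw bytes of the page body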

Next, I separated some of the necessary code into two functions in order to make my script more modular and reusable. The first function, create_wikipedia_url(subject), returns a Wikipedia URL as a string given a subject parameter. This subject parameter needs to be compliant with how Wikipedia URLs are formatted; otherwise, it will likely land on a Wikipedia disambiguation page, in which case you will not receive the content you are looking for. The get_wikipedia_content(url) function uses the requests library's get method to fetch the desired HTML and then returns the content.

def create_wikipedia_url(subject):
    """Returns URL as a string. The subject parameter must match
    Wikipedia conventions, such as an underscore where there are spaces."""
    base_url = "https://en.wikipedia.org/wiki/"
    return base_url + str(subject)


def get_wikipedia_content(url):
    """Returns HTML content of website given URL as a parameter."""
    response = requests.get(url)
    if response.status_code == 200:
        return response.content
    else:
        print("Error: ", response.status_code)
        return None
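
Since Wikipedia replaces spaces with underscores in its URLs, a small helper can make raw user input compliant before building the URL. Note that normalize_subject is a hypothetical name of my own, not part of any library:

def normalize_subject(subject):
    """Hypothetical helper: convert a human-readable name to
    Wikipedia's URL convention (spaces become underscores)."""
    return str(subject).strip().replace(" ", "_")

# Usage sketch: "Chris Pratt" becomes "Chris_Pratt"
url = create_wikipedia_url(normalize_subject("Chris Pratt"))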

Then I check if __name__ is equal to '__main__' to ensure that the code that follows only runs if the Python script is executed directly in the terminal or IDE. This prevents the code from running if the script is imported elsewhere. After checking the name, I use my functions from before to create my URL and generate my HTML soup content.

if __name__ == '__main__':
    # Get input for URL
    person = input("Enter name of person: ")
    url = create_wikipedia_url(person)

    # Get BeautifulSoup object (bail out if the request failed,
    # since BeautifulSoup cannot parse None)
    content = get_wikipedia_content(url)
    if content is None:
        raise SystemExit("Could not retrieve the Wikipedia page.")
    soup = BeautifulSoup(content, "html.parser")
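
A side note on the second argument: "html.parser" is Python's built-in parser, so it works with no extra dependencies. If the third-party lxml package is installed (pip install lxml), it can be swapped in for faster parsing:

# Same soup, different parser; requires lxml to be installed.
soup = BeautifulSoup(content, "lxml")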

Using BeautifulSoup Methods

Now we get to the meat of the BeautifulSoup library, where we can use its methods to extract meaningful data from the content we scraped. Several of the most common methods include:

  • find() or find_all(): used to locate specific elements based on tag name, class, ID, or other attributes; find() returns the first match, while find_all() returns all matches
  • find_parent() or find_parents(): allow navigation up the HTML structure by finding the parent or parents of a given element (see the sketch after this list)
  • find_next_sibling() or find_previous_sibling(): find the next or previous sibling element of a given element, which is useful when scraping tabular data or other sequential information
  • select() or select_one(): CSS selector-based querying; select() returns a list of elements that match a given selector, while select_one() returns only the first match
  • get_text(): retrieves the text content from an HTML element, returning the combined text of the element and its descendants with all HTML tags stripped out
  • prettify(): returns a nicely formatted and indented string representation of the parsed HTML
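
A few of these methods do not appear in the Wikipedia examples below, so here is a minimal sketch of find_parent(), select_one(), and prettify() run against the sample HTML document from the top of this post:

from bs4 import BeautifulSoup

sample_html = """
<html><body>
    <h1>My HTML Doc</h1>
    <p>This is a sample HTML document.</p>
    <h2>Hello, World!</h2>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_parent(): navigate up from the <h1> to its enclosing <body>
h1 = sample_soup.find("h1")
print(h1.find_parent().name)  # body

# select_one(): first element matching a CSS selector
print(sample_soup.select_one("body > p").get_text())  # This is a sample HTML document.

# prettify(): indented string form of the parsed document
print(sample_soup.prettify())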

Here are some examples of using the above methods to scrape content from the Wikipedia page, building on my previous code. The examples will use actor Chris Pratt’s Wikipedia page, so the given input will be ‘Chris_Pratt’.


Using soup.title.text to display the page title…

print(soup.title.text)

Output

Chris Pratt - Wikipedia

Using find_all() to display the headers…

headers = soup.find_all("h2")
for header in headers:
    print(header.get_text())

Output

Contents
Early life
Career
Public image
Personal life
Philanthropy
Filmography
Awards and nominations
References
External links

Using find_all() to search for and display different levels of headers…

headers = soup.find_all(["h1", "h2", "h3", "h4"])
for header in headers:
    print(header.get_text())

Output

Contents
Chris Pratt
Early life
Career
2000–2013: Early work and breakthrough
2014–present: Franchise work and worldwide recognition
Public image
Personal life
Philanthropy
Filmography
Film
Television
Video games
Awards and nominations
References
External links

Using find_all() to display the paragraphs…

paragraphs = soup.find_all("p")
for paragraph in paragraphs:
    print(paragraph.get_text())

Output

Christopher Michael Pratt (born June 21, 1979)[1] is an American actor. He rose to[...]
Pratt has starred as Star-Lord in the Marvel Cinematic Universe, beginning[...]
Pratt's other starring roles were in The Magnificent Seven (2016), Passengers (2016),[...]
[A lot more]

Using find_all(), has_attr(), select() to count number of links on the page…

links = soup.find_all("a")
print("Total number of a tags: {}".format(len(links)))

count = 0
for tag in soup.find_all("a"):
    if tag.has_attr("href"):
        count += 1
print("Total number of a tags with href attribute: {}".format(count))

actual_links_length = len(soup.select("p > a")) + len(soup.select("p > i > a"))
print("Total number of a tags within a paragraph: {}".format(actual_links_length))

Output

Total number of a tags: 1322
Total number of a tags with href attribute: 1319
Total number of a tags within a paragraph: 149
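
A note on the selector choice above: the broader descendant selector "p a" would also match anchors nested deeper inside a paragraph, such as the citation superscripts ([1], [2], ...) that Wikipedia wraps in sup tags, which is presumably why the stricter child selectors are used. The difference is easy to check:

# Descendant selector: every <a> anywhere inside a <p>,
# including citation links wrapped in <sup> tags.
all_paragraph_links = soup.select("p a")
print("All a tags under a paragraph: {}".format(len(all_paragraph_links)))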

You can also print each actual link if desired…

regular_links = soup.select("p > a")
italicized_links = soup.select("p > i > a")
print("\nRegular Links: ")
for link in regular_links:
    print(link.get_text(), end=", ")
print("\nItalicized Links: ")
for link in italicized_links:
    print(link.get_text(), end=", ")

Output

Regular Links: 
Andy Dwyer, NBC, The WB, Star-Lord, Marvel Cinematic Universe, Owen Grady, Jurassic World trilogy, 100 most influential people in the world, Virginia, Minnesota, Safeway, multiple sclerosis, Norwegian, Lake Stevens, Washington, wrestling, shot putter, track, Lake Stevens High School, [...], food insecurity amidst, COVID-19 pandemic in the United States, Washington state, 
Italicized Links: 
Parks and Recreation, Everwood, Wanted, Jennifer's Body, Moneyball, Zero Dark Thirty, Her, Guardians of the Galaxy, [...], Take Me Home Tonight, Men's Health,

Using find_all() and string parameter to find all links that match a specific string or list of strings…

gotg_links = soup.find_all("a", string=["Peter Quill / Star-Lord", "Peter Quill", "Star-Lord", "Guardians of the Galaxy"])
for gotg_link in gotg_links:
    print(gotg_link.get_text())
print("Guardians of the Galaxy Links Length: {}".format(len(gotg_links)))

Output

Star-Lord
Guardians of the Galaxy
Peter Quill / Star-Lord
Guardians of the Galaxy
Guardians of the Galaxy
Peter Quill / Star-Lord
Peter Quill / Star-Lord
Guardians of the Galaxy
Guardians of the Galaxy Links Length: 8

Using find() and next_sibling to get occupation…

try:
    occupation = soup.find("th", string=["Occupation", "Occupations"]).next_sibling
    print("{}'s occupation is {}".format(person, occupation.get_text()))
except AttributeError:
    # find() returned None, so there was no Occupation(s) row to read
    print("Occupation not found in wikipedia bio.")

Output

Chris_Pratt's occupation is Actor

Output (using Jocko_Willink)

Occupation not found in wikipedia bio.

Regex

There are so many cool things you can do with this library, which is probably why it is called BeautifulSoup. One awesome way to get useful info with BeautifulSoup is to combine it with regular expressions. In an upcoming post I will revisit this code using regex to extract even cooler insights from our soup.
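
As a small taste, BeautifulSoup's string filter accepts a compiled regular expression, so a sketch like the following (the pattern is just an illustration) already works with the soup from above:

import re

# Find every link whose visible text mentions "Guardians",
# regardless of the exact phrasing around it.
for link in soup.find_all("a", string=re.compile(r"Guardians")):
    print(link.get_text())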