Getting started with Beautiful Soup 4

Beautiful Soup is a Python library for extracting data from HTML and XML files.

If you have ever tried to scrape content from webpages, you know it can be very tedious without the right tools — especially on a site with many pages that vary in structure.

So what can you do with Beautiful Soup and Python?

You can, for example, extract all the text from a website, collect every link or image, search for specific words or strings, or select elements by CSS class, attribute or tag. The possibilities are vast and it is super easy to use.
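To give you a taste, here is a minimal sketch (the HTML snippet is made up for illustration) showing a few of those operations:

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML snippet to demonstrate with
html = '<p class="intro">Hello, <a href="https://example.com">world</a>!</p>'
soup = BeautifulSoup(html, 'html.parser')

# All the text, with tags stripped out
print(soup.get_text())                  # Hello, world!
# The first link's URL
print(soup.a['href'])                   # https://example.com
# The tag that has the CSS class "intro"
print(soup.find(class_='intro').name)   # p
```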

If you are thinking about using regex, please don’t. Use a proper scraping tool instead.

So, let’s get to it!

Requirements

  • You need Python 2.7.9 or newer, or Python 3.4 or newer, installed on your computer.
  • You are a tiny bit familiar with Python. Or at least you have used it once.
  • Some familiarity with basic HTML markup and your browser's DevTools/Web Console will make things a lot easier, but it isn't strictly required.

Installation

Install Beautiful Soup 4 for Python 3 with pip. Open the Command Prompt or similar and enter:

pip3 install beautifulsoup4

For other systems and installation options, see the official Beautiful Soup documentation.
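Once installed, you can quickly confirm that the package imports and check its version:

```python
import bs4

# Print the installed Beautiful Soup version
print(bs4.__version__)
```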

Different ways to get some HTML data

Before we can use Beautiful Soup to scrape some data, we need to actually fetch some data first.

Here are some common ways you can get data depending on use cases.


Get the contents of a live webpage using requests:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://requests.readthedocs.io/en/master/user/quickstart/')
r.raise_for_status()  # stop early if the request failed (e.g. a 404)
soup = BeautifulSoup(r.text, 'html.parser')

print(soup)

Get content from a local HTML file:

Make sure you replace C:\path\to\your\file.html with the location of your HTML file, and use only one of the two variants below.

from bs4 import BeautifulSoup

with open("C:\\path\\to\\your\\file.html", encoding="utf8") as fp:
	soup = BeautifulSoup(fp, 'html.parser')

# ...or in one line (note: this variant leaves the file handle open)

soup = BeautifulSoup(open("C:\\path\\to\\your\\file.html", encoding="utf8"), "html.parser")

print(soup)

Get content from all local HTML files in a folder:

Make sure you replace C:\Users\Jimmy\Desktop\SomeFolder with the location of your HTML files.

import os
from bs4 import BeautifulSoup

path = r'C:\Users\Jimmy\Desktop\SomeFolder'
for filename in os.listdir(path):
	if filename.endswith(".html"):
		fullpath = os.path.join(path, filename)

		soup = BeautifulSoup(open(fullpath, encoding='utf-8'), 'html.parser')
		
		# Beautiful Soup scraping goes here...
		print(soup.text)

Get content from a short HTML snippet within the code:

Replace everything between html_doc = """ and the closing """ with your HTML code.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)

Get the contents of a dynamic webpage:

If you want to load the HTML of a dynamic webpage (a page with AJAX or JavaScript), you will have to use a WebDriver (for example Selenium with ChromeDriver) to load the page first. Then you can extract the HTML.

First install Selenium with pip install selenium. Then download ChromeDriver (or the driver for your browser) and add it to your PATH.

import time
# Import Selenium WebDriver
from selenium import webdriver

# Set ChromeDriver as WebDriver
driver = webdriver.Chrome()
# Launch the webpage
driver.get('https://www.dplay.no/kanaler/')
# Give the browser enough time for the WebElement to load
time.sleep(10)  # Might differ depending on your connection...
# Print the rendered HTML (ready to be handed to Beautiful Soup)
print(driver.page_source)
# Close the browser when you are done
driver.quit()

Now that you know how to load some content you might want to know how to use the true magic of Beautiful Soup 🧙‍♂️!

Super simple n00b examples

The following examples all use this HTML, loaded into a soup object as shown in the snippet example above:

<html>
<head>
	<title>The Dormouse's story</title>
</head>
<body>
	<p class="title"><b>The Dormouse's story</b></p>

	<p class="story">Once upon a time there were three little sisters; and their names were
	<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
	<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
	<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
	and they lived at the bottom of a well.</p>

	<p class="story">...</p>
</body>
</html>

Let’s start with something easy. Let’s say we want to get the <title> tag of the HTML. That’s super easy:

print(soup.title)

This will give you: <title>The Dormouse's story</title> in return.

If you just want the text inside the tag, without the tags themselves, add .string so the command becomes:

print(soup.title.string)

This will give you The Dormouse's story in return.
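One caveat worth knowing: .string only works when a tag contains a single piece of text. When a tag has nested elements mixed with text, .string returns None, and .get_text() is what you want. A small sketch (with a made-up snippet) illustrating the difference:

```python
from bs4 import BeautifulSoup

# A paragraph with a nested <b> tag mixed into its text
html = '<p>Once upon a time there were <b>three</b> little sisters.</p>'
soup = BeautifulSoup(html, 'html.parser')

# .string is None because the <p> tag has more than one child
print(soup.p.string)      # None
# .get_text() joins all the nested text instead
print(soup.p.get_text())  # Once upon a time there were three little sisters.
```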

So let’s say you want to get all of the link tags. You might be tempted to write print(soup.a.string), but this will only give you the first occurrence of the tag.

Instead we need to use find_all:

print(soup.find_all('a'))

This will list all link tags.

If you just want the URLs themselves, you need to ask Beautiful Soup for the href attribute of each link, which calls for a for loop:

for link in soup.find_all('a'):
	print(link.get('href'))

This will give you:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

We could go over a ton of other examples, but at this point, you have enough information to get started.
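As one last teaser, here is a quick sketch (using a trimmed-down version of the same HTML) of filtering tags by CSS class or id — two of the capabilities mentioned at the top:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# All tags with the CSS class "sister"
# (class_ has a trailing underscore to avoid Python's "class" keyword)
for tag in soup.find_all(class_='sister'):
	print(tag.string)

# A single tag looked up by its id attribute
print(soup.find(id='link2').get('href'))
```

This prints Elsie, Lacie and then http://example.com/lacie.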

The official documentation has a lot of examples and, as always, DuckDuckGo is your friend! 😉
