Beautiful Soup is a Python library for pulling data out of HTML and XML files. It's one of the most popular Python modules and is pretty easy to use.
To install Beautiful Soup 4 (BS4), you can just use pip:
pip install beautifulsoup4
In this post, I am going to use the requests Python library to download the pages:
pip install requests
Before we get started, be sure to import the modules in your Python script:
import requests
from bs4 import BeautifulSoup
The first step is to retrieve the HTML from the website and hand it to Beautiful Soup:
response = requests.get('http://example.com/')
soup = BeautifulSoup(response.text, 'html.parser')
The soup variable now holds the parsed BS4 object, which you can search and navigate.
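A slightly more defensive version of the fetch step might look like the sketch below. The helper name fetch_soup and the 10-second timeout are my own assumptions for illustration; they are not part of requests or Beautiful Soup.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object (hypothetical helper)."""
    # A timeout keeps the script from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

You would call it as fetch_soup('http://example.com/') and use the result exactly like the soup object above.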
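There are a few different parsers: html.parser, lxml, lxml-xml, and html5lib. Each has its own trade-offs: html.parser ships with Python and needs no extra install, lxml is generally the fastest but is a separate package, lxml-xml is the one to use for XML documents, and html5lib is the slowest but parses pages the same way a web browser does.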
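As a rough sketch of how the parser is chosen: it is simply the second argument to BeautifulSoup. The HTML string here is made up for illustration, so the snippet runs without a network request, and it sticks to the built-in parser so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used here instead of a live page
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; 'lxml' or 'html5lib' could be
# swapped in here if those packages are installed
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string
paragraph_text = soup.p.get_text()
print(title_text)
print(paragraph_text)
```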
To find an element with a specific id:
element = soup.find(id='elementTagID')
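One thing worth knowing is that find returns None when nothing matches, so it can pay to check before using the result. Here is a minimal sketch, assuming a made-up HTML string and made-up id values:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the ids are invented for this example
html = '<div id="main"><p id="intro">Welcome</p></div>'
soup = BeautifulSoup(html, "html.parser")

element = soup.find(id="intro")
missing = soup.find(id="does-not-exist")  # no such id, so this is None

# Guard against None before calling methods on the result
intro_text = element.get_text() if element is not None else None
print(intro_text)
print(missing)
```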
One common task is extracting all the URLs found within a page's <a> tags:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
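Real pages often mix absolute and relative links, and some <a> tags have no href at all. The sketch below is one way to handle both, using urljoin from the standard library. The HTML and the base_url value are assumptions made up for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed address of the page the HTML came from
base_url = "http://example.com/"

# Hypothetical markup with an absolute link, a relative link, and an anchor without href
html = """
<a href="http://example.com/elsie">Elsie</a>
<a href="/lacie">Lacie</a>
<a name="top">No link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute;
# urljoin turns relative paths into absolute URLs
urls = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(urls)
```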
To get the link text from the page:
for link in soup.find_all('a'):
    print(link.get_text())
# Click here
# another link text
# IDK what else to put
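Link text often comes with stray whitespace or nested tags. As a small sketch (the HTML string is invented for illustration), get_text accepts a separator and a strip flag that help tidy it up:

```python
from bs4 import BeautifulSoup

# Hypothetical link with padding whitespace and a nested <b> tag
html = '<a href="/a">  Click <b>here</b> </a>'
soup = BeautifulSoup(html, "html.parser")

raw_text = soup.a.get_text()
# Strip each piece of text and join the pieces with a single space
clean_text = soup.a.get_text(" ", strip=True)
print(repr(raw_text))
print(repr(clean_text))
```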
To get all link elements with the CSS class sister:
soup.find_all("a", class_="sister")
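For comparison, here is a short sketch that runs the class lookup against a made-up HTML string and shows the same query written as a CSS selector with select. Note that class_ matches any one of an element's classes, so a tag with class "sister big" is included too.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class names other than "sister" are invented
html = """
<a class="sister" href="http://example.com/elsie">Elsie</a>
<a class="sister big" href="http://example.com/lacie">Lacie</a>
<a class="brother" href="http://example.com/bob">Bob</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ has a trailing underscore because class is reserved in Python
by_keyword = soup.find_all("a", class_="sister")
# CSS selector form of the same query
by_selector = soup.select("a.sister")

names = [a.get_text() for a in by_keyword]
selector_names = [a.get_text() for a in by_selector]
print(names)
```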
© 2024 by Ryan Rickgauer