Python BS4

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It's one of the most popular Python modules and is pretty easy to use.

Installation

To install Beautiful Soup 4 (BS4), you can just use pip:

pip install beautifulsoup4

In this post, I am going to use the requests Python library to download pages:

pip install requests

Before we get started, be sure to import both modules in your Python script:

import requests
from bs4 import BeautifulSoup

The first step is to retrieve the HTML from the website and hand it to Beautiful Soup:

response = requests.get('http://example.com/')
soup = BeautifulSoup(response.text, 'html.parser')
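In a real script the request can fail or hang, so it may be worth wrapping the two lines above in a small helper. The sketch below is one way to do it; fetch_soup is a hypothetical name of my own, and the 10-second timeout is an arbitrary assumption.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    # A timeout keeps the script from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

You would then call it as soup = fetch_soup('http://example.com/') and continue exactly as in the rest of this post.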

soup now holds a BeautifulSoup object representing the parsed document, which you can search and navigate.

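There are a few different parsers: html.parser, lxml, lxml-xml, and html5lib. html.parser ships with Python, so it needs no extra install. lxml is generally the fastest but has to be installed separately, and lxml-xml is its XML mode. html5lib is the most lenient with broken markup but also the slowest.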
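Since lxml and html5lib are optional installs, a script can fall back to the built-in parser when the preferred one is missing. Beautiful Soup raises FeatureNotFound when you ask for a parser that is not installed, so a minimal sketch could look like this (make_soup and its default parser order are my own assumptions, not part of the library):

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.p.get_text())  # Hello, world
```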

To find an element with a specific id:

element = soup.find(id='elementTagID')
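Keep in mind that find returns None when nothing matches, so it is safer to check the result before using it. Here is a small offline sketch using a made-up HTML snippet (the ids are just placeholders):

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = '<div id="main"><p id="intro">Welcome</p></div>'
soup = BeautifulSoup(html, "html.parser")

intro = soup.find(id="intro")
missing = soup.find(id="does-not-exist")

if intro is not None:
    print(intro.name, intro.get_text())  # p Welcome

print(missing)  # None
```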

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))

# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
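On real pages some links are relative and some <a> tags have no href at all, in which case link.get('href') gives you None. A rough sketch of how you might handle both, using the standard library's urljoin and a made-up snippet (base_url is an assumed page address):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://example.com/"  # assumed address of the page being parsed
html = """
<a href="/elsie">Elsie</a>
<a href="http://example.com/lacie">Lacie</a>
<a name="top">No href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

urls = []
# href=True keeps only <a> tags that actually have an href attribute.
for link in soup.find_all("a", href=True):
    # urljoin turns relative links into absolute ones and leaves absolute ones alone.
    urls.append(urljoin(base_url, link["href"]))

print(urls)  # ['http://example.com/elsie', 'http://example.com/lacie']
```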

To get the text of each link instead of its URL:

for link in soup.find_all('a'):
    print(link.get_text())

# Click here
# another link text
# IDK what else to put
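Link text often comes with stray newlines and indentation, especially when the link wraps other tags. get_text accepts a separator and a strip flag that tidy this up. A quick sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Made-up markup: one link with nested tags and whitespace, one empty link.
html = '<a href="/one">\n  Click <b>here</b>\n</a><a href="/two"></a>'
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each piece of text; " " is put between the pieces.
texts = [link.get_text(" ", strip=True) for link in soup.find_all("a")]
print(texts)  # ['Click here', '']

# Drop links that have no visible text.
non_empty = [text for text in texts if text]
print(non_empty)  # ['Click here']
```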

To get all link elements with the CSS class sister:

soup.find_all("a", class_="sister")
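The keyword is class_ with a trailing underscore because class is a reserved word in Python. It matches any tag that has sister among its classes, even if the tag has other classes too. A short sketch with made-up markup (the ids and the extra classes are my own placeholders):

```python
from bs4 import BeautifulSoup

html = """
<a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister featured" id="link2" href="/lacie">Lacie</a>
<a class="brother" id="link3" href="/tom">Tom</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result, so you can loop over it or index it.
sisters = soup.find_all("a", class_="sister")
print([a["id"] for a in sisters])  # ['link1', 'link2']
```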