Web Scraping in Python Using Beautiful Soup 4
Table of contents
- Steps Involved in Web Scraping in Python Using Beautiful Soup
- 1. Install BeautifulSoup4
- 2. Install Requests
- 3. Confirm the Website can be Scraped
- 4. Move on to Using Beautiful Soup
- 5. Looping through the First Five “file-one” Article Elements in Parsed HTML Content
- 6. Create an Excel Sheet
- 7. Save to an Excel Sheet
- With BeautifulSoup, Save Yourself the Stress of Dealing With Chunks of Data
Have you ever envisioned effortlessly extracting targeted data from websites? Web scraping with Python using BeautifulSoup can make that vision a reality. Beautiful Soup is a widely adopted Python library, renowned for its simplicity and effectiveness in navigating and parsing HTML and XML, that streamlines the extraction of desired information from web pages. By harnessing its capabilities, we gain a powerful ally in data extraction, one that simplifies the process and lets us focus solely on the data we seek.
In this article, I will provide a step-by-step guide on how to extract data from the website https://www.thenetnaija.net/videos using the powerful BeautifulSoup library in Python. Our focus will be on retrieving the category, title, and number of votes for the first five files on the website. To begin the web scraping process, we need to look at the HTML content of the website, which can be done by right-clicking on the page in your browser and selecting the “Inspect” option. This opens up the HTML structure, allowing us to navigate it and find the elements that hold the data we want.
Steps Involved in Web Scraping in Python Using Beautiful Soup
The steps below provide a straightforward guide on how to extract data from websites with Python using BeautifulSoup:
1. Install BeautifulSoup4
BeautifulSoup is a popular Python library that provides convenient methods for extracting data from Hypertext Markup Language (HTML) and Extensible Markup Language (XML) files. It makes it easy to navigate the structure of a webpage and pull out the data you want. Beautiful Soup also provides a simple and intuitive Application Programming Interface (API) that abstracts away the complexities of parsing and traversing the document structure, letting you extract specific elements, text, or attributes from the HTML/XML based on criteria such as tag names, class names, IDs, or other patterns.
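If you don’t already have it, you can install the library from the command line with pip:

```
pip install beautifulsoup4
```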
2. Install Requests
The ‘requests’ library is important for web scraping because it lets you send HTTP requests to web servers and retrieve the HTML content of web pages. Requests returns the content of a webpage as a response object, from which you can extract the desired information using techniques like parsing with Beautiful Soup. The library also handles many kinds of data, from JavaScript Object Notation (JSON) to images and other files, which comes in handy when you scrape web pages that serve data in different formats.
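It can likewise be installed with pip:

```
pip install requests
```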
3. Confirm the Website can be Scraped
It’s important to understand that not all websites can be scraped. To determine if a website can be scraped, you can analyse the code snippet provided below and follow a similar approach:
Import the requests library, assuming you have it installed, to make HTTP requests.
Store the website URL in a variable.
Use the requests library to send a GET request to the website URL.
Print the response object returned by the GET request.
If the website can be scraped, the request will return a response code of “<Response [200]>”. If it cannot, it may return “<Response [403]>”, which indicates that access is forbidden.
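A minimal sketch of those steps might look like this (the response object is stored in a variable named `data`, matching the name used in step 4 below):

```python
import requests

# Store the website URL in a variable
url = "https://www.thenetnaija.net/videos"

# Send a GET request to the website and store the response object
data = requests.get(url)

# Prints "<Response [200]>" if the request succeeded;
# print(data.text) would print the page's HTML content instead
print(data)
```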
Note that when you append .text to the response object in the print statement, the HTML content of the website is displayed in the console instead of the response code “200”. This snippet helps you understand the response code and determine whether scraping is possible, but it does not guarantee that scraping is allowed by the website. Always ensure you comply with the website’s terms of service and legal requirements when scraping data.
4. Move on to Using Beautiful Soup
The line bSoup = BeautifulSoup(data.content, "html.parser") uses the BeautifulSoup library to parse the HTML content of a web page. The data.content attribute holds the raw HTML of the response. When this line runs, BeautifulSoup takes that HTML and parses it with the "html.parser" parser. The line findFileOne = bSoup.find_all("article", class_="file-one") then uses BeautifulSoup to locate all article elements with the class name “file-one” within the parsed HTML content. This implies that the HTML structure of the page contains multiple articles, each potentially representing an individual file or item. By using find_all(), all matching elements are collected and stored in the variable findFileOne.
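Continuing from the `data` response object above, those two lines fit together like this:

```python
from bs4 import BeautifulSoup

# Parse the HTML content of the response
bSoup = BeautifulSoup(data.content, "html.parser")

# Collect all <article> elements with the class "file-one"
findFileOne = bSoup.find_all("article", class_="file-one")
```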
5. Looping through the First Five “file-one” Article Elements in Parsed HTML Content
In this code, we use a for loop to iterate through the first five data entries in the findFileOne collection. Within each iteration, we extract specific information from each data entry.
The ‘category’ variable is extracted by finding the relevant <div> tag with the class “category” and retrieving its text content.
The ‘title’ variable is extracted by locating the <h2> tag and stripping any leading or trailing whitespace from its text content.
The ‘votes’ variable is extracted by finding the relevant <span> tag with the class “vote-count”, retrieving its text content, and removing any parentheses.
To keep track of the file number, we maintain a counter that increases by 1 on each iteration. This lets us label each file’s content accordingly.
The extracted data is then formatted into a content string that includes the file’s number, category, title, and votes. Finally, the content of each file is printed, with a line break (“\n”) separating each set of data for improved legibility.
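A sketch of that loop, assuming the tag and class names described above and using `enumerate` to produce the file number, might look like this:

```python
# Loop through the first five "file-one" article elements,
# numbering them from 1
for number, data in enumerate(findFileOne[:5], start=1):
    # Text of the <div class="category"> tag
    category = data.find("div", class_="category").text

    # Text of the <h2> tag, stripped of surrounding whitespace
    title = data.find("h2").text.strip()

    # Text of the <span class="vote-count"> tag, parentheses removed
    votes = data.find("span", class_="vote-count").text.strip("()")

    # Format the extracted data and print it, separated by line breaks
    content = f"File {number}\nCategory: {category}\nTitle: {title}\nVotes: {votes}"
    print(content, "\n")
```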
6. Create an Excel Sheet
It is a good idea to save your extracted data to an Excel sheet. To do this, go back to the top of your code and set up an Excel workbook, as in the sketch below. We import the ‘openpyxl’ library and create a new Excel workbook using ‘openpyxl.Workbook()’. We then print the sheet names before renaming the default sheet. The ‘active’ property of the workbook represents the currently selected sheet, and we assign it to the ‘sheet’ variable. We set the title of this sheet to “Naija Movies” using ‘sheet.title’. After that, we print the sheet names again to verify that the title has changed. To add a header row, we use the sheet’s ‘append()’ method and pass a list containing the column names: [“number”, “category”, “title”, “votes”].
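Here is what that setup looks like in code, a sketch based on the description above:

```python
import openpyxl

# Create a new Excel workbook and print its default sheet names
excel = openpyxl.Workbook()
print(excel.sheetnames)

# The 'active' property is the currently selected sheet; rename it
sheet = excel.active
sheet.title = "Naija Movies"

# Print the sheet names again to verify the change
print(excel.sheetnames)

# Add a header row with the column names
sheet.append(["number", "category", "title", "votes"])
```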
7. Save to an Excel Sheet
Inside the loop, we call ‘sheet.append([number, category, title, votes])’ to add a new row to the sheet, containing the data extracted on that iteration. Finally, we save the workbook with ‘excel.save("Top Naija Rated Movies.xlsx")’. This writes an Excel file named “Top Naija Rated Movies.xlsx” to the current directory, containing the sheet with the header row and the data extracted from the ‘findFileOne’ elements.
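Putting this together with the loop from step 5, a sketch of the final version might look like this:

```python
for number, data in enumerate(findFileOne[:5], start=1):
    category = data.find("div", class_="category").text
    title = data.find("h2").text.strip()
    votes = data.find("span", class_="vote-count").text.strip("()")

    # Add a new row containing this iteration's extracted data
    sheet.append([number, category, title, votes])

# Write the workbook to the current directory
excel.save("Top Naija Rated Movies.xlsx")
```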
With BeautifulSoup, Save Yourself the Stress of Dealing With Chunks of Data
Congratulations on completing the article! Now it’s time to put your newfound knowledge into practice. Open a new browser tab and find a website that interests you. Remember to respect the website’s terms of service and legal requirements when scraping data; you can refer back to the third step above to check whether the website allows scraping. Don’t be afraid to give it a try, and keep experimenting until you succeed. Good luck with your scraping adventure.