Did you know that 20 lines of Python code are all it takes to build a basic but fully functional web scraper for an eCommerce website?
In this blog, we'll build an eCommerce scraper in exactly 20 lines of Python and, along the way, show how powerful programming, and Python in particular, is for working with data.
Web Scraping Definition: Web scraping is the process of extracting data from web pages and transforming it into a useful format like CSV / JSON, or putting it into a database of your choice.
Statutory warning: This is not a production-grade web scraper - just a fun way to demonstrate the power of Python web scraping libraries. We will show you how to optimize it into a production-level scraper in a later blog.
The eCommerce website
The first thing you need to do is find the URL of the website you want to scrape. For the purpose of this blog, we'll be demonstrating the web scraper on the Puma website which is an e-commerce website. We will be scraping the data for MANCHESTER CITY FC Jerseys that are on sale.
Here is the URL for you to visit: https://in.puma.com/in/en/collections/collections-football/collections-football-manchester-city-fc
Data fields to extract
We will be extracting the following four attributes from the eCommerce page:
1. Page URL
The first data field we will be extracting is the page URL of the product which is a must-have data field for most e-commerce web scraping projects. The URL is important because it is a unique identifier for each product page and can be used to find more information about the product. The page URL or product page URL is the direct link to the page we are scraping the data from. Here is an example of a product page URL.
2. Product Name
The name of the product is saved under Product Name in the output CSV file. An example of a product name from the page URL mentioned above is "Manchester City Home Replica Men's Jersey".
3. The Price
The product price is the price at which the product is being sold at the moment. Extracting pricing data helps us determine whether or not it's worth investing in this item or if there are better alternatives available on the market right now (e.g., cheaper prices).
In our case, the product price is displayed in red color.
4. The Description
The description data gives insight into what kind of features are included in each product (such as color options or size variations), which in turn can help brands find out whether or not the item is suitable for their target audience.
In our case, the description of the product appears on the Puma website under the heading "Product Story".
The Python Libraries used for building the eCommerce scraper
1. Requests.
Requests is a popular HTTP library for Python. The project's goal is to make HTTP requests simpler and more human-friendly. It is a vital piece of most web scraping projects written in Python: scrapers use it either directly or indirectly, through a library or framework that wraps it. We will use Requests to fetch content from a URL.
2. BeautifulSoup
BeautifulSoup is a popular Python library that makes it easy to scrape data from web pages. It builds a parse tree from HTML and XML documents, which we can then search for the elements we need. It is the soul of our web scraper.
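To make the idea of a parse tree concrete, here is a minimal sketch that parses a small HTML fragment and reads values out of it. The fragment, its class names, and the href are made up for illustration and are not taken from the Puma site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example only.
html = """
<html><body>
  <h1 class="product-name"> Sample Jersey </h1>
  <a class="pdp-link" href="/in/en/pd/sample-jersey/000001.html">Sample Jersey</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
title = soup.find("h1", class_="product-name").text.strip()

# select() takes a CSS selector and returns a list of matching tags.
links = [a["href"] for a in soup.select("a.pdp-link")]

print(title)
print(links)
```

The same two calls, find() and select(), work on a full page fetched from the web.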
3. CSV
The Python csv library helps programmatically read and write tabular data in CSV/Excel formats. We will be using the library to write the scraped data into a csv file. You can use other libraries like Python Pandas to do the same thing more efficiently, but for our purpose - let's stick with the CSV library.
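As a quick illustration of what the csv library does for us, the sketch below writes two rows to an in-memory buffer and reads them back. The row values are placeholders; the point is that fields containing commas are quoted automatically, so they survive the round trip.

```python
import csv
import io

# Placeholder rows; the second row contains commas inside fields.
rows = [
    ["Product Name", "Price", "Description", "Link"],
    ["Sample Jersey", "1,999", "Line one, with a comma", "https://example.com/p/1"],
]

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back to confirm nothing was lost or split incorrectly.
parsed = list(csv.reader(io.StringIO(csv_text)))

print(csv_text)
```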
The process
Go to the page where the MANCHESTER CITY FC Jerseys are displayed
Grab the links to the products on sale and save them to a list.
Read the list and go to the product links one at a time
Find the elements to extract using the CSS selectors
Parse the information and save it to a file named puma_manchester_city.csv, and we are done.
Let's start scraping
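1. Step 1: Install and import the required libraries
Requests and BeautifulSoup are third-party packages, so install them first (for example with pip install requests beautifulsoup4); the csv library ships with Python. With that done, we import all three and they are ready to fire.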
import requests
from bs4 import BeautifulSoup
import csv
2. Step 2: Set a start URL
The start URL tells the web scraper where to start. In our case - the link to MANCHESTER CITY FC Jerseys on sale will be our start URL.
start_url = "https://in.puma.com/in/en/collections/collections-football/collections-football-manchester-city-fc"
3. Step 3: Fetch the start URL
Let's start the fireworks
The next step is to go to the start URL, fetch its content, and identify the product links. We use the following two lines of code for that.
web_page = requests.get(start_url)
soup = BeautifulSoup(web_page.content, 'html.parser')
The HTTP request returns a Response object with all the response data (content, encoding, status, etc.), and we store it in the web_page variable. We can now use BeautifulSoup to parse web_page.
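Since the Response object carries the status, it can be worth checking it before parsing. The helper below is a sketch of one way to do that; the function name fetch_soup, the 10-second timeout, and the User-Agent string are all assumptions made for illustration, not part of the original 20 lines.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return it as a BeautifulSoup parse tree."""
    # Arbitrary example header; adjust or remove it as needed.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, the two lines above could be written as soup = fetch_soup(start_url).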
4. Step 4: Find the product links
The next step is to navigate through the HTML and find the product URLs. We will add the product URLs to a list. We use CSS selectors to find the element on the page. CSS Selectors help us select HTML elements according to their Id, class, type, attribute etc.
If we open Chrome Developer Tools and inspect the page closely, we will see that all the product links share the same class attribute:
"product-tile-title product-tile__title pdp-link line-item-limited line-item-limited--2"
product_links = []
for link in soup.find_all('a', class_='product-tile-title product-tile__title pdp-link line-item-limited line-item-limited--2'):
    # Build the full URL first so the duplicate check compares like with like
    product_url = 'https://in.puma.com/' + link['href'].lstrip('/')
    if product_url not in product_links:
        product_links.append(product_url)
We use the soup.find_all method to extract every product link with the class above and add each one to the product_links list.
Notice that the URL we get from the page is not complete. We need to prepend the first part, "https://in.puma.com/", to turn it into a proper URL.
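An alternative to joining the strings by hand is urljoin from the standard library, which takes care of the slashes for us. The sketch below assumes the href values are site-relative paths; the three paths shown are hypothetical stand-ins, not real product URLs.

```python
from urllib.parse import urljoin

BASE_URL = "https://in.puma.com/"

# Hypothetical href values standing in for what the page returns.
hrefs = [
    "/in/en/pd/sample-home-jersey/000001.html",
    "/in/en/pd/sample-away-jersey/000002.html",
    "/in/en/pd/sample-home-jersey/000001.html",  # duplicate of the first
]

product_links = []
for href in hrefs:
    # urljoin avoids a doubled or missing slash between base and path.
    full_url = urljoin(BASE_URL, href)
    # Compare full URLs so duplicates are actually detected.
    if full_url not in product_links:
        product_links.append(full_url)

print(product_links)
```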
Now we parse all the URLs we extracted from the previous step one by one
Before starting the parsing, we need to prepare the data to be saved to a csv file. We use the following lines of code to achieve that.
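with open('puma_manchester_city.csv', 'w', newline='', encoding='utf-8') as csv_file:
    # newline='' stops the csv module from writing blank lines on Windows
    writer = csv.writer(csv_file)
    writer.writerow(['Product Name', 'Price', 'Description', 'Link'])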
We write the header row into a file named puma_manchester_city.csv using a writer object and its .writerow() method.
In the next step, we iterate over each product URL in the list product_links and parse them. This loop sits inside the with block above, so the file is still open when we write each row.
    # Still inside the with block, so the CSV file stays open
    for product_url in product_links:
        product_page = requests.get(product_url)
        product_soup = BeautifulSoup(product_page.content, 'html.parser')
        product_name = product_soup.find('h1', class_='product-name').text.strip()
        price = product_soup.find('span', class_='value').text
        product_description = product_soup.find('div', class_='content', itemprop="description").text
        writer.writerow([product_name, price, product_description, product_url])
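One thing to watch for in the loop above: find() returns None when an element is missing, and calling .text on None raises an AttributeError. A small helper can guard against that. This is a sketch, the name text_or_empty is hypothetical, and the HTML fragment is invented for the example.

```python
from bs4 import BeautifulSoup


def text_or_empty(soup, tag_name, **attrs):
    """Return the stripped text of the first match, or '' if there is none."""
    tag = soup.find(tag_name, **attrs)
    return tag.get_text(strip=True) if tag is not None else ""


# Invented fragment with a name and a price but no description block.
html = '<h1 class="product-name"> Sample Jersey </h1><span class="value">1,999</span>'
page = BeautifulSoup(html, "html.parser")

product_name = text_or_empty(page, "h1", class_="product-name")
price = text_or_empty(page, "span", class_="value")
product_description = text_or_empty(page, "div", class_="content", itemprop="description")

print([product_name, price, product_description])
```

With a helper like this, a product page that lacks one field produces an empty cell instead of stopping the whole run.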
Once we finish this and run the code - we will get a csv file with the data from the MANCHESTER CITY FC Jerseys category.
Is the data clean? No - we need to do additional data cleaning operations on top of the data or embed it within the scraper to get a cleaner set of data.
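As a rough idea of what such cleaning might look like, the sketch below collapses stray whitespace and turns a price string into a number. The helper names and the sample inputs are assumptions; the price format on the live site may differ, so treat the pattern as a starting point.

```python
import re


def clean_text(value):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(value.split())


def parse_price(value):
    """Extract the first number from a price string, or None if there is none."""
    # Matches digits with optional thousands commas and a decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", value)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(clean_text("  Sleeveless\n   jersey \t fabric "))
print(parse_price("₹ 3,999"))
print(parse_price("N/A"))
```

These helpers could be applied to each field just before the row is written to the CSV file.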
e-Commerce scraping is widely used by brands globally to acquire data from e-commerce websites and use it to improve their business. Web scraping can be used for many purposes: to gain insight into competitors' products/prices, to monitor stock levels across various sellers on Amazon, or even to help you find new products that might be relevant to your customers.
Wish to leverage web data scraping for your e-commerce store? Contact Datahut to learn more!