Thasni M A
Scraping Decathlon using Playwright in Python

Decathlon is a renowned sporting goods retailer that offers a diverse range of products, including sports apparel, shoes, and equipment. Scraping the Decathlon website can provide valuable insights into product trends, pricing, and other market information. In this article, we'll dive into how you can scrape apparel data from Decathlon's website by category using Playwright and Python.
Playwright is an automation library that enables you to control web browsers, such as Chromium, Firefox, and WebKit, using programming languages like Python and JavaScript. It's an ideal tool for web scraping because it allows you to automate tasks such as clicking buttons, filling out forms, and scrolling. We'll use Playwright to navigate through each category and collect information on products, including their name, price, and description.
In this tutorial, you'll gain a fundamental understanding of how to use Playwright and Python to scrape data from Decathlon's website by category. We'll extract several data attributes from individual product pages:
Product URL - The URL of the product page.
Product Name - The name of the product.
Brand - The brand of the product.
MRP - The maximum retail price (MRP) of the product.
Sale Price - The sale price of the product.
Number of Reviews - The number of customer reviews for the product.
Ratings - The star rating of the product.
Color - The color of the product.
Features - The description of the product.
Product Information - Additional product details, such as composition, country of origin, etc.
Here's a step-by-step guide for using Playwright in Python to scrape apparel data from Decathlon by category.
Also Read: Scraping Amazon Reviews with Playwright and Python
Importing Required Libraries
To start our process, we will need to import the required libraries that will interact with the website and extract the information we need.
import random
import asyncio
import pandas as pd
from playwright.async_api import async_playwright
'random' - This library generates random numbers; in this script it is used to add randomized delays between requests so the scraper behaves less predictably.
'asyncio' - This library handles asynchronous programming in Python, which is necessary when using Playwright's asynchronous API.
'pandas' - This library is used for data analysis and manipulation. In this tutorial, it stores the scraped data and writes it to a CSV file.
'async_playwright' - This is the asynchronous API of Playwright, which the script uses to automate the browser. The asynchronous API allows multiple operations to run concurrently, making the scraper faster and more efficient.
Together, these libraries handle browser automation with Playwright, asynchronous execution, randomized delays, and storing and manipulating the scraped data.
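If Playwright is not installed yet, it can be added with pip install playwright, followed by playwright install firefox to download the browser binary. The short snippet below is not part of the scraper itself; it is a minimal sketch that launches Firefox and prints a page title just to confirm the setup works.
import asyncio
from playwright.async_api import async_playwright

async def check_setup():
    # Launch a headless Firefox browser, open the Decathlon homepage and print its title
    async with async_playwright() as pw:
        browser = await pw.firefox.launch()
        page = await browser.new_page()
        await page.goto("https://www.decathlon.com")
        print(await page.title())
        await browser.close()

asyncio.run(check_setup())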
Extraction of Product URLs
The next step is extracting the URLs of the resulting apparel products. Here, we extract the product URLs category by category.
async def get_product_urls(browser, page):
    product_urls = []
    # Loop through all pages
    while True:
        # Find all elements with the product urls
        all_items = await page.query_selector_all('.adept-product-display__title-container')
        # Extract the href attribute for each item and append to product_urls list
        for item in all_items:
            url = await item.get_attribute('href')
            product_urls.append(url)
        num_products = len(product_urls)
        print(f"Scraped {num_products} products.")
        # Find the next button
        next_button = await page.query_selector('.adept-pagination__item:not(.adept-pagination__disabled) a[aria-label="Go to next page"]')
        # Exit the loop if there is no next button
        if not next_button:
            break
        # Click the next button with retry mechanism and delay
        MAX_RETRIES = 5
        for retry_count in range(MAX_RETRIES):
            try:
                # Click the next button
                await next_button.click()
                # Wait for the next page to load
                await page.wait_for_selector('.adept-product-display__title-container', timeout=800000)
                # Add a delay
                await asyncio.sleep(random.uniform(2, 5))
                # Break out of the loop if successful
                break
            except:
                # If an exception occurs, retry up to MAX_RETRIES times
                if retry_count == MAX_RETRIES - 1:
                    raise Exception("Clicking next button timed out")
                # Wait for a random amount of time between 1 and 3 seconds before retrying
                await asyncio.sleep(random.uniform(1, 3))
    return product_urls
Here, we define the Python function 'get_product_urls' to extract product URLs from a web page. The function uses the Playwright library to automate the browser and collect the resulting product URLs. It takes two parameters, browser and page, which are instances of the Playwright Browser and Page classes, respectively. It first uses 'page.query_selector_all()' to find all the elements on the page that contain product links, then loops through them and extracts the href attribute of each, which holds the URL of the product page. The function also checks whether there is a "next" button on the page. If there is, it clicks the button, waits for the next page to load, and repeats the process; the loop continues until there is no next button, at which point all the relevant product URLs have been collected.
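Depending on the markup, the href values can be relative paths rather than full URLs. If that is the case on your run, a small post-processing step like the sketch below normalizes them; the base URL here is an assumption you may need to adjust.
from urllib.parse import urljoin

BASE_URL = "https://www.decathlon.com"  # assumed base URL; adjust if needed

def normalize_urls(raw_urls):
    # Convert relative hrefs (e.g. "/products/...") into absolute URLs and drop empty values
    return [urljoin(BASE_URL, u) for u in raw_urls if u]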

Here we are scraping product URLs based on the product category. Therefore, we need to first click on the product category button to expand the list of available categories and then click on each category to filter the products accordingly.
async def filter_products(browser, page):
    # Expand the product category section
    category_button = await page.query_selector('.adept-filter-list__title[aria-label="product category Filter"]')
    await category_button.click(timeout=600000)
    # Check if category section is already expanded
    is_expanded = await category_button.get_attribute('aria-expanded')
    if is_expanded == 'false':
        await category_button.click(timeout=600000)
    else:
        pass
    # Click the "Show All" button to show all categories
    show_all_button = await page.query_selector('.adept-filter__checkbox__show-toggle')
    await show_all_button.click(timeout=400000)
    # Check if "Show All" button is already clicked
    show_all_text = await show_all_button.text_content()
    if show_all_text == 'Show All':
        await show_all_button.click(timeout=400000)
    else:
        pass
    # Wait for the category list to load
    await page.wait_for_selector('.adept-checkbox__input-container', timeout=400000)
    # Define a list of checkbox labels to select and clear
    categories = ["Base Layer", "Cap", "Cropped Leggings", "Cycling Shorts", "Fleece", "Gloves", "Legging 7/8",
                  "Long-Sleeved T-Shirt", "Padded Jacket", "Short-Sleeved Jersey", "Down Jacket", "Socks",
                  "Sports Bra", "Sweatshirt", "Tank", "Tracksuit", "Trousers/Pants", "Windbreaker", "Zip-Off Pants",
                  "Shoes", "Sunglasses", "Sport Bag", "Fitness Mat", "Shorts", "T-Shirt", "Jacket", "Leggings"]
    product_urls = []
    # Iterate over the list of categories to select and clear
    for category in categories:
        # Select the checkbox
        checkbox = await page.query_selector(f'label.adept-checkbox__label:has-text("{category}")')
        await checkbox.click(timeout=600000)
        # Check if checkbox is already selected
        is_checked = await checkbox.get_attribute('aria-checked')
        if is_checked == 'false':
            await checkbox.click(timeout=600000)
        else:
            print(f"{category} checkbox is checked.")
        # Wait for the page to load
        await asyncio.sleep(10)
        # Get the list of product URLs
        product_urls += [(url, category) for url in await get_product_urls(browser, page)]
        # Clear the checkbox filter
        clear_filter_button = await page.query_selector(
            f'button.adept-selection-list__close[aria-label="Clear {category.lower()} Filter"]')
        if clear_filter_button is not None:
            await clear_filter_button.click(timeout=600000)
            print(f"{category} filter cleared.")
        else:
            clear_buttons = await page.query_selector_all('button[aria-label^="Clear"]')
            for button in clear_buttons:
                await button.click(timeout=600000)
            print(f"{category} filter cleared.")
        # Wait for the page to load
        await asyncio.sleep(10)
    return product_urls
Here, we use the Python function ‘filter_products’ to filter the products on the Decathlon website by category and return a list of product URLs along with their respective categories. The function first expands the product category section on the website and then clicks the "Show All" button to display all available subcategories. It then defines a list of subcategories, iterates over them, and selects the checkbox corresponding to each one to filter the products accordingly. For each subcategory, it waits for the page to load and retrieves the list of product URLs using the ‘get_product_urls’ function. After each subcategory is processed, the function clears its filter by clicking the corresponding "Clear" button before moving on to the next one.
Also Read: Scraping Amazon Product Category Without Getting Blocked
Information Extraction
In this step, we will identify the desired attributes on the website and extract the Product Name, Brand, Number of Reviews, Rating, MRP, Sale Price, and other details of each product.
Extraction of Product Name
The next step is the extraction of the names of the products from the web pages.
async def get_product_name(page):
    try:
        # Find the product title element and get its text content
        product_name_elem = await page.query_selector(".de-u-textGrow1.de-u-md-textGrow2.de-u-textMedium.de-u-spaceBottom06")
        product_name = await product_name_elem.text_content()
    except:
        # If an exception occurs, set the product name as "Not Available"
        product_name = "Not Available"
    # Remove any leading/trailing whitespace from the product name and return it
    return product_name.strip()
Here we define an asynchronous function 'get_product_name' that takes a page argument, which represents a Playwright page object. The function attempts to find the product name element on the page using the query_selector() method of the page object and the corresponding CSS selector. If the element is found, the function retrieves its text content and returns it as a string. If an exception occurs while finding or reading the element, for example because it is not present on the page, the function sets the product_name variable to "Not Available."
Extraction of Brand of the Products
The next step is the extraction of the brand of the products from the web pages.
async def get_brand_name(page):
    try:
        # Find the SVG title element and get its text content
        brand_name_elem = await page.query_selector("svg[role='img'] title")
        brand_name = await brand_name_elem.text_content()
    except:
        # If an exception occurs, set the brand name as "Not Available"
        brand_name = "Not Available"
    # Return the brand name
    return brand_name
Similar to the extraction of the product name, the function get_brand_name extracts the brand name of a product from a web page. It tries to locate the brand name element using a CSS selector that targets the element containing the brand name. If the element is found, the function extracts its text content using the text_content() method and assigns it to the brand_name variable. The extracted text contains both the parent brand and the sub-brand; for example, in "Decathlon Wedze", Wedze is one of Decathlon's sub-brands. If an exception occurs while finding the brand name element or extracting its text content, the function sets the brand name as "Not Available."
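If you only need the sub-brand on its own, a small post-processing step can split it off. This is a minimal sketch, assuming the brand text always starts with "Decathlon" followed by the sub-brand name:
def split_brand(brand_name):
    # "Decathlon Wedze" -> ("Decathlon", "Wedze"); plain "Decathlon" -> ("Decathlon", "")
    parts = brand_name.split(maxsplit=1)
    parent = parts[0] if parts else "Not Available"
    sub_brand = parts[1] if len(parts) > 1 else ""
    return parent, sub_brand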
Similarly, we can extract the other attributes, such as the MRP, sale price, number of reviews, ratings, color, features, and product information, using the same technique. For each attribute, define a separate function that uses the ‘query_selector’ and ‘text_content’ methods (or a similar method) to select the relevant element on the page and extract the desired information, and adjust the CSS selectors to match the structure of the web page you are scraping.
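Since these attribute functions all follow the same try/except pattern, you could also factor it into a small generic helper. The sketch below is an illustration rather than part of the original script; the selector strings in the usage comment are placeholders you would swap for the ones matching your target attribute.
async def get_text(page, selectors, default="Not Available"):
    # Try each CSS selector in order and return the first non-empty text content found
    for selector in selectors:
        element = await page.query_selector(selector)
        if element:
            text = (await element.text_content() or "").strip()
            if text:
                return text
    return default

# Example usage (selector is illustrative):
# sale_price = await get_text(page, [".js-de-CurrentPrice > .js-de-PriceAmount"])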
Extraction of MRP of the Products
async def get_MRP(page):
    try:
        # Find the crossed-out MRP element and get its text content
        MRP_elem = await page.query_selector(".js-de-CrossedOutPrice > .js-de-PriceAmount")
        MRP = await MRP_elem.inner_text()
    except:
        # If there is no crossed-out price, fall back to the current price
        try:
            # Get current price element and extract text content
            MRP_elem = await page.query_selector(".js-de-CurrentPrice > .js-de-PriceAmount")
            MRP = await MRP_elem.text_content()
        except:
            # Set MRP to "Not Available" if element not found or text content cannot be extracted
            MRP = "Not Available"
    # Return the MRP
    return MRP
Extraction of Sale Price of the Products
async def get_sale_price(page):
    try:
        # Get sale price element and extract text content
        sale_price_element = await page.query_selector(".js-de-CurrentPrice > .js-de-PriceAmount")
        sale_price = await sale_price_element.text_content()
    except:
        # Set sale price to "Not Available" if element not found or text content cannot be extracted
        sale_price = "Not Available"
    return sale_price
Extraction of the Number of Reviews for the Products
async def get_num_reviews(page):
    try:
        # Find the number of reviews element and get its text content
        num_reviews_elem = await page.wait_for_selector("span.de-u-textMedium.de-u-textSelectNone.de-u-textBlue")
        num_reviews = await num_reviews_elem.inner_text()
        num_reviews = num_reviews.split(" ")[0]
    except:
        num_reviews = "Not Available"
    # Return the number of reviews
    return num_reviews
Extraction of Ratings of the Products
async def get_star_rating(page):
    try:
        # Find the star rating element and get its text content
        star_rating_elem = await page.wait_for_selector(".de-StarRating-fill + .de-u-hiddenVisually")
        star_rating_text = await star_rating_elem.inner_text()
        star_rating = star_rating_text.split(" ")[2]
    except:
        star_rating = "Not Available"
    # Return the star rating
    return star_rating
Extraction of the color of the products
async def get_colour(page):
    try:
        # Get color element and extract text content
        color_element = await page.query_selector("div.de-u-spaceTop06.de-u-lineHeight1.de-u-hidden.de-u-md-block.de-u-spaceBottom2 strong + span.js-de-ColorInfo")
        color = await color_element.inner_text()
    except:
        try:
            # Find the color element and get its text content
            color_elem = await page.query_selector("div.de-u-spaceTop06.de-u-lineHeight1 strong + span.js-de-ColorInfo")
            color = await color_elem.inner_text()
        except:
            # If an exception occurs, set the color as "Not Available"
            color = "Not Available"
    return color
Extraction of Features of the Products
async def get_Product_description(page):
    try:
        # Get the main FeaturesContainer section
        FeaturesContainer = await page.query_selector(".FeaturesContainer")
        # Extract text content for the main section
        text = await FeaturesContainer.text_content()
        # Split the text into a list by newline characters
        Product_description = text.split('\n')
        # Remove any empty strings from the list
        Product_description = list(filter(None, Product_description))
        Product_description = [bp.strip() for bp in Product_description if bp.strip() and "A photo" not in bp]
    except:
        # Set Product_description to "Not Available" if the section is not found or there's an error
        Product_description = "Not Available"
    return Product_description
This is an asynchronous function that extracts the product description section from a Decathlon product page. It splits the text into lines, strips whitespace, and uses a list comprehension to filter out empty lines and any lines containing the unwanted phrase "A photo". Finally, it returns the resulting list of strings as the product description.
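Because get_Product_description returns a list, it can be convenient to flatten it into a single string before writing the CSV. A minimal sketch (the separator is an arbitrary choice, not something the original script does):
def description_to_text(product_description):
    # Join the list of description lines into one CSV-friendly string
    if isinstance(product_description, list):
        return " | ".join(product_description)
    return product_description  # already "Not Available"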
Extraction of Product Information
async def get_ProductInformation(page):
    try:
        # Get ProductInformation section element
        ProductInformation_element = await page.query_selector(".de-ProductInformation--multispec")
        # Get all ProductInformation entry elements
        ProductInformation_entries = await ProductInformation_element.query_selector_all(".de-ProductInformation-entry")
        # Loop through each entry and extract the text content of the "name" and "value" elements
        ProductInformation = {}
        for entry in ProductInformation_entries:
            name_element = await entry.query_selector("[itemprop=name]")
            name = await name_element.text_content()
            value_element = await entry.query_selector("[itemprop=value]")
            value = await value_element.text_content()
            # Remove newline characters from the name and value strings
            name = name.replace("\n", "")
            value = value.replace("\n", "")
            # Add name-value pair to the ProductInformation dictionary
            ProductInformation[name] = value
    except:
        # Set ProductInformation to "Not Available" if element not found or text content cannot be extracted
        ProductInformation = {"Not Available": "Not Available"}
    return ProductInformation
The code defines an asynchronous function called get_ProductInformation that takes a page object as its parameter and extracts product information from Decathlon's website. The function loops through each product information entry and extracts the text content of the "name" and "value" elements using the text_content method. It then removes any newline characters from the extracted strings using the replace method and adds the name-value pair to a dictionary called ProductInformation. If an exception occurs, such as when the element cannot be found or the text content cannot be extracted, the code sets the ProductInformation dictionary to "Not Available."
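The dictionary is written to the CSV as-is, which pandas renders as a Python dict literal. If you prefer a cleaner cell value, one option (an addition of ours, not part of the original script) is to serialize it to JSON first:
import json

def product_information_to_json(product_information):
    # Serialize the name/value pairs to a JSON string for the CSV cell
    return json.dumps(product_information, ensure_ascii=False)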
Also Read: How to Scrape Product Information from Costco using Python
Request Retry with Maximum Retry Limit
Request retry is a crucial aspect of web scraping as it helps to handle temporary network errors or unexpected responses from the website. The aim is to send the request again if it fails the first time to increase the chances of success.
Before navigating to a URL, the script implements a retry mechanism in case the request times out. It does so with a while loop that keeps trying to navigate to the URL until either the request succeeds or the maximum number of retries is reached, at which point the script raises an exception. The function below performs a request to a given URL and retries it if it fails, which is useful when scraping web pages, since requests may time out or fail due to network issues.
async def perform_request_with_retry(page, url):
    # set maximum retries
    MAX_RETRIES = 5
    # initialize retry counter
    retry_count = 0
    # loop until maximum retries are reached
    while retry_count < MAX_RETRIES:
        try:
            # try to navigate to the URL using the page object with a generous timeout
            await page.goto(url, timeout=1000000)
            # break out of the loop if the request was successful
            break
        except:
            # if an exception occurs, increment the retry counter
            retry_count += 1
            # if maximum retries have been reached, raise an exception
            if retry_count == MAX_RETRIES:
                raise Exception("Request timed out")
            # wait for a random amount of time between 1 and 10 seconds before retrying
            await asyncio.sleep(random.uniform(1, 10))
Here, the function performs a request to a specific URL using the ‘goto’ method of the page object from the Playwright library. When a request fails, the function tries it again up to the allotted number of times; the maximum is defined by the MAX_RETRIES constant as five. Between each retry, the function uses asyncio.sleep to wait for a random duration of 1 to 10 seconds. This prevents the code from retrying the request too quickly, which could cause it to fail even more often. The perform_request_with_retry function takes two arguments: page and url. The page argument is the Playwright page object used to perform the request, and the url argument is the address to which the request is made.
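A common refinement, not used in the original script, is to back off exponentially between retries so that repeated failures wait progressively longer. A hedged sketch of the same idea with exponential backoff plus jitter:
async def perform_request_with_backoff(page, url, max_retries=5):
    # Retry page.goto with exponentially growing, jittered delays (roughly 1-2s, 2-4s, 4-8s, ...)
    for attempt in range(max_retries):
        try:
            await page.goto(url, timeout=60000)
            return
        except Exception:
            if attempt == max_retries - 1:
                raise Exception(f"Request to {url} failed after {max_retries} attempts")
            delay = (2 ** attempt) * random.uniform(1, 2)
            await asyncio.sleep(delay)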
Extracting and Saving the Product Data
In the next step, we call these functions, collect the scraped data into a list, and save it to a CSV file.
async def main():
    # Launch a Firefox browser using Playwright
    async with async_playwright() as pw:
        browser = await pw.firefox.launch()
        page = await browser.new_page()
        # Make a request to the Decathlon search page and extract the product URLs
        await perform_request_with_retry(page, 'https://www.decathlon.com/search?SOLD_OUT=%7B%22label_text%22%3A%22SOLD_OUT%22%2C%22value%22%3A%7B%22%24eq%22%3A%22FALSE%22%7D%7D&query_history=%5B%22Apparel%22%5D&q=Apparel&category_history=%5B%5D&sorting=NATURAL|desc')
        product_urls = await filter_products(browser, page)
        # Print the list of URLs
        print(product_urls)
        print(len(product_urls))
        data = []
        # Loop through each product URL and scrape the necessary information
        for i, (url, category) in enumerate(product_urls):
            await perform_request_with_retry(page, url)
            product_name = await get_product_name(page)
            brand = await get_brand_name(page)
            star_rating = await get_star_rating(page)
            num_reviews = await get_num_reviews(page)
            MRP = await get_MRP(page)
            sale_price = await get_sale_price(page)
            colour = await get_colour(page)
            ProductInformation = await get_ProductInformation(page)
            Product_description = await get_Product_description(page)
            # Print progress message after processing every 10 product URLs
            if i % 10 == 0 and i > 0:
                print(f"Processed {i} links.")
            # Print completion message after all product URLs have been processed
            if i == len(product_urls) - 1:
                print(f"All information for url {i} has been scraped.")
            # Add the scraped information to a list
            data.append((url, category, product_name, brand, star_rating, num_reviews, MRP, sale_price, colour,
                         ProductInformation, Product_description))
        # Convert the list of tuples to a Pandas DataFrame and save it to a CSV file
        df = pd.DataFrame(data,
                          columns=['product_url', 'category', 'product_name', 'brand', 'star_rating', 'number_of_reviews',
                                   'MRP', 'sale_price', 'colour', 'product information', 'Product description'])
        df.to_csv('product_data.csv', index=False)
        print('CSV file has been written successfully.')
        # Close the browser
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
This script uses an asynchronous function called "main" to scrape product information from Decathlon pages. It uses the Playwright library to launch a Firefox browser and navigate to the Decathlon search page. The function gathers the URLs of each product using the "filter_products" function and stores them in a list called "product_urls". It then loops through each product URL, loads the product page using the "perform_request_with_retry" function, and extracts the product name, brand, star rating, number of reviews, MRP, sale price, color, features, and product information.
This information is stored as a tuple in a list called "data". The function also prints a progress message after processing every 10 product URLs and a completion message once all product URLs have been processed. The contents of the "data" list are then converted to a Pandas DataFrame and saved as a CSV file using the "to_csv" method. Finally, the browser is closed with "browser.close()". The script is executed by calling the "main" function via "asyncio.run(main())", which runs it as an asynchronous coroutine.
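Once product_data.csv has been written, a quick sanity check with pandas confirms the output; a minimal example:
df = pd.read_csv('product_data.csv')
print(df.shape)   # (number of products, number of columns)
print(df.head())  # preview the first few rows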
Conclusion
In today's fast-paced business landscape, data is king and web scraping is the key to unlocking its full potential. With the right data and tools, brands can gain a deep understanding of the market and make informed decisions that can drive growth and profitability.
In today's cut-throat business world, brands must gain any competitive edge they can to stay ahead of the pack. That's where web scraping comes in, providing companies with critical insights into market trends, pricing strategies, and competitor data.
By leveraging the power of tools like Playwright and Python, companies can extract valuable data from websites like Decathlon, providing them with a wealth of information on product offerings, pricing, and other key metrics. And when combined with the services of a leading web scraping company like Datahut, the results can be truly game-changing.
Datahut's bespoke web scraping solutions can help brands acquire the precise data they need to make informed decisions on everything from product development to marketing campaigns. By partnering with Datahut, brands can gain access to vast amounts of relevant data points, giving them a complete understanding of their industry and competition. From product names and descriptions to pricing, reviews, and more, Datahut's web scraping services can provide companies with a competitive edge that can help them make more informed decisions, streamline their operations, and ultimately drive growth and profitability.
Ready to explore the power of web data for your brand? Contact Datahut, your web data scraping experts
Related Reading: Scraping Amazon Best Seller Data using Python: A Step-by-Step Guide