By Thasni M A

Scraping Amazon Today's Deals Musical Instruments Data using Playwright Python


Amazon is an e-commerce giant that offers a vast range of products, from electronics to groceries. One of its popular sections is "Today's Deals," which features time-limited discounts across categories such as electronics, fashion, home goods, toys, and musical instruments. The discounts on musical instruments can range from a few percentage points to more than 50% off the original price.


Amazon provides a broad selection of both traditional and electronic musical instruments from top brands and manufacturers. These can include acoustic and electric guitars, drums, keyboards, orchestral instruments, and a variety of accessories such as cases, stands, and sheet music. Additionally, Amazon's "Used & Collectible" section offers customers the chance to purchase pre-owned instruments at discounted prices.


This blog will provide a step-by-step guide on how to use Playwright Python to scrape musical instrument data from Today's Deals on Amazon and save it as a CSV file. We will extract the following data attributes from the individual product pages on Amazon.


  • Product URL - The URL of the musical instrument's product page.

  • Product Name - The name of the musical instrument.

  • Brand - The brand of the musical instrument.

  • MRP - The MRP (original price) of the musical instrument.

  • Offer Price - The discounted price of the musical instrument.

  • Number of Reviews - The number of customer reviews for the musical instrument.

  • Rating - The star rating of the musical instrument.

  • Size - The size of the musical instrument.

  • Color - The color of the musical instrument.

  • Material - The material of the musical instrument.

  • Compatible Devices - The devices the musical instrument is compatible with.

  • Connectivity Technology - The technology through which the musical instrument connects.

  • Connector Type - The type of connector.

Also Read: Scraping Amazon product listing: All You Need to Know


Playwright Python


In this tutorial, we will be using Playwright Python to extract the data. Playwright is an open-source tool for automating web browsers. With Playwright, you can automate tasks such as navigating to a web page, filling out forms, clicking buttons, and verifying that certain elements are displayed on the page.


One of Playwright's important characteristics is its support for multiple browsers, including Chromium, Firefox, and WebKit. As a result, you can create tests that run on several browsers, ensuring better coverage and lowering the risk of compatibility problems. Additionally, Playwright has built-in tools for handling common testing challenges, such as waiting for elements to load, dealing with network errors, and debugging issues in the browser.


Another advantage of Playwright is that it supports parallel testing, which allows you to run numerous tests simultaneously and greatly speeds up a test suite. This is especially helpful for large or complex test suites that take a long time to run. As a replacement for existing web automation tools like Selenium, it is becoming increasingly popular for its usability, performance, and support for modern web technologies.


Here's a step-by-step guide for using Playwright in Python to scrape the musical instruments data from Today's Deals on Amazon.


Also Read: How to Scrape Product Information from Costco using Python


Importing Required Libraries


To start, we need to import the required libraries that will interact with the website and extract the information we need.


# Importing libraries
import random
import asyncio
import pandas as pd
from playwright.async_api import async_playwright
  • 'random' - Used for generating random numbers; in this script it randomizes the delay between retry attempts.

  • 'asyncio' - Used for handling asynchronous programming in Python, which is necessary when using Playwright's asynchronous API.

  • 'pandas' - Used for data analysis and manipulation. In this tutorial, it stores the scraped data and writes it out as a CSV file.

  • 'async_playwright' - The asynchronous API for Playwright, used in this script to automate the browser. The asynchronous API allows you to perform multiple operations concurrently, making the scraper faster and more efficient.


In short, these libraries let us automate the browser with Playwright, handle asynchronous execution, randomize delays between requests, and store and manipulate the data extracted from the web pages.
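
Before running the full scraper, it is worth confirming that Playwright and its browser binaries are installed correctly (via 'pip install playwright' followed by 'playwright install'). The snippet below is a minimal setup check, not part of the scraper itself: it launches Chromium, opens a page, and prints its title.

# Minimal setup check: launch Chromium, open a page and print its title
import asyncio
from playwright.async_api import async_playwright

async def check_setup():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")
        print(await page.title())
        await browser.close()

asyncio.run(check_setup())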



Extraction of Product Links


The second step is extracting the product links. Product link extraction is the process of collecting and organizing the URLs of the products listed on a web page or online platform.

# Function to extract the product links
async def get_product_links(page):
    # Select all elements
    all_items = await page.query_selector_all('.a-link-normal.DealCardDynamic-module__linkOutlineOffset_2XU8RDGmNg2HG1E-ESseNq')
    product_links = []
    # Loop through each item and extract the href attribute
    for item in all_items:
        link = await item.get_attribute('href')
        product_links.append(link)
    # Return the list of product links
    return product_links

Here we used the Python function ‘get_product_links’ to extract the product links from the web page. The function is asynchronous, so it can wait on slow page operations without blocking the main thread of execution. It takes a single argument, page, which is a Playwright page object. The function uses the ‘query_selector_all’ method to select all elements on the page that match the specified CSS selector, which identifies the deal cards containing the product links. It then loops through each selected element and uses the ‘get_attribute’ method to extract the href attribute, which contains the URL of the product. Each extracted URL is appended to the ‘product_links’ list, which the function returns. Note that auto-generated class names like the one in this selector change frequently, so it may need updating whenever Amazon alters the page layout.
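
One practical caveat, shown here as a sketch rather than part of the original script: the href attributes on deal cards are sometimes relative paths rather than full URLs. A helper like the hypothetical ‘normalize_links’ below resolves them against the site root before they are visited; the BASE_URL value assumes the amazon.in storefront used in this post.

# Hypothetical helper: resolve relative hrefs against the site root
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.in"

def normalize_links(links):
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return [urljoin(BASE_URL, link) for link in links if link]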


Information Extraction

In this step, we identify the desired attributes on the website and extract the Product Name, Brand, Number of Reviews, Rating, Original Price, Offer Price, and other details of each musical instrument.


Extraction of Product Name

The extraction of the product names follows a similar process to the extraction of the product links. Here the goal is to select the element on each product page that contains the product name and extract its text content.


# Function to extract the product name
async def get_product_name(page):
    # Try to extract the product name from the page
    try:
        product_name = await (await page.query_selector("#productTitle")).text_content()
    # If extraction fails, leave product name as "Not Available"
    except:
        product_name = "Not Available"
    return product_name

Here we used an asynchronous function ‘get_product_name’ to extract the product name from the product page. The function uses the ‘query_selector’ method to select the element on the page that matches the specific CSS selector, which identifies the element containing the product title. It then uses the ‘text_content’ method of the selected element to extract the product name from the page. To handle errors that may occur during the extraction, the code uses a ‘try-except’ block. If the product name is successfully extracted, it is returned as a string; if the extraction fails, the function returns "Not Available", which indicates that the product name was not found on the page.


Extraction of Brand of the Products


# Function to extract the brand of the product
async def get_brand(page):
    # Try to extract the brand from the page
    try:
        brand = await (await page.query_selector("tr.po-brand td span.po-break-word")).text_content()
    # If extraction fails, leave brand as "Not Available"
    except:
        brand = "Not Available"
    return brand

Similar to the extraction of the product name, here we use an asynchronous function ‘get_brand’ to extract the corresponding brand of the product from the web page. The function uses the ‘query_selector’ method to select the element on the page that matches the specific CSS selector, which identifies the brand row in the product overview table.

Next, the function uses the ‘text_content’ method of the selected element to extract the brand name from the page. To handle errors that may occur during the extraction, the code uses a try-except block. If the brand is successfully extracted, it is returned as a string; if the extraction fails, the function returns "Not Available", which indicates that the brand of the corresponding product was not found on the page.


Similarly, we can extract the other attributes: the MRP, offer price, number of reviews, rating, size, color, material, compatible devices, connectivity technology, and connector type, applying the same technique used in the previous steps. For each attribute, we define a separate function that uses the ‘query_selector’ method to select the relevant element on the page and the ‘text_content’ method to extract the desired information. The CSS selectors used in these functions must be adapted to the structure of the web page you are scraping.
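
Since every attribute function below follows the same select-then-read pattern, one possible refactor (not from the original post, offered only as a sketch) is a single generic helper that takes a CSS selector and a default value:

# Hypothetical refactor: one helper for all text attributes
async def get_text(page, selector, default="Not Available"):
    # Return the stripped text of the first matching element, or the
    # default if the selector matches nothing or the lookup fails
    try:
        element = await page.query_selector(selector)
        return (await element.text_content()).strip()
    except Exception:
        return default

With this helper, the brand lookup above would reduce to: brand = await get_text(page, "tr.po-brand td span.po-break-word").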


Extraction of MRP of the Products


# Function to extract the MRP of the product
async def get_original_price(page):
    # Try to extract the original price from the page
    try:
        original_price = await (await page.query_selector(".a-price.a-text-price")).text_content()
        original_price = original_price.split("₹")[1]
    # If extraction fails, leave original price as "Not Available"
    except:
        original_price = "Not Available"
    return original_price
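
Note that the string returned by this function still contains comma separators (for example "4,999"), since only the rupee symbol is split off. If you need numeric prices for analysis, a small helper like the hypothetical ‘parse_price’ below can clean them up; it assumes the rupee symbol and comma grouping used on amazon.in.

# Hypothetical cleanup: "4,999" or "₹4,999.00" -> 4999.0
def parse_price(price_text):
    cleaned = price_text.replace("₹", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None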


Also Read: How to Build an Amazon Price Tracker using Python


Extraction of Offer Price of the Products


# Function to extract the offer price of the product
async def get_offer_price(page):
    # Try to extract the offer price from the page
    try:
        offer_price = await (await page.query_selector(".a-price-whole")).text_content()
    # If extraction fails, leave offer price as "Not Available"
    except:
        offer_price = "Not Available"
    return offer_price


Extraction of the Number of Reviews for the Products


# Function to extract the number of ratings of the product
async def get_num_ratings(page):
    # Try to extract the number of ratings from the page
    try:
        ratings_text = await (await page.query_selector("#acrCustomerReviewText")).text_content()
        num_ratings = ratings_text.split(" ")[0]
    # If extraction fails, leave number of ratings as "Not Available"
    except:
        num_ratings = "Not Available"
    return num_ratings
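
As with the prices, the extracted count keeps its comma separators (for example "1,234"). A small helper like the hypothetical ‘parse_count’ below converts it to an integer when needed:

# Hypothetical cleanup: "1,234" -> 1234
def parse_count(count_text):
    digits = count_text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None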


Extraction of Ratings of the Products


# Function to extract the star rating of the product
async def get_star_rating(page):
    # Try to extract the star rating from the page
    try:
        star_rating = await (await page.query_selector(".a-icon-alt")).text_content()
        star_rating = star_rating.split(" ")[0]
    # If extraction fails, leave star rating as "Not Available"
    except:
        star_rating = "Not Available"
    return star_rating


Extraction of Product Size


# Function to extract the size of the product
async def get_size(page):
    # Try to extract the size from the page
    try:
        size = await (await page.query_selector("tr.po-size td span.po-break-word")).text_content()
    # If extraction fails, leave size as "Not Available"
    except:
        size = "Not Available"
    return size

Extraction of Colors of the Products


# Function to extract the color of the product
async def get_color(page):
    # Try to extract the color from the page
    try:
        color = await (await page.query_selector("tr.po-color td span.po-break-word")).text_content()
    # If extraction fails, leave color as "Not Available"
    except:
        color = "Not Available"
    return color

Extraction of Materials of the Products


# Function to extract the material of the product
async def get_material(page):
    # Try to extract the material from the page
    try:
        material = await (await page.query_selector("tr.po-material td span.po-break-word")).text_content()
    # If extraction fails, leave material as "Not Available"
    except:
        material = "Not Available"
    return material

Extraction of Compatible Devices for the Products


# Function to extract the compatible devices of the product
async def get_compatible_devices(page):
    # Try to extract the compatible devices from the page
    try:
        compatible_devices = await (await page.query_selector("tr.po-compatible_devices td span.po-break-word")).text_content()
    # If extraction fails, leave compatible devices as "Not Available"
    except:
        compatible_devices = "Not Available"
    return compatible_devices

Extraction of Connectivity Technology for the Products


# Function to extract the connectivity technology of the product
async def get_connectivity_technology(page):
    # Try to extract the connectivity technology from the page
    try:
        connectivity_technology = await (await page.query_selector("tr.po-connectivity_technology td span.po-break-word")).text_content()
    # If extraction fails, leave connectivity technology as "Not Available"
    except:
        connectivity_technology = "Not Available"
    return connectivity_technology

Extraction of Connector Type for the Products


# Function to extract the connector type of the product
async def get_connector_type(page):
    # Try to extract the connector type from the page
    try:
        connector_type = await (await page.query_selector("tr.po-connector_type td span.po-break-word")).text_content()
    # If extraction fails, leave connector_type as "Not Available"
    except:
        connector_type = "Not Available"
    return connector_type


Request Retry with Maximum Retry Limit


Request retry is a crucial aspect of web scraping as it helps to handle temporary network errors or unexpected responses from the website. The aim is to send the request again if it fails the first time to increase the chances of success.


Before navigating to each URL, the script implements a retry mechanism in case a request times out. It does so by using a while loop that keeps trying to navigate to the URL until either the request succeeds or the maximum number of retries has been reached, at which point the script raises an exception. This is useful when scraping web pages, as requests sometimes time out or fail due to network issues.


# Function to perform a request and retry the request if it fails, with a maximum of 5 retries
async def perform_request_with_retry(page, link):
    MAX_RETRIES = 5
    retry_count = 0
    while retry_count < MAX_RETRIES:
        try:
            # Make a request to the link
            await page.goto(link)
            # If the request is successful, break the loop
            break
        except:
            retry_count += 1
            if retry_count == MAX_RETRIES:
                # Raise an exception if the maximum number of retries is reached
                raise Exception("Request timed out")
            # Sleep for a random duration between 1 and 5 seconds
            await asyncio.sleep(random.uniform(1, 5))

This function performs a request to a given link using the ‘goto’ method of the Playwright page object. When a request fails, the function retries it, up to the limit defined by the MAX_RETRIES constant (5). Between retries, the function uses the asyncio.sleep method to wait for a random duration of 1 to 5 seconds; this prevents the code from retrying too quickly, which could cause the request to fail again. The perform_request_with_retry function takes two arguments: page, the Playwright page object used to perform the request, and link, the URL to which the request is made.
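
A common variation, shown here only as a sketch rather than part of the original script, is exponential backoff: the wait doubles after each failed attempt, with a little random jitter so retries do not land at fixed intervals.

# Alternative sketch: exponential backoff with jitter
# (uses the asyncio and random imports from the top of the script)
async def perform_request_with_backoff(page, link, max_retries=5):
    for attempt in range(max_retries):
        try:
            await page.goto(link)
            return
        except Exception:
            if attempt == max_retries - 1:
                raise Exception(f"Request to {link} failed after {max_retries} retries")
            # Wait 2^attempt seconds plus up to one second of random jitter
            await asyncio.sleep(2 ** attempt + random.random())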


Extracting and Saving the Product Data


In the next step, we call the functions defined above and collect the extracted data in a list.

# Main function to extract and save product data
async def main():
    # Start an async session with Playwright
    async with async_playwright() as pw:
        # Launch a new browser instance
        browser = await pw.chromium.launch()
        # Open a new page in the browser
        page = await browser.new_page()
        # Navigate to the Amazon deal page
        await perform_request_with_retry(page, 'https://www.amazon.in/gp/goldbox?deals-widget=%257B%2522version%2522%253A1%252C%2522viewIndex%2522%253A0%252C%2522presetId%2522%253A%252215C82F45284EDD496F94A2C368D1B4BD%2522%252C%2522sorting%2522%253A%2522BY_SCORE%2522%257D')
        # Get the links to each product
        product_links = await get_product_links(page)

        # Create an empty list to store the extracted data
        data = []
        # Iterate over the product links
        for link in product_links:
            # Load the product page
            await perform_request_with_retry(page, link)
            # Extract the product information
            
            # Product Name
            product_name = await get_product_name(page)
            
            # Brand
            brand = await get_brand(page)
            
            # Star Rating
            star_rating = await get_star_rating(page)
            
            # Number of Ratings
            num_ratings = await get_num_ratings(page)
            
            # Original Price
            original_price = await get_original_price(page)
            
            # Offer Price
            offer_price = await get_offer_price(page)
            
            # Color
            color = await get_color(page)
            
            # Size
            size = await get_size(page)
            
            # Material
            material = await get_material(page)
            
            # Connectivity Technology
            connectivity_technology = await get_connectivity_technology(page)
            
            # Connector Type
            connector_type = await get_connector_type(page)
            
            # Compatible Devices
            compatible_devices = await get_compatible_devices(page)
            
            
            # Add the extracted data to the list
            data.append((link, product_name, brand, star_rating, num_ratings, original_price, offer_price, color,
                         size, material, connectivity_technology, connector_type, compatible_devices))

        # Create a pandas dataframe from the extracted data
        df = pd.DataFrame(data, columns=['Product Link', 'Product Name', 'Brand', 'Star Rating', 'Number of Ratings', 'Original Price', 'Offer Price', 
                                         'Color', 'Size', 'Material', 'Connectivity_technology', 'Connector_type', 'Compatible_devices'])
        # Save the data to a CSV file
        df.to_csv('product_details5.csv', index=False)

        # Notify the user that the file has been saved
        print('CSV file has been written successfully.')
        # Close the browser instance
        await browser.close()

We use an asynchronous function, ‘main’, to scrape the product information from the Amazon Today's Deals page. The function starts by launching a new browser instance (Chromium, in our case) with Playwright and opening a new page in it. We then navigate to the Amazon Today's Deals page using the perform_request_with_retry function, which requests the link and retries if the request fails, with a maximum of 5 retries (the number of retries can be changed). This ensures that the request to the Today's Deals page succeeds.


Once the deals page is loaded, we extract the links to each product using the ‘get_product_links’ function defined earlier. The scraper then iterates over the product links, loading each product page with ‘perform_request_with_retry’ and extracting all the attributes, which are stored as a tuple. The list of tuples is used to create a pandas dataframe, which is then exported to a CSV file using its to_csv method.

Finally, we call the ‘main’ function:


# Entry point to the script
if __name__ == '__main__':
    asyncio.run(main())

The ‘asyncio.run(main())’ statement is used to run the main function as an asynchronous coroutine.
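
Note that asyncio.run starts a fresh event loop, so in an environment that already has one running (a Jupyter notebook, for instance) it raises a RuntimeError; there you would await main() directly instead.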


Also Read: Mastering XPath for Web Scraping: A Step-by-Step Tutorial


Conclusion


Scraping data from Amazon's Today's Deals section can be a useful technique to gather information about the products being offered at a discounted price. In this blog post, we explored how to use Playwright Python to scrape data from the musical instruments section of Today's Deals on Amazon. Following the steps outlined in this tutorial, you can easily adapt the code to scrape data from other sections of Amazon or other websites.


However, it is important to note that web scraping is a controversial practice, and it may be prohibited by the website you are scraping from. Always make sure to check the website's terms of service before attempting to scrape data from it, and respect any restrictions or limitations that they may have in place.


Overall, web scraping can be a powerful tool for gathering data and automating tasks, but it should be used ethically and responsibly. By following best practices and being respectful of website policies, you can use web scraping to your advantage and gain valuable insights from the data you collect.


Also Read: Is Web Data Scraping Legal?


Looking to acquire Amazon data for your brand? Contact Datahut, your web scraping experts.

