Thasni M A
Scraping Amazon Today's Deals Musical Instruments Data using Playwright Python
Amazon is an e-commerce giant that offers a vast range of products, from electronics to groceries. One of its popular sections is "Today's Deals," which features time-limited discounts on a variety of products, including musical instruments. These deals cover a broad range of categories, such as electronics, fashion, home goods, and toys. The discounts on musical instruments can vary from a few percentage points to more than 50% off the original price.
Amazon provides a broad selection of both traditional and electronic musical instruments from top brands and manufacturers. These can include acoustic and electric guitars, drums, keyboards, orchestral instruments, and a variety of accessories such as cases, stands, and sheet music. Additionally, Amazon's "Used & Collectible" section offers customers the chance to purchase pre-owned instruments at discounted prices.
This blog will provide a step-by-step guide on how to use Playwright Python to scrape musical instrument data from Today's Deals on Amazon and save it as a CSV file. We will be extracting the following data attributes from the individual pages of Amazon.
Product URL - The URL gets us to the target page of musical instruments.
Product Name - The name of the musical instruments.
Brand - The brand of musical instruments.
MRP - MRP of the musical instruments.
Offer Price - Offer Price of the musical instruments.
Number of Reviews - The number of reviews of musical instruments.
Rating - The rating of musical instruments.
Size - Size of the musical instruments.
Color - Color of the musical instruments.
Material - Material of the musical instruments.
Compatible Devices - Other devices that are compatible with musical instruments.
Connectivity Technology - The technology using which the musical instruments can be connected.
Connector Type - The type of the connector.
Playwright Python
In this tutorial, we will be using Playwright Python to extract the data. Playwright is an open-source tool for automating web browsers. With Playwright, you can automate tasks such as navigating to a web page, filling out forms, clicking buttons, and verifying that certain elements are displayed on the page.
One of the important characteristics of Playwright is its compatibility with multiple browsers, including Chromium, Firefox, and WebKit (the engine behind Safari). As a result, you can create tests that run on several browsers, ensuring improved coverage and lowering the possibility of compatibility problems. Additionally, Playwright has built-in tools for handling common testing challenges, such as waiting for elements to load, dealing with network errors, and debugging issues in the browser.
Another advantage of Playwright is that it supports parallel testing, which allows you to run numerous tests simultaneously and greatly speeds up a test suite. This is especially helpful for large or complex test suites that can take a long time to run. As a replacement for established web automation tools like Selenium, it is becoming increasingly popular for its usability, performance, and support for modern web technologies.
Here's a step-by-step guide for using Playwright in Python to scrape the musical instruments data from Today's Deals on Amazon.
Importing Required Libraries
To start our process, we need to import the required libraries that will interact with the website and extract the information we need.
# Importing libraries
import random
import asyncio
import pandas as pd
from playwright.async_api import async_playwright
'random' - This library is used for generating random numbers; in this script, it is used to pick a random delay between retried requests.
'asyncio' - This library handles asynchronous programming in Python, which is necessary when using Playwright's asynchronous API.
'pandas' - This library is used for data analysis and manipulation. In this tutorial, it is used to store and manipulate the data obtained from the web pages.
'async_playwright' - This is the asynchronous API for Playwright, which is used in this script to automate the browser. The asynchronous API allows you to perform multiple operations concurrently, which can make your scraper faster and more efficient.
Together, these libraries are used for automating browser interactions with Playwright: generating random delays, handling asynchronous programming, and storing and manipulating the extracted data.
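To see why the asynchronous API matters, here is a minimal sketch (independent of Playwright) showing how `asyncio.gather` runs several slow I/O-bound tasks concurrently, so the total wait is roughly the longest delay rather than the sum of all delays. The task names and delays are illustrative only.

```python
import asyncio

async def fetch(name, delay):
    # Simulate a slow I/O-bound task, such as loading a page
    await asyncio.sleep(delay)
    return name

async def run_all():
    # The three "requests" run concurrently, not one after another
    results = await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))
    return results

print(asyncio.run(run_all()))  # → ['a', 'b', 'c']
```

`asyncio.gather` preserves the order of its arguments in the result list, which is convenient when pairing results back up with inputs.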
Extraction of Product Links
The second step is extracting the resulting product links. Product link extraction is the process of collecting and organizing the URLs of products listed on a web page or online platform.
# Function to extract the product links
async def get_product_links(page):
    # Select all elements that match the deal card link selector
    all_items = await page.query_selector_all('.a-link-normal.DealCardDynamic-module__linkOutlineOffset_2XU8RDGmNg2HG1E-ESseNq')
    product_links = []
    # Loop through each item and extract the href attribute
    for item in all_items:
        link = await item.get_attribute('href')
        product_links.append(link)
    # Return the list of product links
    return product_links
Here we use the Python function ‘get_product_links’ to extract the product links from the web page. The function is asynchronous, so it can wait for lengthy operations without blocking the main thread of execution. It takes a single argument, page, which is an instance of a web page in Playwright. The function uses the ‘query_selector_all’ method to select all elements on the page that match the specific CSS selector; this selector identifies the elements that contain the product links. The function then loops through each of the selected elements and uses the ‘get_attribute’ method to extract the href attribute, which contains the URL of the product. Each extracted URL is appended to the list ‘product_links’.
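Note that the href attributes collected this way are often relative paths, and deal pages can list the same product more than once. A small post-processing helper, sketched below, converts them to absolute URLs and removes duplicates; the base URL and the example product paths are assumptions for illustration.

```python
from urllib.parse import urljoin

def normalize_links(hrefs, base="https://www.amazon.in"):
    # Convert relative hrefs to absolute URLs and drop duplicates
    # while preserving the original order
    seen = set()
    links = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

# Hypothetical hrefs as they might come back from get_attribute
print(normalize_links(["/dp/B0EXAMPLE?ref=deal",
                       "/dp/B0EXAMPLE?ref=deal",
                       "https://www.amazon.in/dp/B0OTHER"]))
```

Deduplicating before visiting each link avoids loading the same product page twice.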
In this step, we will identify the desired attributes on the website and extract the Product Name, Brand, Number of Reviews, Rating, Original Price, Offer Price, and other details of each musical instrument.
Extraction of Product Name
Extracting the product names is a similar process to extracting the product links. Here, our goal is to select the element on each product page that contains the product name and extract its text content.
# Function to extract the product name
async def get_product_name(page):
    # Try to extract the product name from the page
    try:
        product_name = await (await page.query_selector("#productTitle")).text_content()
    # If extraction fails, leave product name as "Not Available"
    except:
        product_name = "Not Available"
    return product_name
Here we use an asynchronous function ‘get_product_name’ to extract the product name from each product page. The function uses the ‘query_selector’ method to select the element on the page that matches the specific CSS selector, which identifies the element containing the product name. It then uses the ‘text_content’ method of the selected element to extract the product name from the page. To handle errors that may occur during extraction, the code uses a ‘try-except’ block. If the product name is successfully extracted, it is returned as a string. If the extraction fails, the function returns "Not Available", which indicates that the product name was not found on the page.
Extraction of Brand of the Products
# Function to extract the brand of the product
async def get_brand(page):
    # Try to extract the brand from the page
    try:
        brand = await (await page.query_selector("tr.po-brand td span.po-break-word")).text_content()
    # If extraction fails, leave brand as "Not Available"
    except:
        brand = "Not Available"
    return brand
Similar to the extraction of the product name, here we use an asynchronous function ‘get_brand’ to extract the corresponding brand of the product from the web page. The function uses the ‘query_selector’ method to select the element on the page that matches the specific CSS selector. This selector identifies the element that contains the brand of the corresponding product.
Next, the function uses the ‘text_content’ method of the selected element to extract the brand name from the page. To handle errors that may occur during extraction, the code uses a try-except block. If the brand of the product is successfully extracted, it is returned as a string; if the extraction fails, the function returns the string "Not Available", which indicates that the brand of the corresponding product was not found on the page.
Similarly, we can extract other attributes such as the MRP, offer price, number of reviews, rating, size, color, material, compatible devices, connectivity technology, and connector type, applying the same technique used in the previous steps. For each attribute you want to extract, define a separate function that uses the ‘query_selector’ method to select the relevant element on the page and the ‘text_content’ method (or a similar method) to extract the desired information. You will also need to adjust the CSS selectors used in the functions based on the structure of the web page you are scraping.
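Since these per-attribute functions differ only in their CSS selector, one possible consolidation, sketched below under the assumption that every field lives in the same product-overview table pattern, is a selector table plus a single generic helper. The dictionary name and helper are hypothetical, not part of the original script.

```python
# Hypothetical selector table: one entry per product-overview field
PRODUCT_OVERVIEW_SELECTORS = {
    "Brand": "tr.po-brand td span.po-break-word",
    "Size": "tr.po-size td span.po-break-word",
    "Color": "tr.po-color td span.po-break-word",
    "Material": "tr.po-material td span.po-break-word",
    "Compatible Devices": "tr.po-compatible_devices td span.po-break-word",
    "Connectivity Technology": "tr.po-connectivity_technology td span.po-break-word",
    "Connector Type": "tr.po-connector_type td span.po-break-word",
}

async def get_overview_field(page, field):
    # Look up the selector for the requested field and fall back to
    # "Not Available" when the element is missing from the page
    try:
        element = await page.query_selector(PRODUCT_OVERVIEW_SELECTORS[field])
        return (await element.text_content()).strip()
    except Exception:
        return "Not Available"
```

This keeps one code path to maintain instead of seven near-identical functions, and adding a new field becomes a one-line change.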
Extraction of MRP of the Products
# Function to extract the MRP of the product
async def get_original_price(page):
    # Try to extract the original price from the page
    try:
        original_price = await (await page.query_selector(".a-price.a-text-price")).text_content()
        # The element text repeats the price (e.g. "₹1,999₹1,999");
        # split on the currency symbol and keep the first price
        original_price = original_price.split("₹")[1]
    # If extraction fails, leave original price as "Not Available"
    except:
        original_price = "Not Available"
    return original_price
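The raw price text still contains a currency symbol and thousands separators, which prevents numeric comparison or sorting. A small helper, sketched below, converts it to a float; the exact price format (rupee symbol, comma separators) is an assumption about how the page renders prices.

```python
def parse_price(text):
    # Strip the currency symbol and thousands separators from a price
    # string such as "₹1,999.00" and return it as a float; returns
    # None for unparseable input such as "Not Available"
    cleaned = text.replace("₹", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_price("₹1,999.00"))     # → 1999.0
print(parse_price("Not Available"))  # → None
```

Converting prices to numbers at scrape time makes it easy to compute discounts later, for example `original - offer`.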
Extraction of Offer Price of the Products
# Function to extract the offer price of the product
async def get_offer_price(page):
    # Try to extract the offer price from the page
    try:
        offer_price = await (await page.query_selector(".a-price-whole")).text_content()
    # If extraction fails, leave offer price as "Not Available"
    except:
        offer_price = "Not Available"
    return offer_price
Extraction of the Number of Reviews for the Products
# Function to extract the number of ratings of the product
async def get_num_ratings(page):
    # Try to extract the number of ratings from the page
    try:
        ratings_text = await (await page.query_selector("#acrCustomerReviewText")).text_content()
        # Keep only the count, e.g. "1,234 ratings" -> "1,234"
        num_ratings = ratings_text.split(" ")[0]
    # If extraction fails, leave number of ratings as "Not Available"
    except:
        num_ratings = "Not Available"
    return num_ratings
Extraction of Ratings of the Products
# Function to extract the star rating of the product
async def get_star_rating(page):
    # Try to extract the star rating from the page
    try:
        star_rating = await (await page.query_selector(".a-icon-alt")).text_content()
        # Keep only the rating value, e.g. "4.5 out of 5 stars" -> "4.5"
        star_rating = star_rating.split(" ")[0]
    # If extraction fails, leave star rating as "Not Available"
    except:
        star_rating = "Not Available"
    return star_rating
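If you want numeric values rather than strings, the split results can be converted with a pair of small helpers, sketched below. The input formats ("1,234 ratings" and "4.5 out of 5 stars") are assumptions about how Amazon renders these strings.

```python
def parse_num_ratings(text):
    # "1,234 ratings" -> 1234; returns None for unexpected input
    try:
        return int(text.split(" ")[0].replace(",", ""))
    except ValueError:
        return None

def parse_star_rating(text):
    # "4.5 out of 5 stars" -> 4.5; returns None for unexpected input
    try:
        return float(text.split(" ")[0])
    except ValueError:
        return None

print(parse_num_ratings("1,234 ratings"))        # → 1234
print(parse_star_rating("4.5 out of 5 stars"))   # → 4.5
```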
Extraction of Product Size
# Function to extract the size of the product
async def get_size(page):
    # Try to extract the size from the page
    try:
        size = await (await page.query_selector("tr.po-size td span.po-break-word")).text_content()
    # If extraction fails, leave size as "Not Available"
    except:
        size = "Not Available"
    return size
Extraction of colors of the products
# Function to extract the color of the product
async def get_color(page):
    # Try to extract the color from the page
    try:
        color = await (await page.query_selector("tr.po-color td span.po-break-word")).text_content()
    # If extraction fails, leave color as "Not Available"
    except:
        color = "Not Available"
    return color
Extraction of Materials of the Products
# Function to extract the material of the product
async def get_material(page):
    # Try to extract the material from the page
    try:
        material = await (await page.query_selector("tr.po-material td span.po-break-word")).text_content()
    # If extraction fails, leave material as "Not Available"
    except:
        material = "Not Available"
    return material
Extraction of Compatible Devices for the Products
# Function to extract the compatible devices of the product
async def get_compatible_devices(page):
    # Try to extract the compatible devices from the page
    try:
        compatible_devices = await (await page.query_selector("tr.po-compatible_devices td span.po-break-word")).text_content()
    # If extraction fails, leave compatible devices as "Not Available"
    except:
        compatible_devices = "Not Available"
    return compatible_devices
Extraction of Connectivity Technology for the Products
# Function to extract the connectivity technology of the product
async def get_connectivity_technology(page):
    # Try to extract the connectivity technology from the page
    try:
        connectivity_technology = await (await page.query_selector("tr.po-connectivity_technology td span.po-break-word")).text_content()
    # If extraction fails, leave connectivity technology as "Not Available"
    except:
        connectivity_technology = "Not Available"
    return connectivity_technology
Extraction of Connector Type for the Products
# Function to extract the connector type of the product
async def get_connector_type(page):
    # Try to extract the connector type from the page
    try:
        connector_type = await (await page.query_selector("tr.po-connector_type td span.po-break-word")).text_content()
    # If extraction fails, leave connector_type as "Not Available"
    except:
        connector_type = "Not Available"
    return connector_type
Request Retry with Maximum Retry Limit
Request retry is a crucial aspect of web scraping as it helps to handle temporary network errors or unexpected responses from the website. The aim is to send the request again if it fails the first time to increase the chances of success.
Before navigating to the URL, the script implements a retry mechanism in case the request times out. It uses a while loop that keeps trying to navigate to the URL until either the request succeeds or the maximum number of retries is reached, at which point the script raises an exception. This is useful when scraping web pages, as requests may occasionally time out or fail due to network issues.
# Function to perform a request and retry the request if it fails, with a maximum of 5 retries
async def perform_request_with_retry(page, link):
    MAX_RETRIES = 5
    retry_count = 0
    while retry_count < MAX_RETRIES:
        try:
            # Make a request to the link
            await page.goto(link)
            # If the request is successful, break the loop
            break
        except:
            retry_count += 1
            if retry_count == MAX_RETRIES:
                # Raise an exception if the maximum number of retries is reached
                raise Exception("Request timed out")
            # Sleep for a random duration between 1 and 5 seconds
            await asyncio.sleep(random.uniform(1, 5))
This function performs a request to a specific link using the ‘goto’ method of the page object from the Playwright library. When a request fails, the function tries again, up to the allotted number of times; the maximum number of retries is defined by the MAX_RETRIES constant as 5. Between retries, the function uses the asyncio.sleep method to wait for a random duration between 1 and 5 seconds. This prevents the code from retrying the request too quickly, which could cause it to fail even more often. The perform_request_with_retry function takes two arguments: page, the Playwright page object used to perform the request, and link, the URL to which the request is made.
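The same pattern can be written as a reusable, Playwright-independent coroutine. The sketch below is a hypothetical generalization that adds exponential backoff with jitter (each retry waits roughly twice as long as the previous one); the `flaky` coroutine simulates a request that fails twice before succeeding.

```python
import asyncio
import random

async def retry_async(coro_factory, max_retries=5, base_delay=0.01):
    # Call coro_factory() up to max_retries times, sleeping for an
    # exponentially growing, jittered delay between attempts
    for attempt in range(max_retries):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Simulated request that fails twice before succeeding
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

print(asyncio.run(retry_async(flaky)))  # → ok
```

In the scraper itself, the factory would be something like `lambda: page.goto(link)`; exponential backoff is gentler on the server than a fixed-interval retry when failures are caused by load.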
Extracting and Saving the Product Data
In the next step, we call the functions and save the data to an empty list.
# Main function to extract and save product data
async def main():
    # Start an async session with Playwright
    async with async_playwright() as pw:
        # Launch a new browser instance
        browser = await pw.chromium.launch()
        # Open a new page in the browser
        page = await browser.new_page()
        # Navigate to the Amazon deals page
        await perform_request_with_retry(page, 'https://www.amazon.in/gp/goldbox?deals-widget=%257B%2522version%2522%253A1%252C%2522viewIndex%2522%253A0%252C%2522presetId%2522%253A%252215C82F45284EDD496F94A2C368D1B4BD%2522%252C%2522sorting%2522%253A%2522BY_SCORE%2522%257D')
        # Get the links to each product
        product_links = await get_product_links(page)
        # Create an empty list to store the extracted data
        data = []
        # Iterate over the product links
        for link in product_links:
            # Load the product page
            await perform_request_with_retry(page, link)
            # Extract the product information
            product_name = await get_product_name(page)
            brand = await get_brand(page)
            star_rating = await get_star_rating(page)
            num_ratings = await get_num_ratings(page)
            original_price = await get_original_price(page)
            offer_price = await get_offer_price(page)
            color = await get_color(page)
            size = await get_size(page)
            material = await get_material(page)
            connectivity_technology = await get_connectivity_technology(page)
            connector_type = await get_connector_type(page)
            compatible_devices = await get_compatible_devices(page)
            # Add the extracted data to the list
            data.append((link, product_name, brand, star_rating, num_ratings,
                         original_price, offer_price, color, size, material,
                         connectivity_technology, connector_type, compatible_devices))
        # Create a pandas dataframe from the extracted data
        df = pd.DataFrame(data, columns=['Product Link', 'Product Name', 'Brand', 'Star Rating',
                                         'Number of Ratings', 'Original Price', 'Offer Price',
                                         'Color', 'Size', 'Material', 'Connectivity_technology',
                                         'Connector_type', 'Compatible_devices'])
        # Save the data to a CSV file
        df.to_csv('product_details5.csv', index=False)
        # Notify the user that the file has been saved
        print('CSV file has been written successfully.')
        # Close the browser instance
        await browser.close()
We use an asynchronous function, ‘main’, that scrapes product information from the Amazon Today's Deals page. The function starts by launching a new browser instance (Chromium, in our case) using Playwright and opening a new page in the browser. We then navigate to the Amazon Today's Deals page using the perform_request_with_retry function, which requests the link and retries the request if it fails, with a maximum of 5 retries (the number of retries can be changed). This ensures that the request to the Today's Deals page succeeds.
Once the deals page is loaded, we extract the links to each product using the ‘get_product_links’ function. The scraper then iterates over each product link, loads the product page using ‘perform_request_with_retry’, extracts all the information, and stores it as a tuple. The tuples are used to create a Pandas dataframe, which is exported to a CSV file using the dataframe's to_csv method.
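The tuple-to-CSV step can be seen in isolation with a couple of illustrative rows (the URLs, names, and brands below are made up; real rows come from the scraping loop and carry all thirteen columns).

```python
import pandas as pd

# Illustrative rows only; the real scraper appends one tuple per product
data = [
    ("https://www.amazon.in/dp/B0EXAMPLE", "Acoustic Guitar", "SomeBrand"),
    ("https://www.amazon.in/dp/B0OTHER", "Digital Keyboard", "OtherBrand"),
]
# Column names must match the tuple positions, in order
df = pd.DataFrame(data, columns=["Product Link", "Product Name", "Brand"])
# index=False keeps the row index out of the output file
df.to_csv("product_details_sample.csv", index=False)
print(len(df))  # → 2
```

Passing `index=False` to `to_csv` is worth noting: without it, pandas writes an extra unnamed column containing the row numbers.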
Finally, we call the ‘main’ function:
# Entry point to the script
if __name__ == '__main__':
    asyncio.run(main())
The ‘asyncio.run(main())’ statement is used to run the main function as an asynchronous coroutine.
Scraping data from Amazon's Today's Deals section can be a useful technique to gather information about the products being offered at a discounted price. In this blog post, we explored how to use Playwright Python to scrape data from the musical instruments section of Today's Deals on Amazon. Following the steps outlined in this tutorial, you can easily adapt the code to scrape data from other sections of Amazon or other websites.
However, it is important to note that web scraping is a controversial practice, and it may be prohibited by the website you are scraping from. Always make sure to check the website's terms of service before attempting to scrape data from it, and respect any restrictions or limitations that they may have in place.
Overall, web scraping can be a powerful tool for gathering data and automating tasks, but it should be used ethically and responsibly. By following best practices and being respectful of website policies, you can use web scraping to your advantage and gain valuable insights from the data you collect.
Looking to acquire Amazon data for your brand? Contact Datahut, your web scraping experts.