By Thasni M A

Scraping Amazon Product Category Without Getting Blocked



Web scraping is a powerful tool for extracting data from the internet, but it can be a daunting task to do it at scale without running into blocking issues. In this tutorial, we'll be sharing tips and tricks to help you scrape Amazon product categories without getting blocked.


To achieve this, we'll be using Playwright, an open-source Python library that enables developers to automate web interactions and extract data from web pages. With Playwright, you can easily navigate through web pages, interact with elements like forms and buttons, and extract data in a headless or visible browser environment. The best part is that Playwright is cross-browser compatible, which means the same script can run against Chromium, Firefox, and WebKit (the engine behind Safari). Plus, Playwright provides robust error handling, such as configurable timeouts and automatic waiting, making it easier to overcome common web scraping challenges like timeouts and network errors.
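Before we build the full scraper, here is a minimal sketch of what an async Playwright session looks like, so the functions later in this tutorial are easier to follow. The URL and the printed title here are placeholders for illustration only:

import asyncio
from playwright.async_api import async_playwright

async def demo():
    async with async_playwright() as pw:
        # Launch Firefox headlessly; pass headless=False to watch the browser work
        browser = await pw.firefox.launch(headless=True)
        page = await browser.new_page()
        # Navigate to a page and read a simple property as a sanity check
        await page.goto('https://example.com')
        print(await page.title())
        await browser.close()

asyncio.run(demo())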


In this tutorial, we'll walk you through the steps to scrape air fryer data from Amazon using Playwright in Python and save it as a CSV file. By the end of this tutorial, you'll have a good understanding of how to scrape Amazon product categories without getting blocked and how to use Playwright to automate web interactions and extract data efficiently.


We will be extracting the following data attributes from the individual product pages on Amazon:

  • Product URL - The URL of the resulting air fryer product.

  • Product Name - The name of the air fryer product.

  • Brand - The brand of the air fryer product.

  • MRP - The Maximum Retail Price (MRP) of the air fryer product.

  • Sale Price - Sale price of the air fryer product.

  • Number of Reviews - The number of reviews of the air fryer product.

  • Ratings - The ratings of the air fryer products.

  • Best Sellers Rank - The rank of the air fryer products, which includes the Home & Kitchen rank and the Air Fryers (or Deep Fat Fryers) rank.

  • Technical Details - The technical details of air fryer products which include information such as wattage, capacity, color, etc.

  • About this item - The description of the air fryer products.

Here's a step-by-step guide for using Playwright in Python to scrape air fryer data from Amazon.


Also Read: How To Scrape Amazon Data Using Python Scrapy


Importing Required Libraries


To start our process, we will need to import a number of required libraries that will enable us to interact with the website and extract the information we need.


# Import necessary libraries
import re
import random
import asyncio
import datetime
import pandas as pd
from playwright.async_api import async_playwright

Here, we import the various Python modules and libraries that are required for the operations that follow.

  • ‘re’ - The ‘re’ module is used for working with regular expressions.

  • ‘random’ - The ‘random’ module is used for generating random numbers; here it helps randomize the delay between retries.

  • ‘asyncio’ - The ‘asyncio’ module handles asynchronous programming in Python, which is necessary when using the asynchronous API of Playwright.

  • ‘datetime’ - The ‘datetime’ module is used for working with dates and times, offering functionality such as creating and manipulating date and time objects and formatting them as strings.

  • ‘pandas’ - The ‘pandas’ library is used for data manipulation and analysis. In this tutorial, it stores the data scraped from the product pages and writes it to a CSV file.

  • ‘async_playwright’ - The ‘async_playwright’ module is the asynchronous API of Playwright, an open-source browser automation library used here for web scraping.


Together, these libraries let the script generate randomized delays, manage asynchronous execution, store and manipulate the scraped data, and automate browser interactions with Playwright.


Extraction of Product URLs


The second step is extracting the URLs of the resulting air fryer products. Product URL extraction is the process of collecting and organizing the URLs of products listed on a web page or online platform.


Before we start scraping product URLs, it is important to consider some points to ensure that we are doing it in a responsible and effective way:

  • Ensure that our scraped product URLs are in a standardized format; we can follow the format "https://www.amazon.in/+product name+/dp/ASIN". This format includes the website's domain name, the product name (with no spaces), and the product's unique ASIN (Amazon Standard Identification Number) at the end of the URL. This standardized format makes it easier to organize and analyze the scraped data and also ensures that the URLs are consistent and easy to understand (a short normalization sketch follows this list).

  • When scraping data for air fryers from Amazon, it is important to ensure that the scraped data only contains information about air fryers and not accessories that are often displayed alongside them in search results. To achieve this, it may be necessary to filter the data based on specific criteria, such as product category or keywords in the product title or description. By carefully filtering the scraped data, we can ensure that we only retrieve information about the air fryers themselves, which will make the data more useful and relevant for our purposes.

  • When scraping for product URLs, it may be necessary to navigate through multiple pages by clicking on the "Next" button at the bottom of the webpage to access all the results. However, there may be situations where clicking the "next" button will not load the next page, which can cause errors in our scraping process. To avoid this situation, we can implement error-handling mechanisms such as timeouts, retries, and checks to ensure that the next page is fully loaded before scraping its data. By taking these precautions, we can effectively and efficiently scrape all the resultant products from multiple pages while minimizing errors and respecting the website's resources.

By considering these points, we can ensure that we are scraping product URLs in a responsible and effective way while ensuring data quality.
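To make that canonical URL form concrete, here is a minimal, hedged sketch of one way to normalize a raw result link with a regular expression. The helper name and the sample link are purely illustrative; the actual scraper below achieves the same goal with string splitting instead:

import re

def normalize_amazon_url(raw_url):
    # Match the '/<product-name>/dp/<10-character ASIN>' pattern in the link
    match = re.search(r'/([^/]+)/dp/([A-Z0-9]{10})', raw_url)
    if match:
        product_name, asin = match.groups()
        return f'https://www.amazon.in/{product_name}/dp/{asin}'
    return None  # not a product detail link

# Example with a hypothetical search-result link carrying a tracking suffix
print(normalize_amazon_url('/Philips-Digital-Air-Fryer/dp/B07D8S32XQ/ref=sr_1_1'))
# -> https://www.amazon.in/Philips-Digital-Air-Fryer/dp/B07D8S32XQ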

async def get_product_urls(browser, page):
    # Select all elements with the product urls
    all_items = await page.query_selector_all('.a-link-normal.s-underline-text.s-underline-link-text.s-link-style.a-text-normal')
    product_urls = set()
    # Loop through each item and extract the href attribute
    for item in all_items:
        url = await item.get_attribute('href')
        # If the link contains '/ref' 
        if '/ref' in url:
            # Extract the base URL
            full_url = 'https://www.amazon.in' + url.split("/ref")[0]
        # If the link contains '/sspa/click?ie'
        elif '/sspa/click?ie' in url:
            # Extract the product ID and clean the URL
            product_id = url.split('%2Fref%')[0]
            clean_url = product_id.replace("%2Fdp%2F", "/dp/")
            urls = clean_url.split('url=%2F')[1]
            full_url = 'https://www.amazon.in/' + urls
        # If the link doesn't contain either '/sspa/click?ie' or '/ref'
        else:
            # Use the original URL
            full_url = 'https://www.amazon.in' + url

        if not any(substring in full_url for substring in ['Basket', 'Accessories', 'accessories', 'Disposable', 'Paper', 'Reusable', 'Steamer', 'Silicone', 'Liners', 'Vegetable-Preparation', 'Pan', 'parchment', 'Parchment', 'Cutter', 'Tray', 'Cheat-Sheet', 'Reference-Various', 'Cover', 'Crisper', 'Replacement']):
            product_urls.add(full_url)
            # Use add instead of append to prevent duplicates

    # Check if there is a next button
    next_button = await page.query_selector("a.s-pagination-item.s-pagination-next.s-pagination-button.s-pagination-separator")
    if next_button:
        # If there is a next button, click on it
        is_button_clickable = await next_button.is_enabled()
        if is_button_clickable:
            await next_button.click()
            # Wait for the next page to load
            await page.wait_for_selector('.a-link-normal.s-underline-text.s-underline-link-text.s-link-style.a-text-normal')
            # Recursively call the function to extract links from the next page
            product_urls.update(await get_product_urls(browser, page))  
        else:
            print("Next button is not clickable")  

    num_products = len(product_urls)
    print(f"Scraped {num_products} products.")

    return list(product_urls)

Here, we are using the Python function ‘get_product_urls’ to extract product links from a web page. The function uses the Playwright library to automate the browser and extract the resulting product URLs from an Amazon search results page.


The function first selects all elements on the page that contain product links, using a CSS selector, and initializes an empty set to store unique product URLs. It then loops through each element, extracts the href attribute, and cleans the link according to the conditions shown above. Next, the function checks whether there is a "Next" button on the page. If there is, it clicks the button, waits for the following page to load, and recursively calls itself to extract URLs from that page, continuing until all relevant product URLs have been collected.


After cleaning a link, the function checks whether it contains any unwanted substrings such as "Basket" or "Accessories", which indicate accessory listings rather than air fryers. If not, it adds the cleaned URL to the set of product URLs. Finally, the function returns the unique product URLs as a list.


Also Read: 5 Major Challenges That Make Amazon Data Scraping Painful

Amazon Air Fryer Data Extraction


In this step, we identify the attributes we want to extract from the website: the Product Name, Brand, Number of Reviews, Ratings, MRP, Sale Price, Best Sellers Rank, Technical Details, and the "About this item" description of each Amazon air fryer product.


Extracting Product Name

The next step is the extraction of the names of each product from the corresponding web pages. The names of each product are important because they give the customers a quick overview of what each product is, its features, and its intended use. The goal of this step is to select the elements on a web page that contain the product name, and extract the text content of those elements.

async def get_product_name(page):
    try:
        # Find the product title element and get its text content
        product_name_elem = await page.query_selector("#productTitle")
        product_name = await product_name_elem.text_content()
    except:
        # If an exception occurs, set the product name as "Not Available"
        product_name = "Not Available"

    # Remove any leading/trailing whitespace from the product name and return it
    return product_name.strip()

In order to extract the names of products from web pages, we utilize the asynchronous function 'get_product_name', which operates on a single page object. The function first locates the product's title element on the page by calling the 'query_selector()' method of the page object and passing in the appropriate CSS selector. Once the element is found, the function employs the 'text_content()' method to retrieve the text content of the element, which is then stored in the 'product_name' variable.


In cases where the function is unable to find or retrieve the product name of a particular item, it handles exceptions by setting the product name to "Not Available" in the 'product_name' variable. This approach ensures that our web scraping script can continue to run smoothly even if it encounters unexpected errors during the data extraction process.


Extracting Brand Name

When it comes to web scraping, extracting the name of the brand associated with a particular product is an important step in identifying the manufacturer or company that produces the product. The process of extracting brand names is similar to that of product names - we search for the relevant elements on the page using a CSS selector and then extract the text content from those elements.


However, there are a couple of different formats in which the brand information may appear on the page. For instance, the brand name might be preceded by the text "Brand: 'brand name'", or it might appear as "Visit the 'brand name' Store". In order to extract the name of the brand accurately, we need to filter out these extraneous elements and retrieve only the actual brand name.


To achieve this, we can use regular expressions or string manipulation functions in our web scraping script. By filtering out the unnecessary text and extracting only the brand name, we can ensure that our brand extraction process is both accurate and efficient.

async def get_brand_name(page):
    try:
        # Find the brand name element and get its text content
        brand_name_elem = await page.query_selector('#bylineInfo_feature_div .a-link-normal')
        brand_name = await brand_name_elem.text_content()

        # Remove unwanted words ("Visit", "the", "Store", "Brand:") using word-boundary
        # regexes so matching substrings inside actual brand names are left intact
        brand_name = re.sub(r'\bVisit\b|\bthe\b|\bStore\b|Brand:', '', brand_name).strip()
    except:
        # If an exception occurs, set the brand name as "Not Available"
        brand_name = "Not Available"

    # Return the cleaned up brand name
    return brand_name

To extract the brand name from the web pages, we can use a similar function to the one we used for extracting the product name. In this case, the function is called 'get_brand_name', and it works by trying to locate the element that contains the brand name using a CSS selector.


If the element is found, the function extracts the text content of that element using the 'text_content()' method, and assigns it to a 'brand_name' variable. However, it's important to note that the extracted text may contain extraneous information such as "Visit", "the", "Store" and "Brand:" that needs to be removed using regular expressions.

By filtering out these unwanted words, we can obtain the actual brand name and ensure that our data is accurate. If the function encounters an exception during the process of finding the brand name element or extracting its text content, it will return the brand name as "Not Available".


By using this function in our web scraping script, we can extract the brand names of the products we are interested in and gain a better understanding of the manufacturers and companies behind these products.


Similarly, we can extract the other attributes such as MRP and Sale price. We can apply the same technique to extract these two attributes.


Extracting MRP of the Products

To accurately evaluate the value of a product, it is necessary to extract the Maximum Retail Price (MRP) of the product from its corresponding web page. This information is valuable for both retailers and customers, as it enables them to make informed decisions about purchases. Extracting the MRP of a product involves a similar process to that of extracting the product name.

async def get_MRP(page):
    try:
        # Get MRP element and extract text content
        MRP_element = await page.query_selector(".a-price.a-text-price")
        MRP = await MRP_element.text_content()
        MRP = MRP.split("₹")[1]
    except:
        # Set MRP to "Not Available" if element not found or text content cannot be extracted
        MRP = "Not Available"
    return MRP


Extracting Sale Price of the Products

The sale price of a product is a crucial factor that can help customers make informed purchasing decisions. By extracting the sale price of a product from a webpage, customers can easily compare prices across different platforms and find the best deal available. This information is especially important for budget-conscious shoppers who want to ensure that they are getting the best value for their money.

async def get_sale_price(page):
    try:
        # Get sale price element and extract text content
        sale_price_element = await page.query_selector(".a-price-whole")
        sale_price = await sale_price_element.text_content()
    except:
        # Set sale price to "Not Available" if element not found or text content cannot be extracted
        sale_price = "Not Available"
    return sale_price

Extracting Product Ratings

The next step in our data extraction process is to obtain the star ratings for each product from their corresponding web pages. These ratings are given by customers on a scale of 1 to 5 stars and can provide valuable insights into the quality of the products. However, it is important to keep in mind that not all products will have ratings or reviews. In such cases, the website may indicate that the product is "New to Amazon" or has "No Reviews". This could be due to various reasons such as limited availability, low popularity or the product being new to the market and not yet reviewed by customers. Nonetheless, the extraction of star ratings is a crucial step in helping customers make informed purchasing decisions.

async def get_star_rating(page):
    try:
        # Find the star rating element and get its text content
        star_rating_elem = await page.wait_for_selector(".a-icon-alt")
        star_rating = await star_rating_elem.inner_text()
        star_rating = star_rating.split(" ")[0]
    except:
        try:
            # If the previous attempt failed, check if there are no reviews for the product
            star_ratings_elem = await page.query_selector("#averageCustomerReviews #acrNoReviewText")
            star_rating = await star_ratings_elem.inner_text()
        except:
            # If all attempts fail, set the star rating as "Not Available"
            star_rating = "Not Available"

    # Return the star rating
    return star_rating

To extract the star rating of a product from a web page, the function 'get_star_rating' is utilized. Initially, the function attempts to locate the star rating element on the page using a CSS selector that targets the element containing the star ratings. The 'page.wait_for_selector()' method is used for this purpose. If the element is successfully located, the function retrieves the inner text content of the element utilizing the 'star_rating_elem.inner_text()' method.


However, if an exception occurs during the process of locating the star rating element or extracting its text content, the function employs an alternate approach to check if there are no reviews for the product. To do this, it attempts to locate the element with the ID that contains the no reviews utilizing the 'page.query_selector()' method. If this element is successfully located, the text content of the element is assigned to the 'star_rating' variable.


If both of these attempts fail, the function enters the second exception block and sets the star rating as "Not Available" without attempting to extract any rating information. This ensures that the user is notified of the unavailability of the star rating for the product in question.


Extracting the Number of Reviews for the Products

Extracting the number of reviews of each product is a crucial step in analyzing the popularity and customer satisfaction of the products. The number of reviews represents the total number of feedback or ratings provided by the customers for a particular product. This information can help customers make informed purchasing decisions and understand the level of satisfaction or dissatisfaction of previous buyers.


However, it's important to keep in mind that not all products may have reviews. In such cases, the website may indicate "No Reviews" or "New to Amazon" instead of the number of reviews on the product page. This could be because the product is new to the market or has not yet been reviewed by customers, or it may be due to other reasons such as low popularity or limited availability.

async def get_num_reviews(page):
    try:
        # Find the number of reviews element and get its text content
        num_reviews_elem = await page.query_selector("#acrCustomerReviewLink #acrCustomerReviewText")
        num_reviews = await num_reviews_elem.inner_text()
        num_reviews = num_reviews.split(" ")[0]
    except:
        try:
            # If the previous attempt failed, check if there are no reviews for the product
            no_review_elem = await page.query_selector("#averageCustomerReviews #acrNoReviewText")
            num_reviews = await no_review_elem.inner_text()
        except:
            # If all attempts fail, set the number of reviews as "Not Available"
            num_reviews = "Not Available"

    # Return the number of reviews
    return num_reviews

The function 'get_num_reviews' plays an important role in extracting the number of reviews for products from web pages. First, the function looks for an element that contains the review count using a CSS selector that targets the element with an ID containing this information. If the function successfully locates this element, it extracts the text content using the 'inner_text' method and stores it in a variable called 'num_reviews'. However, if the initial attempt fails, the function will try to locate an element that indicates there are no reviews for the product.


If this element is found, the function extracts the text content using the 'inner_text()' method and assigns it to the 'num_reviews' variable. In cases where both attempts fail, the function will return "Not Available" as the value of 'num_reviews' to indicate that the review count was not found on the web page.


It's important to note that not all products may have reviews, which could be due to various reasons such as newness to the market, low popularity, or limited availability. Nonetheless, the review count is a valuable piece of information that can provide insights into a product's popularity and customer satisfaction.


Extracting Best Sellers Rank of the products

Extracting the Best Sellers Rank is a crucial step in analyzing the popularity and sales of products on online marketplaces such as Amazon. The Best Sellers Rank is a metric that Amazon uses to rank the popularity of products within their category. This metric is updated hourly and takes into account several factors, including recent sales of the product, customer reviews, and ratings. The rank is displayed as a number, with lower numbers indicating higher popularity and higher sales volume.


For example, when extracting the Best Sellers Rank for air fryer products, we can obtain two values: the Home & Kitchen rank and the Air Fryers rank (or Fat Fryers rank) based on the category in which the product falls. By extracting the Best Sellers Rank, we can gain valuable insights into the performance of the products in the market. This information can help customers choose products that are popular and well-reviewed, allowing them to make informed purchasing decisions.

async def get_best_sellers_rank(page):
    try:
        # Try to get the Best Sellers Rank element
        best_sellers_rank = await (await page.query_selector("tr th:has-text('Best Sellers Rank') + td")).text_content()

        # Split the rank string into individual ranks
        ranks = best_sellers_rank.split("#")[1:]

        # Initialize the home & kitchen and air fryers rank variables
        home_kitchen_rank = ""
        air_fryers_rank = ""

        # Loop through each rank and assign the corresponding rank to the appropriate variable
        for rank in ranks:
            if "in Home & Kitchen" in rank:
                home_kitchen_rank = rank.split(" ")[0].replace(",", "")
            elif "in Air Fryers" or "in Deep Fat Fryers" in rank:
                air_fryers_rank = rank.split(" ")[0].replace(",", "")
    except:
        # If the Best Sellers Rank element is not found, assign "Not Available" to both variables
        home_kitchen_rank = "Not Available"
        air_fryers_rank = "Not Available"

    # Return the home & kitchen and air fryers rank values
    return home_kitchen_rank, air_fryers_rank

The function 'get_best_sellers_rank' plays a crucial role in extracting Best Sellers Rank information from web pages. To begin, the function attempts to locate the Best Sellers Rank element on the page using a specific CSS selector that targets the td element following a th element containing the text "Best Sellers Rank". If the element is successfully located, the function extracts its text content using the 'text_content()' method, assigns it to the 'best_sellers_rank' variable, and splits it on the '#' character into individual rank strings.


Next, the code loops through each individual rank and assigns the corresponding rank to the appropriate variable. This ensures that if the rank contains the string "in Home & Kitchen", it is assigned to the home_kitchen_rank variable. Similarly, if the rank contains the string "in Air Fryers" or "in Deep Fat Fryers", it is assigned to the air_fryers_rank variable. These variables are important as they provide valuable insights into the product's popularity in the specific category.


However, if the Best Sellers Rank element is not found on the page, the function assigns the value "Not Available" to both the home_kitchen_rank and air_fryers_rank variables, indicating that the rank information could not be extracted from the page.


Also Read: Scraping Amazon Best Seller Data using Python: A Step-by-Step Guide


Extracting Technical Details of the products

When browsing through online marketplaces such as Amazon, customers often rely on the technical details provided in product listings to make informed purchasing decisions. These details can offer valuable insights into a product's features, performance, and compatibility. Technical details can vary from product to product but often include information such as dimensions, weight, material, power output, and operating system.


The process of extracting technical details from product listings can be a crucial factor for customers who are looking for specific features or are comparing products. By analyzing and comparing these details, customers can evaluate different products based on their specific needs and preferences, ultimately helping them make the best purchasing decision.

async def get_technical_details(page):
    try:
        # Get table containing technical details and its rows
        table_element = await page.query_selector("#productDetails_techSpec_section_1")
        rows = await table_element.query_selector_all("tr")

        # Initialize dictionary to store technical details
        technical_details = {}

        # Iterate over rows and extract key-value pairs
        for row in rows:
            # Get key and value elements for each row
            key_element = await row.query_selector("th")
            value_element = await row.query_selector("td")

            # Extract text content of key and value elements
            key = await page.evaluate('(element) => element.textContent', key_element)
            value = await page.evaluate('(element) => element.textContent', value_element)

            # Strip whitespace and unwanted characters from value and add key-value pair to dictionary
            value = value.strip().replace('\u200e', '')
            technical_details[key.strip()] = value

        # Extract required technical details (colour, capacity, wattage, country of origin)
        colour = technical_details.get('Colour', 'Not Available')
        if colour == 'Not Available':
            # Get the colour element from the page and extract its inner text
            colour_element = await page.query_selector('.po-color .a-span9')
            if colour_element:
                colour = await colour_element.inner_text()
                colour = colour.strip()

        capacity = technical_details.get('Capacity', 'Not Available')
        if capacity == 'Not Available' or capacity == 'default':
            # Get the capacity element from the page and extract its inner text
            capacity_element = await page.query_selector('.po-capacity .a-span9')
            if capacity_element:
                capacity = await capacity_element.inner_text()
                capacity = capacity.strip()

        wattage = technical_details.get('Wattage', 'Not Available')
        if wattage == 'Not Available' or wattage == 'default':
            # Get the wattage element from the page and extract its inner text
            wattage_elem = await page.query_selector('.po-wattage .a-span9')
            if wattage_elem:
                wattage = await wattage_elem.inner_text()
                wattage = wattage.strip()

        country_of_origin = technical_details.get('Country of Origin', 'Not Available')

        # Return technical details and required fields
        return technical_details, colour, capacity, wattage, country_of_origin

    except:
        # Set technical details to default values if table element or any required element is not found or text content cannot be extracted
        return {}, 'Not Available', 'Not Available', 'Not Available', 'Not Available'


The 'get_technical_details' function plays a crucial role in extracting technical details from web pages to help customers make informed purchasing decisions. The function accepts a page object and returns the full dictionary of technical details found on the page, along with the colour, capacity, wattage, and country-of-origin values. It first tries to locate the technical details table using its ID and extracts each row in the table as a list of elements. It then iterates over each row and extracts key-value pairs for each technical detail.


The function also attempts to extract specific technical details such as color, capacity, wattage, and country of origin using their respective keys. If the value for any of these technical details is "Not Available" or "default", the function attempts to locate the corresponding element on the web page and extract its inner text. If the element is found and its inner text is extracted successfully, the function returns the specific value. In case the function could not extract any of these values, it returns "Not Available" as the default value.


Extracting information about the products

Extracting the "About this item" section from product web pages is an essential step in providing a brief overview of the product's main features, benefits, and specifications. This information helps potential buyers understand what the product is, what it does, and how it differs from similar products on the market. It can also assist buyers in comparing different products and evaluating whether a particular product meets their specific needs and preferences. Obtaining this information from the product listing is crucial for making informed purchasing decisions and ensuring customer satisfaction.

async def get_bullet_points(page):
    bullet_points = []
    try:
        # Try to get the unordered list element containing the bullet points
        ul_element = await page.query_selector('#feature-bullets ul.a-vertical')

        # Get all the list item elements under the unordered list element
        li_elements = await ul_element.query_selector_all('li')

        # Loop through each list item element and append the inner text to the bullet points list
        for li in li_elements:
            bullet_points.append(await li.inner_text())
    except:
        # If the unordered list element or list item elements are not found, assign an empty list to bullet points
        bullet_points = []

    # Return the list of bullet points
    return bullet_points

The function 'get_bullet_points' extracts the bullet point information from the web page. It starts by locating the unordered list element that contains the bullet points, using a CSS selector that targets the '#feature-bullets' container holding the "About this item" section. If that list is found, the function gets all the list item elements under it using the 'query_selector_all()' method, then loops through each list item element and appends its inner text to the bullet points list. If an exception occurs while locating the list or its items, the function sets the bullet points to an empty list. Finally, the function returns the list of bullet points.


Request Retry with Maximum Retry Limit

Request retry is a crucial aspect of web scraping as it helps to handle temporary network errors or unexpected responses from the website. The aim is to send the request again if it fails the first time to increase the chances of success.


Before navigating to a URL, the script implements a retry mechanism in case the request times out. It does so by using a while loop that keeps trying to navigate to the URL until either the request succeeds or the maximum number of retries has been reached. If the maximum number of retries is reached, the script raises an exception. This is useful when scraping web pages, as requests may occasionally time out or fail due to network issues.


async def perform_request_with_retry(page, url):
    # set maximum retries
    MAX_RETRIES = 5
    # initialize retry counter
    retry_count = 0

    # loop until maximum retries are reached
    while retry_count < MAX_RETRIES:
        try:
            # try to navigate to the URL, allowing up to 80 seconds (80,000 ms) for the page to load
            await page.goto(url, timeout=80000)
            # break out of the loop if the request was successful
            break
        except:
            # if an exception occurs, increment the retry counter
            retry_count += 1
            # if maximum retries have been reached, raise an exception
            if retry_count == MAX_RETRIES:
                raise Exception("Request timed out")
            # wait for a random amount of time between 1 and 5 seconds before retrying
            await asyncio.sleep(random.uniform(1, 5))

The function 'perform_request_with_retry' is an asynchronous function used to make a request to a given URL using a page object. Within the loop, the function attempts to navigate to the URL using the 'page.goto()' method with a timeout of 80 seconds. If the request is successful, the loop is broken and the function exits. If an exception occurs during the request, such as a timeout or network error, the function tries again up to the allotted number of times. The MAX_RETRIES constant defines the maximum number of retries as 5. If the maximum number of retries has been reached, the function raises an exception with the message "Request timed out". Otherwise, it waits for a random amount of time between 1 and 5 seconds, using the 'asyncio.sleep()' method, before retrying the request; this random delay also makes the traffic pattern look less robotic, which helps avoid blocking.


Extracting and Saving the Product Data

In the next step, we call the functions and save the data to an empty list.


async def main():
    # Launch a Firefox browser using Playwright
    async with async_playwright() as pw:
        browser = await pw.firefox.launch()
        page = await browser.new_page()

        # Make a request to the Amazon search page and extract the product URLs
        await perform_request_with_retry(page, 'https://www.amazon.in/s?k=airfry&i=kitchen&crid=ADZU989EVDIH&sprefix=airfr%2Ckitchen%2C4752&ref=nb_sb_ss_ts-doa-p_3_5')
        product_urls = await get_product_urls(browser, page)
        data = []

        # Loop through each product URL and scrape the necessary information
        for i, url in enumerate(product_urls):
            await perform_request_with_retry(page, url)

            product_name = await get_product_name(page)
            brand = await get_brand_name(page)
            star_rating = await get_star_rating(page)
            num_reviews = await get_num_reviews(page)
            MRP = await get_MRP(page)
            sale_price = await get_sale_price(page)
            home_kitchen_rank, air_fryers_rank = await get_best_sellers_rank(page)
            technical_details, colour, capacity, wattage, country_of_origin = await get_technical_details(page)
            bullet_points = await get_bullet_points(page)

            # Print progress message after processing every 10 product URLs
            if i % 10 == 0 and i > 0:
                print(f"Processed {i} links.")

            # Print completion message after all product URLs have been processed
            if i == len(product_urls) - 1:
                print(f"All information for url {i} has been scraped.")

            # Add the corresponding date
            today = datetime.datetime.now().strftime("%Y-%m-%d")
            # Add the scraped information to a list
            data.append(( today, url, product_name, brand, star_rating, num_reviews, MRP, sale_price, colour, capacity, wattage, country_of_origin, home_kitchen_rank, air_fryers_rank, technical_details, bullet_points))

        # Convert the list of tuples to a Pandas DataFrame and save it to a CSV file
        df = pd.DataFrame(data, columns=['date', 'product_url', 'product_name', 'brand', 'star_rating', 'number_of_reviews', 'MRP', 'sale_price', 'colour', 'capacity', 'wattage', 'country_of_origin', 'home_kitchen_rank', 'air_fryers_rank', 'technical_details', 'description'])
        df.to_csv('product_data.csv', index=False)
        print('CSV file has been written successfully.')

        # Close the browser
        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())

In this Python script, we have utilized an asynchronous function called "main" to extract product information from Amazon pages. The script employs the Playwright library to launch the Firefox browser and navigate to the Amazon search page. Subsequently, the "get_product_urls" function is used to extract the URLs of each product from the page and store them in a list called "product_urls". The function then loops through each product URL and uses the "perform_request_with_retry" function to load the product page and extract various information such as the product name, brand, star rating, number of reviews, MRP, sale price, best sellers rank, technical details, and description.


The resulting data is then stored as a tuple in a list called "data". The function also provides progress messages after processing every 10 product URLs and a completion message after all the product URLs have been processed. The data is then converted to a Pandas DataFrame and saved as a CSV file using the "to_csv" method. Finally, the browser is closed using the "browser.close()" statement. The script is executed by calling the "main" function using the "asyncio.run(main())" statement, which runs the "main" function as an asynchronous coroutine.
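As a quick sanity check after the script finishes, you can load the CSV back into pandas and preview it. This is a minimal sketch that assumes 'product_data.csv' sits in the working directory:

import pandas as pd

# Load the scraped data and preview the first few rows
df = pd.read_csv('product_data.csv')
print(df.shape)   # (number of products, 16 columns)
print(df.head())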


Conclusion

In this guide, we walked you through the step-by-step process of scraping Amazon Air Fryer data using Playwright Python. We covered everything from setting up the Playwright environment and launching a browser to navigating to the Amazon search page and extracting essential information like product name, brand, star rating, MRP, sale price, best seller rank, technical details, and bullet points.


Our instructions are easy to follow and include extracting product URLs, looping through each URL, and using Pandas to store the extracted data in a dataframe. With Playwright's cross-browser compatibility and robust error handling, users can automate the web scraping process and extract valuable data from Amazon listings.


Web scraping can be a time-consuming and tedious task, but with Playwright Python, users can automate the process and save time and effort. By following our guide, users can quickly get started with Playwright Python and extract valuable data from Amazon Air Fryer listings. This information can be used to make informed purchasing decisions or conduct market research, making Playwright Python a valuable tool for anyone looking to gain insights into the world of e-commerce.


At Datahut, we specialize in helping our clients make informed business decisions by providing them with valuable data. Our team of experts can help you acquire the data you need, whether it's for market research, competitor analysis, lead generation, or any other business use case. We work closely with our clients to understand their specific data requirements and deliver high-quality, accurate data that meets their needs.

If you're looking to acquire data for your business, we're here to help. Contact us today to discuss your data needs and learn how we can help you make data-driven decisions that lead to business success.

