Scraping Amazon Reviews with Playwright and Python
As an Amazon seller or researcher, you may want to know what customers are saying about a particular product, but manually sifting through hundreds or even thousands of reviews can be time-consuming and inefficient. Fortunately, web scraping offers a solution to this problem.
This tutorial will help you learn how to use Playwright and Python for scraping Amazon reviews and gain insights into customer sentiment and opinions about products. We'll start by setting up our environment and installing the necessary software and libraries, including Playwright. Then, we'll dive into the scraping process, where we'll extract reviews from an Amazon product page using Playwright's powerful automation features.
So before we get started with scraping Amazon reviews, let's take a quick look at Playwright - a nifty web automation library that can make web scraping a whole lot easier.
If you're already familiar with web scraping tools like BeautifulSoup and Selenium, you'll find it easy to learn Playwright.
Playwright is a browser automation library with a Python API. A single API drives Chromium, Firefox, and WebKit, it runs efficiently in headless mode, and it handles common web scraping challenges such as dynamic, JavaScript-rendered websites.
How Playwright Uses async and await for Scraping
One of the key features of Playwright is its support for asynchronous programming using the keywords async and await.
Asynchronous programming lets a program start other tasks while one task is waiting, for example on a network response, which makes the scraping process more efficient. In contrast, synchronous programming executes one task at a time, so a slow task delays the entire program.
Asynchronous code still has to respect task dependencies: some steps can only run after an earlier step has finished. For example, when registering for a service, we can't click the register button before entering the required details, such as the user ID. This is where async and await come in handy. Placing await before a call pauses the current function until that call completes, so the steps run in the intended order. The async keyword goes before a function definition and marks it as a coroutine, which can yield control while it waits instead of blocking the rest of the program.
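The ordering guarantee can be seen in a small sketch that needs no browser. The function names fill_user_id and click_register are hypothetical stand-ins for browser actions, and the sleeps only simulate their duration. It is written as a plain script; in a notebook, replace asyncio.run(...) with await.

```python
import asyncio


async def fill_user_id(log):
    # Simulates a slow browser action, such as typing into a form field
    await asyncio.sleep(0.05)
    log.append("fill user id")


async def click_register(log):
    # Simulates a faster action that must not run before the field is filled
    await asyncio.sleep(0.01)
    log.append("click register")


async def register():
    log = []
    # Each await finishes before the next line starts, so the order is fixed
    await fill_user_id(log)
    await click_register(log)
    return log


async def register_without_ordering():
    log = []
    # gather() runs both coroutines concurrently, so the order is not enforced
    await asyncio.gather(fill_user_id(log), click_register(log))
    return log


ordered = asyncio.run(register())
unordered = asyncio.run(register_without_ordering())
print(ordered)
print(unordered)
```

With sequential awaits the user ID is always filled first. With gather() the shorter task typically finishes first, which is exactly the dependency problem await avoids.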
Playwright in Jupyter Notebook
Jupyter Notebook users need Playwright's async API. A notebook already runs an asyncio event loop, so Playwright's synchronous API cannot be used there, whereas the async API works directly because notebook cells support top-level await.
If you haven't installed Playwright already, you can do so by running the following commands in your terminal. The first command also installs pandas, which this tutorial uses to save the results. The second command downloads the Chromium build that Playwright drives; without it, launching the browser fails:
pip install playwright pandas
playwright install chromium
Great! Now that you have Playwright installed and know what it does, we can get started with scraping. Let's dive into the code and see how we can use Playwright and Python to extract reviews from Amazon product pages.
Scraping with Playwright
Before we dive into the code, let's take a moment to review the information we'll be scraping from Amazon product reviews. We'll be extracting five pieces of information for each review:
Review title: A brief headline for the customer's review of the product
Review body: The main body of detailed feedback on the product
Product color: The color of the product being reviewed
Review date: The date on which the review was posted by the customer
Rating: Numerical score (1-5 stars) given to the reviewed product
These details can provide valuable insights into customer feedback and help inform purchasing decisions. Now that we know what we're looking for, let's use Playwright and Python to extract this information from the Amazon website.
Web scraping with Playwright requires the use of certain libraries that facilitate the scraping process. Let's take a closer look at these libraries.
# Import the necessary libraries
import random
import asyncio
import pandas as pd
from datetime import datetime
from playwright.async_api import async_playwright
random: A built-in Python library used for generating pseudo-random numbers. Here it's used to add a random delay between retries when making web requests.
asyncio: A standard library for writing asynchronous code in Python. Here it's used to create and manage coroutines in the scraping process. A coroutine is a type of function that can be paused and resumed, allowing other code to run in the meantime.
pandas: A popular third-party library for data manipulation and analysis in Python. Here it's used to create a DataFrame to store the extracted review data.
datetime: A built-in Python library for working with dates and times. Here it's used to parse and format the review date.
async_playwright: The entry point of Playwright's async API. It starts Playwright and gives access to the browser types (Chromium, Firefox, and WebKit) used to launch and control a browser.
Functions for Scraping
As a best practice, it's recommended to organize code into functions, as this makes the code more modular, reusable, and easier to maintain: a change made to one function is less likely to affect the others. Here we divide the web scraping process into separate functions that handle tasks such as requesting web pages, extracting data, and saving the results.
Function to Extract Review Title
# Extract the title of a review from a review element
async def extract_review_title(review_element):
    try:
        title = await review_element.evaluate("(element) => element.querySelector('[data-hook=\"review-title\"]').innerText")
        title = title.replace("\n", "")
        title = title.strip()
    except Exception:
        # The title element is missing or could not be read
        title = "not available"
    return title
The function ‘extract_review_title’ reads the title text from a review element, removes any newline characters and leading or trailing whitespace, and returns the cleaned title as a string. If the title cannot be found, it returns "not available".
Once we have extracted the title of the review using the extract_review_title function, we can define similar functions to extract other information from the review element, such as the review body, the review date, the rating, and the reviewed product colour.
Function to Extract Review Body
# Extract the body of a review from a review element
async def extract_review_body(review_element):
    try:
        body = await review_element.evaluate("(element) => element.querySelector('[data-hook=\"review-body\"]').innerText")
        body = body.replace("\n", "")
        body = body.strip()
    except Exception:
        # The body element is missing or could not be read
        body = "not available"
    return body
The function ‘extract_review_body’ extracts the body of a review from a review element in the same way as the review title: it reads the text, removes newlines and surrounding whitespace, and falls back to "not available".
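Because the title and body extractors differ only in the data-hook value, they could share one helper. The following is a sketch rather than part of the original script: extract_text is a hypothetical name, and FakeReviewElement is a test double that only imitates the evaluate method of a Playwright element so the example runs without a browser.

```python
import asyncio


async def extract_text(review_element, data_hook, default="not available"):
    """Return the cleaned inner text of the child matching a data-hook value."""
    script = f"(element) => element.querySelector('[data-hook=\"{data_hook}\"]').innerText"
    try:
        text = await review_element.evaluate(script)
    except Exception:
        # The child element is missing or could not be read
        return default
    return text.replace("\n", "").strip()


class FakeReviewElement:
    """Test double standing in for a Playwright element (not part of Playwright)."""

    def __init__(self, texts):
        self.texts = texts

    async def evaluate(self, script):
        # Return the stored text whose data-hook appears in the script
        for hook, text in self.texts.items():
            if f'data-hook="{hook}"' in script:
                return text
        raise ValueError("no matching element")


element = FakeReviewElement({"review-title": "  Great sound\n"})
title = asyncio.run(extract_text(element, "review-title"))
body = asyncio.run(extract_text(element, "review-body"))
print(title)
print(body)
```

Here the title is cleaned to "Great sound", while the missing body falls back to "not available".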
Function to Extract Product Colour
# Extract the colour of the product reviewed from a review element
async def extract_product_colour(review_element):
    try:
        colour = await review_element.evaluate("(element) => element.querySelector('[data-hook=\"format-strip\"]').innerText")
        # Drop the label so only the colour name remains
        colour = colour.replace("Colour: ", "")
    except Exception:
        colour = "not available"
    return colour
This function ‘extract_product_colour’ extracts and returns the colour of the product that was being reviewed or returns "not available" if the colour cannot be found. The replace method is used to remove the "Colour: " prefix from the text, leaving just the actual colour name.
Function to Extract Review Date
# Extract the date of a review from a review element
async def extract_review_date(review_element):
    try:
        date = await review_element.evaluate("(element) => element.querySelector('[data-hook=\"review-date\"]').innerText")
        # Keep the last three words: day, month name, and year
        date = date.split()[-3:]
        date = " ".join(date)
        date = datetime.strptime(date, '%d %B %Y')
        date = date.strftime('%d %B %Y')
    except Exception:
        date = "not available"
    return date
The function ‘extract_review_date’ extracts the date of a review from a review element i.e., the date when the customer wrote the review. It then cleans the extracted date by converting it to a datetime object and reformatting it to a desired date string format.
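The date handling can be tried on its own, without a browser. This sketch assumes the review-date text ends with a day, a month name, and a year, as in "Reviewed in India on 12 March 2023"; parse_review_date is a hypothetical helper name, and both sample strings are illustrative.

```python
from datetime import datetime


def parse_review_date(raw_text):
    """Pull the trailing 'day month year' out of the review-date text."""
    try:
        date_part = " ".join(raw_text.split()[-3:])
        parsed = datetime.strptime(date_part, "%d %B %Y")
    except ValueError:
        # The text does not end in the expected 'day month year' pattern
        return "not available"
    return parsed.strftime("%d %B %Y")


print(parse_review_date("Reviewed in India on 12 March 2023"))
print(parse_review_date("Reviewed in the United States on March 12, 2023"))
```

The first call prints "12 March 2023". The second prints "not available" because the month comes before the day, which shows that the format string has to match the marketplace being scraped.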
Function to Extract Rating
# Extract the rating of a review from a review element
async def extract_rating(review_element):
    try:
        ratings = await review_element.evaluate("(element) => element.querySelector('[data-hook=\"review-star-rating\"]').innerText")
        # The text reads like "5.0 out of 5 stars"; keep only the leading number
        rating = ratings.split()[0]
    except Exception:
        rating = "not available"
    return rating
The function ‘extract_rating’ extracts the rating of a review from a review element. The text of the rating element contains more than the score itself, so the split method is used to break it into words and isolate the numerical value (e.g., "5.0" for a 5-star rating).
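If the rating is needed as a number rather than a string, the same split can feed a conversion. A minimal sketch, assuming the rating text looks like "4.0 out of 5 stars"; parse_rating is a hypothetical helper, not part of the original script.

```python
def parse_rating(raw_text):
    """Return the leading number of a rating string as a float, or None."""
    try:
        return float(raw_text.split()[0])
    except (IndexError, ValueError):
        # Empty text, or a first word that is not a number
        return None


print(parse_rating("4.0 out of 5 stars"))
print(parse_rating("not available"))
```

Returning None for unreadable ratings keeps the column numeric once the data is loaded for analysis.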
Function for Performing Web Requests with Retry
# Perform a request and retry it if it fails
async def perform_request_with_retry(page, link):
    MAX_RETRIES = 5
    retry_count = 0
    while retry_count < MAX_RETRIES:
        try:
            await page.goto(link)
            break
        except Exception:
            retry_count += 1
            if retry_count == MAX_RETRIES:
                raise Exception("Request timed out")
            # Wait a random interval before the next attempt
            await asyncio.sleep(random.uniform(1, 5))
The perform_request_with_retry function is an asynchronous function that makes a web request using Playwright's page.goto() method. If the request fails, the function retries it, up to 5 attempts in total, with a random delay of between 1 and 5 seconds before each retry. If all attempts fail, the function raises an exception indicating that the request has timed out. The asyncio.sleep() function introduces the delay between retries, and the random.uniform() function generates a random delay within the specified range.
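The retry pattern is not specific to page.goto(). The sketch below generalizes it to any coroutine and exercises it with a simulated failure; retry_async and flaky_goto are hypothetical names, and the delays are shortened so the example finishes quickly.

```python
import asyncio
import random


async def retry_async(action, max_retries=5, min_delay=1.0, max_delay=5.0):
    """Await action() until it succeeds, sleeping a random interval between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return await action()
        except Exception:
            if attempt == max_retries:
                raise  # Out of attempts: surface the last error
            await asyncio.sleep(random.uniform(min_delay, max_delay))


attempts = 0


async def flaky_goto():
    # Stand-in for page.goto(): fails twice, then succeeds
    global attempts
    attempts += 1
    if attempts < 3:
        raise TimeoutError("simulated timeout")
    return "loaded"


result = asyncio.run(retry_async(flaky_goto, min_delay=0.01, max_delay=0.02))
print(result, attempts)
```

Re-raising the last error, instead of a generic exception, keeps the original failure reason visible when every attempt fails.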
Function for Extracting all Reviews from Multiple Pages
# Extract all reviews from multiple pages of the URL
async def extract_reviews(page):
    reviews = []
    while True:
        # Wait for the reviews to be loaded
        await page.wait_for_selector("[data-hook='review']")
        # Get the reviews
        review_elements = await page.query_selector_all("[data-hook='review']")
        for review_element in review_elements:
            review_title = await extract_review_title(review_element)
            review_body = await extract_review_body(review_element)
            product_colour = await extract_product_colour(review_element)
            review_date = await extract_review_date(review_element)
            rating = await extract_rating(review_element)
            reviews.append((product_colour, review_title, review_body, review_date, rating))
        # Find the next page button
        next_page_button = await page.query_selector("[class='a-last']")
        if not next_page_button:
            break
        # Click the next page button
        await page.click("[class='a-last']")
    return reviews
This function extracts all reviews from multiple pages of the URL. It first waits for the reviews to be loaded and then extracts information such as the review title, review body, product colour, review date, and rating for each review element on the page. This is done by calling the previously defined functions ‘extract_review_title’, ‘extract_review_body’, ‘extract_product_colour’, ‘extract_review_date’, and ‘extract_rating’. The extracted data is then appended to a list of reviews.
The function then looks for the next page button and clicks it to navigate to the next page of reviews, and the process is repeated until there are no more reviews to extract. Finally, the function returns a list of tuples containing the extracted data for each review. Overall, this function integrates the previously defined functions to extract all the relevant information from multiple pages of the Amazon product reviews.
Function to Save Extracted Reviews to a CSV File
# Save the extracted reviews to a csv file
async def save_reviews_to_csv(reviews):
    data = pd.DataFrame(reviews, columns=['product_colour', 'review_title', 'review_body', 'review_date', 'rating'])
    data.to_csv('amazon_product_reviews15.csv', index=False)
The function ‘save_reviews_to_csv’ takes in a list of reviews as input and saves them to a CSV file named 'amazon_product_reviews15.csv' with columns 'product_colour', 'review_title', 'review_body', 'review_date', and 'rating' using the pandas library.
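To see how the review tuples map onto the columns, here is a sketch that produces the same layout using only the standard library csv module instead of pandas, so it runs without third-party packages. The function name reviews_to_csv_text and the sample review are made up for illustration.

```python
import csv
import io

# Same column order as the tuples built while extracting the reviews
COLUMNS = ["product_colour", "review_title", "review_body", "review_date", "rating"]


def reviews_to_csv_text(reviews):
    """Serialize review tuples to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(reviews)
    return buffer.getvalue()


# Made-up sample review, for illustration only
sample = [("Black", "Great sound", "Good bass, fits well", "12 March 2023", "5.0")]
print(reviews_to_csv_text(sample))
```

The review body contains a comma, so the writer wraps that field in quotes. pandas applies the same quoting when it writes the file.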
Asynchronous Web Scraping of Amazon Product Reviews using Playwright
# Asynchronous Web Scraping of Amazon Product Reviews using Playwright
async def main():
    p = await async_playwright().start()
    browser = await p.chromium.launch()
    page = await browser.new_page()
    url = "https://www.amazon.in/boAt-Airdopes-191G-Wireless-Appealing/product-reviews/B09X76VL5L/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
    await perform_request_with_retry(page, url)
    reviews = await extract_reviews(page)
    await save_reviews_to_csv(reviews)
    await browser.close()
    # Stop the Playwright instance started above
    await p.stop()

# Execute the scraping and saving of Amazon product reviews
# (top-level await works in a Jupyter notebook cell)
await main()
The ‘main’ function is the primary function in this web scraping procedure, which orchestrates the whole process.
This function starts an instance of Playwright, launches a headless Chromium browser, and creates a new page to visit the URL of the product reviews. A headless browser runs without a user interface. This makes the scraping process faster and lighter, as no window has to be drawn on screen. Chromium is a popular choice for web scraping due to its speed and efficient memory usage.
Next, the ‘perform_request_with_retry’ function is called to load the page, retrying the request in case of network errors. Once the request is successful, the ‘extract_reviews’ function is called to extract all the reviews of the product, and the ‘save_reviews_to_csv’ function is called to save the reviews to a CSV file.
Finally, the script closes the browser and completes the asynchronous web scraping process. The ‘main’ function is called at the end of the script to start the web scraping process and extract the reviews from the Amazon product review page.
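Outside a notebook, a coroutine cannot be awaited at the top level of a script and has to be started with asyncio.run(). The sketch below shows that entry point together with Playwright's context-manager form, which stops Playwright even if an error occurs. The names fetch_page_title and run are hypothetical, and the function only fetches a page title rather than running the full scraper.

```python
import asyncio


async def fetch_page_title(url, headless=True):
    """Open url in Chromium and return the page title."""
    # Imported here so this module loads even where Playwright is not installed
    from playwright.async_api import async_playwright

    # The context manager stops Playwright even if an error is raised
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=headless)
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.title()
        finally:
            await browser.close()


def run(url):
    # Outside a notebook there is no running event loop, so start one explicitly
    return asyncio.run(fetch_page_title(url))
```

In a script, call run(url); in a notebook cell, use await fetch_page_title(url) instead, because asyncio.run() cannot be called while the notebook's event loop is already running.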
In conclusion, Playwright has proven to be a fast and efficient tool for web scraping of Amazon product reviews, making it a viable alternative to other popular scraping tools like BeautifulSoup and Selenium. Its asynchronous and headless nature makes it easy to handle multiple requests simultaneously and scrape data quickly.
If you're interested in web scraping or data extraction, Playwright can be an excellent tool to learn and experiment with. It provides a rich set of APIs, robustness, and an excellent developer experience. So, don't hesitate to dive in and explore the vast possibilities with Playwright!
However, if you're looking for more complex scraping requirements or need to scale up your scraping project, it may be beneficial to consider outsourcing your web scraping needs to a professional data extraction service provider like Datahut. With years of experience and expertise in web scraping, Datahut can help you with a range of scraping needs, including product reviews, competitor analysis, pricing data, and more. So, if you're looking for a reliable and efficient web scraping service, contact Datahut today!