  • Jerom Jo Manthara

How to scrape review data from Amazon and do sentiment analysis





Most online marketplaces easily overwhelm their customers with diverse choices. While a greater diversity and range of products can expand and satisfy user choice, it can also lead to paralysis by choice, making it challenging to decide on a product.


Marketplaces are filled with premium products offered by high-end brands. While their quality is assured, they might not be what a customer wants. Products of certain brands may lack the specific properties a customer needs.


Thus, customers turn to reviews. Customers who've already bought the product can leave a note on what it provides, its best properties, and some of the drawbacks it may have. These notes are critical and help new customers check whether the product offers what they want.


In this blog, we look into how to extract the reviews of various Bluetooth headphones offered by various premium brands on Amazon and find what sentiment each review represents. 


Here, we will see how we have made a scraper to extract reviews of five products, each belonging to a separate premium brand. Using sentiment analysis, we will then see how each extracted review is classified as positive, negative, or neutral.


A Dive into Sentiment Analysis


Now that we have examined our aim and what we achieve using our program, let's examine the concept of sentiment analysis before looking at the code.


To find the overall customer preference and product satisfaction, we use the tool of sentiment analysis.


Sentiment analysis is a technique from Natural Language Processing (NLP), which is a branch of Artificial Intelligence (AI). It is a computational technique that analyzes text data to determine the sentiment behind it. Here, we conduct sentiment analysis on the reviews we have scraped and classify each one as positive, negative, or neutral.


Categorizing sentiments as positive, negative, or neutral reduces a large body of free-text reviews to three counts that are easy to compare.


In this blog, we apply sentiment analysis to reviews that discuss specific properties, to check whether customers are satisfied with the premium Bluetooth headphones offered by high-end brands.


Extracting data using Web Scraping


Earlier, we saw how and why sentiment analysis is done on the reviews extracted from Amazon. So, how are these reviews extracted?


This is where web scraping, the art of extracting information from websites, comes into play.


Web scraping is an automatic method of obtaining large amounts of data from websites. Most of this data is unstructured HTML data, which is then converted into structured data in a spreadsheet or a database for use in various applications. There are many different ways to perform web scraping.
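

To make the step from unstructured HTML to structured data concrete, here is a minimal sketch that uses only Python's standard library. The HTML snippet and the "review-body" class name are made up for illustration and are not Amazon's real markup.

```python
import csv
import io
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collects the text inside every <span class="review-body"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []            # structured output: one dict per review
        self._in_review = False   # True while we are inside a review span

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "review-body") in attrs:
            self._in_review = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_review = False

    def handle_data(self, data):
        if self._in_review and data.strip():
            self.rows.append({"review": data.strip()})


# Hypothetical, simplified HTML standing in for a scraped page.
html = """
<div><span class="review-body">Great sound.</span></div>
<div><span class="review-body">Battery drains fast.</span></div>
"""

parser = ReviewParser()
parser.feed(html)

# Write the structured rows out as CSV text (a file would work the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["review"])
writer.writeheader()
writer.writerows(parser.rows)
csv_text = buffer.getvalue()
```

Our actual scraper relies on a browser automation library instead of a hand-written parser, but the idea is the same: locate the elements of interest and turn them into rows.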


With the introduction of dynamic content in websites, web scraping has also evolved to handle it. This brought about headless browsers and similar advanced concepts. 


One such concept is the headless browser, which our scraper uses. A headless browser is a web browser without a graphical user interface (GUI). This means that it runs without a visible window or tabs. Headless browsers are often used for automated tasks such as web scraping, automated testing, and other interactions with websites where a visible browser is not necessary.


Automation frameworks are often integrated with headless browsers to provide a more robust and efficient way to automate web testing. Headless browsers allow automation frameworks to run tests without opening a visible browser window, which can improve performance and reduce resource usage.


One such automation framework is Playwright. Playwright is a relatively new automation framework that supports multiple browser engines, including Chromium, Firefox, and WebKit, each of which can run headless. It is known for its speed and reliability and is often used for automated testing and web scraping. The use of Playwright in our program will be explained later in the blog, along with some more advanced scraping concepts.


Dissecting the code


Now that we've discussed our script's two critical components, let's dive into the code that brings them to life.

Let's begin with the scraper and then move on to the sentiment analyzer.


Scraper


Our scraper finds the review listing page from the product description page of the given product, extracts reviews about specific properties, and filters the reviews accordingly.

This means that the only attribute extracted here is the body of each filtered review.


Importing Libraries


We use some libraries to help us extract and store the data.

Our scraper uses only two libraries: hrequests and pandas.


Pandas: For easy data storage


Pandas offers many easy-to-use data structures, such as DataFrames and Series. The DataFrame is a tabular data structure similar to a spreadsheet. It stores and organizes data in rows and columns. A Pandas Series is like a single column in a spreadsheet, capable of holding one-dimensional data of any type: numbers, text, dates, or even more complex objects. 


In our scraper, the DataFrame structure stores the extracted data from the web page. The Series structure is used to add data to the DataFrame. We then use Pandas's .to_csv() method to export the DataFrame to a CSV file. A CSV file is a text file that stores tabular data in rows and columns, separated by commas.
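

As a small sketch of how these pieces fit together, the snippet below builds a DataFrame from one Series per property and exports it as CSV. The review texts are invented, and building the frame from a dict of Series is just one way to do it; we are not claiming it is exactly how the scraper adds its columns.

```python
import pandas as pd

# Invented sample data: properties can have different numbers of reviews.
reviews_by_property = {
    "sound quality": ["Great bass", "Crisp highs", "A bit flat"],
    "battery": ["Lasts two days"],
}

# One Series per property; pandas aligns them by index and pads the
# shorter column with NaN.
df = pd.DataFrame(
    {name: pd.Series(texts) for name, texts in reviews_by_property.items()}
)

# With no path argument, to_csv returns the CSV as a string.
# Passing a filename such as "Sony.csv" would write a file instead.
csv_text = df.to_csv(index=False)
```

Wrapping each list in a Series matters here: a plain dict of lists with unequal lengths would raise an error, while Series of unequal length are aligned and padded.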


Hrequests: Web scraping for humans


Hrequests, short for human requests, is a recently developed library that aims to solve many of the problems faced while scraping the websites available today.


Hrequests is a high-performance library that implements simple and uncomplicated browser automation. It does so using the Playwright library, which is how the earlier mentioned Playwright comes into our scraper. Hrequests supports both Chrome and Firefox, including browser extensions. The browser automation performed can either be headless or headful.

 

As explained earlier, a headless browser lacks a GUI and can only be controlled by software. In contrast, a headful browser has all the features of a headless browser plus a visible UI, allowing someone to control the browser manually.


To block web scrapers and automation tools in general, websites like Amazon add CAPTCHA challenges and similar safeguards. Hrequests accesses websites in a way that mimics human cursor movement and typing. It also generates realistic browser headers, which makes it harder for websites to recognize it as an automation tool.


Browser headers are mini messages exchanged between our browser and a website during every website visit. They act as little information packets that tell the server various details about our request, and the server responds with its own set of headers containing relevant information for our browser.
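

To see what such headers look like, here is a tiny standard-library sketch that builds a request (without sending it) and inspects the headers attached to it. The URL and header values are placeholders. Hrequests generates realistic values for us, so this is only to illustrate the concept.

```python
from urllib.request import Request

# Build a request object; nothing is sent over the network here.
request = Request(
    "https://www.example.com/",
    headers={
        # Placeholder values; a real browser sends much longer strings.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
        "Accept-Language": "en-US,en;q=0.9",
    },
)

# urllib normalizes header names to "Capitalized-lowercase" form.
sent_headers = dict(request.header_items())
```

A site that sees a missing or unusual User-Agent can guess that the visitor is a script, which is why realistic headers matter for scraping.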


Global constants


After importing the needed libraries, we initialize some global constants in our script. 

The following constants are defined at the beginning:

  • BASE_URL: The home URL of the site we are scraping, which is Amazon in our case.

  • PDP_LIST: This list contains five strings, each representing the URL of a different product detail page (PDP) on Amazon. Each string represents the PDP URL of a Bluetooth headphone belonging to a different premium brand.

  • BRAND_LIST: This list holds the brand of each product in PDP_LIST, in the same order.

  • PROPERTIES_LIST: This list defines the four product properties that will be analyzed: sound quality, battery, noise cancellation, and comfort.

  • MAX_REVIEW_COUNT_PER_PPTY: This integer limits the number of reviews scraped for each product property.
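

A sketch of how these constants might look is given below. The brand and property names come from this blog, but the Amazon domain, the product URLs, and the review limit are placeholders we made up for illustration.

```python
# Assumption: the exact Amazon domain is not stated in this post.
BASE_URL = "https://www.amazon.com"

# Placeholder paths; substitute the real product detail page URLs.
PDP_LIST = [
    f"{BASE_URL}/dp/PLACEHOLDER-SONY",
    f"{BASE_URL}/dp/PLACEHOLDER-SENNHEISER",
    f"{BASE_URL}/dp/PLACEHOLDER-SKULLCANDY",
    f"{BASE_URL}/dp/PLACEHOLDER-ONEPLUS",
    f"{BASE_URL}/dp/PLACEHOLDER-JBL",
]

# Same order as PDP_LIST, so the two lists can be zipped together.
BRAND_LIST = ["Sony", "Sennheiser", "Skullcandy", "OnePlus", "JBL"]

PROPERTIES_LIST = ["sound quality", "battery", "noise cancellation", "comfort"]

# Assumed value; the post does not say what limit was used.
MAX_REVIEW_COUNT_PER_PPTY = 100
```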


Main workflow


Let us examine the scraper's main workflow, the function that forms its structure. The program loops through each PDP in PDP_LIST and extracts data during each iteration.


In each iteration, a new browser session is started and then a get request is sent to the PDP of that iteration. At the end of each iteration, the browser session that was started is closed. The browser session also emulates human behavior by setting the mock_human parameter to True.


We start a new browser session for each iteration because reusing the same session can cause bugs that stop our scraper from working properly.


In each iteration, the URL of the page containing all the product reviews is found and loaded. This review listing page contains sorting options and a search bar. We use the search bar to search for each of the properties we have defined and extract the reviews obtained in each case.


A loop is initiated to obtain the reviews related to each property. The extracted reviews are stored in a pandas DataFrame, which is later converted into a CSV file. Before moving on to the next property, the initial review listing page is loaded again to clear the search bar.


After the reviews related to all the properties have been extracted and stored in a DataFrame, the DataFrame is converted into a CSV file whose name is the brand considered in that particular iteration.
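

The workflow described above can be sketched roughly as follows. This is our own outline, not the original script: the helper functions are passed in as arguments (all names are hypothetical), and the `BrowserSession(mock_human=True)`, `goto`, and `close` calls are based on our reading of the hrequests documentation, so please verify them against your installed version.

```python
def open_browser_session():
    """Start a human-like browser session (needs `pip install hrequests`)."""
    import hrequests  # imported lazily so the loop below runs without it

    return hrequests.BrowserSession(mock_human=True)


def run_scraper(pdp_list, brand_list, properties, max_reviews, *,
                find_listing_url, filter_reviews, extract_reviews, save,
                open_session=open_browser_session):
    """Loop over products; for each one, collect reviews per property."""
    for pdp_url, brand in zip(pdp_list, brand_list):
        page = open_session()              # fresh session for every product
        try:
            page.goto(pdp_url)
            listing_url = find_listing_url(page)
            if listing_url is None:
                continue                   # no review link: skip this product
            page.goto(listing_url)
            reviews = {}
            for prop in properties:
                if filter_reviews(page, listing_url, prop):
                    reviews[prop] = extract_reviews(page, max_reviews)
                page.goto(listing_url)     # reload to clear the search bar
            save(brand, reviews)           # e.g. write "<brand>.csv"
        finally:
            page.close()                   # always close the session
```

Here `save` could build a pandas DataFrame from the dictionary and call `.to_csv()` with the brand as the file name. Passing the helpers in as arguments keeps the loop easy to test with stand-in objects.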


Let us look into the helper functions used in the main workflow.


Finding the review listing page


This function finds the link to the review listing page by looking for an <a> tag with the particular 'data-hook' attribute.


We then grab the list of links present in that <a> tag. Only one link exists in the tag, and the function returns it.


If the <a> tag is not found, that particular product will be skipped.
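

A minimal sketch of such a helper is shown below. The selector value is our assumption (the post does not state the exact 'data-hook' value), and `page.html.find_all` and `absolute_links` are the attribute names we understand hrequests to expose, so treat them as something to verify rather than a confirmed API.

```python
# Assumed selector; check the live page for the actual data-hook value.
REVIEWS_LINK_SELECTOR = 'a[data-hook="see-all-reviews-link-foot"]'


def find_review_listing_url(page, selector=REVIEWS_LINK_SELECTOR):
    """Return the review listing URL, or None if the link is missing."""
    matches = page.html.find_all(selector)
    if not matches:
        return None                      # caller skips this product
    links = list(matches[0].absolute_links)
    return links[0] if links else None
```

Returning None instead of raising lets the main loop simply skip the product, as described above.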


Filtering reviews


This function filters the reviews using the search bar on the review listing page. The search bar is found by looking for an <input> element with the particular id.


After finding the search bar, the property is typed into it, and then the search button is found by looking for an <input> element with the particular 'aria-labelledby' attribute. The search button is then clicked, and the filtered reviews are obtained.


In some cases, the search may not go through, so the reviews are not filtered. Such cases are detected by checking whether the current URL of the browser contains the string 'filterByKeyword'. If it does not, the browser loads the page again, the filtering is repeated, and the URL is checked once more.


This retry can occur a maximum of four times, and if the filter isn't applied correctly in any of those attempts, the property is skipped.


The same applies if the search bar is not found. In such cases, the page is loaded again, and the scraper looks for the search bar once more. If this also fails four times, the property is skipped.
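

The retry logic can be sketched as below. Both selectors are assumptions of ours, and we read "a maximum of four times" as four attempts in total, which may differ from the original script. The `type`, `click`, `goto`, and `url` members are what we understand an hrequests browser session to offer.

```python
SEARCH_BOX = "#filterByKeywordTextBox"                            # assumed id
SEARCH_BUTTON = 'input[aria-labelledby="a-autoid-1-announce"]'    # assumed value
MAX_ATTEMPTS = 4


def filter_reviews(page, listing_url, keyword):
    """Search the reviews for `keyword`; return True if the filter applied."""
    for _ in range(MAX_ATTEMPTS):
        try:
            page.type(SEARCH_BOX, keyword)
            page.click(SEARCH_BUTTON)
        except Exception:
            pass                          # search bar or button not found
        else:
            if "filterByKeyword" in page.url:
                return True               # filter is active
        page.goto(listing_url)            # reload and try again
    return False                          # caller skips this property
```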


Extracting review data


This function obtains each filtered review by looking for a <div> element with the given class. The body of the review is then extracted by finding the text content of the <span> tag with the given 'data-hook' attribute.


The extracted review body is then stored in a list, which, at the end, is returned.

Amazon spreads reviews across multiple pages, each showing only ten reviews. This means we must move through several pages to obtain the required number of reviews.


The function moves to the next page of reviews by finding the <li> tag that holds the next button on the page. The URL present in this tag is taken and loaded, which brings up the next review page.
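

Putting extraction and pagination together, a rough sketch could look like this. All three selectors are our assumptions, and `find_all`, `text`, and `absolute_links` are the names we believe hrequests uses for parsed elements; the original function may differ in its details.

```python
REVIEW_CARD = "div.review"                          # assumed class
REVIEW_BODY = 'span[data-hook="review-body"]'       # assumed data-hook value
NEXT_BUTTON = "li.a-last"                           # assumed next-button item


def extract_reviews(page, limit):
    """Collect up to `limit` review bodies, following next-page links."""
    reviews = []
    while len(reviews) < limit:
        for card in page.html.find_all(REVIEW_CARD):
            bodies = card.find_all(REVIEW_BODY)
            if bodies:
                reviews.append(bodies[0].text.strip())
            if len(reviews) >= limit:
                return reviews
        # Look for the next button; stop when it has no link (last page).
        next_buttons = page.html.find_all(NEXT_BUTTON)
        next_links = list(next_buttons[0].absolute_links) if next_buttons else []
        if not next_links:
            break
        page.goto(next_links[0])
    return reviews
```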


Sentiment Analyzer


Now that we have seen how we scrape the reviews connected to each property, let's look at how sentiment analysis tells us whether those reviews are positive, negative, or neutral. From that, we can understand whether the product has the needed property and whether customers are satisfied with it.


Importing Libraries


We conduct sentiment analysis on the scraped data using the TextBlob and Pandas libraries. Since we have already examined pandas, let's examine the TextBlob library.


TextBlob: Simplified Text Processing


TextBlob is a Python library for processing textual data. It provides a simple API for diving into standard natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.


We use TextBlob to conduct sentiment analysis on each review and see whether it is positive, negative, or even neutral.


Global constants


After importing the needed libraries, we initialize some global constants in our script. 

The following constants are defined at the beginning:

  • BRAND_LIST: This list corresponds to the brands of the products whose reviews we have scraped.


Main workflow


Let's now look into the main workflow of our sentiment analyzer.


The analyzer goes through each data file and conducts sentiment analysis on each column corresponding to a particular property. The total result is saved as a CSV file. A loop is initiated to access each of the data files present. The data files are accessed by using the brand name of the product, which is also used as the name of the CSV file.


In the data file, each property and its corresponding reviews are present as a column. So, each column is accessed one by one.


Sentiment analysis is done on each column value. Data like brand name, property name, number of reviews present, and number of positive, negative, and neutral reviews are stored in a pandas DataFrame object.


After going through all the brands, the data stored in the DataFrame is saved as a CSV file.
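

The analyzer's loop can be sketched as follows. The function and column names are hypothetical, and the sentiment counter is passed in as an argument so that the loop itself stays independent of any particular NLP library.

```python
import pandas as pd


def summarize(frames, count_sentiments):
    """Build one summary row per (brand, property) pair.

    frames: dict mapping brand name -> DataFrame with one column per property.
    count_sentiments: callable returning (positive, negative, neutral) counts.
    """
    rows = []
    for brand, frame in frames.items():
        for prop in frame.columns:
            reviews = frame[prop].dropna()      # shorter columns are NaN-padded
            positive, negative, neutral = count_sentiments(reviews)
            rows.append({
                "brand": brand,
                "property": prop,
                "reviews": len(reviews),
                "positive": positive,
                "negative": negative,
                "neutral": neutral,
            })
    return pd.DataFrame(rows)
```

In practice, the frames could be loaded with something like `{brand: pd.read_csv(f"{brand}.csv") for brand in BRAND_LIST}`, and the returned DataFrame saved with `.to_csv()`.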

Let us look into the helper function used in the main workflow.


Finding the sentiment of each review


The function takes a DataFrame object corresponding to a column of the data file as input and finds the polarity of each review in the column.


A key aspect of sentiment analysis is polarity classification. Polarity refers to the overall sentiment conveyed by a particular text, phrase, or word. This polarity can be expressed as a numerical rating known as a "sentiment score." For example, this score can be between -1 and 1, with 0 representing neutral sentiment. This score could be calculated for an entire text or an individual phrase.


When the polarity is positive, the given text is positive; if the polarity is negative, the text is negative; and a polarity of 0 means the text is neutral.


The function returns the count of positive, negative, and neutral reviews.
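

A minimal sketch of this helper is given below. `TextBlob(text).sentiment.polarity` is TextBlob's polarity score in the range -1 to 1. The function names and the strict "greater than zero / less than zero" thresholds are our assumptions about how the original classifies reviews.

```python
def textblob_polarity(text):
    """Polarity in [-1, 1] from TextBlob (needs `pip install textblob`)."""
    from textblob import TextBlob  # imported lazily; only needed for real runs

    return TextBlob(text).sentiment.polarity


def classify(polarity):
    """Map a polarity score to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def count_sentiments(reviews, polarity_of=textblob_polarity):
    """Return (positive, negative, neutral) counts for an iterable of reviews."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for review in reviews:
        counts[classify(polarity_of(str(review)))] += 1
    return counts["positive"], counts["negative"], counts["neutral"]
```

Because a pandas column can be iterated like a list, the same function works whether it receives a plain list of strings or a column of the data file.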


Amazon Sentiment Analysis for Headphones - Visualization


Advancements in technology have brought great conveniences to people all over the world. One such convenient technology is headphones. In audio technology, headphones are an indispensable gadget found almost everywhere. From noisy places like bustling city streets to quiet places like libraries, headphones help us escape into a different world.


Going one step further, by removing cords and using Bluetooth technology, headphones have reached a new level of convenience and freedom. With this, the popularity of headphones has risen greatly. This popularity and demand have caused the inflow of a wide variety and range of headphones.


However, with so many options available, choosing the right Bluetooth headphones for our needs can feel overwhelming. Thus, it is necessary to properly differentiate headphones in a simple and easy-to-understand manner.


This blog will examine five Bluetooth headphones, each offered by a different premium brand and available on Amazon. We'll examine whether various qualities are present in these products and whether these qualities satisfy customers.


The data for this study is obtained through Datahut's web scraping platform.


Using this data, we aim to understand whether some of the most needed qualities are present in products offered by premium brands and whether or not customers are satisfied with these qualities. We also look into which of the five brands performs best when each quality is considered individually.


Data Overview


The data we used for the analysis is information about various brands, focusing on specific properties like sound quality, battery life, noise cancellation, and comfort.


The data basically contains the result of sentiment analysis done on reviews of a particular product obtained from Amazon. The reviews were selected so that they discussed the properties that we are focusing on.


The data consists of each property and the number of reviews giving positive, negative, and neutral feedback. 


We created a sentiment score for each brand's property using the data present.


Sentiment scores measure how positive or negative the feedback on a particular product is. Our sentiment scoring assesses the tone of the reviews on a spectrum of positive 100 to negative 100, with zero being neutral.


This means that products with the needed properties will have a positive value and a high positive value means that the customers are very much satisfied by that property of the product. Conversely, a negative sentiment score would indicate customer dissatisfaction or negative feedback regarding the property in question.
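

The exact formula behind our sentiment score is not spelled out here. One common way to get a score between -100 and 100 from the three counts is the net share of positive reviews, sketched below purely as an illustrative assumption.

```python
def sentiment_score(positive, negative, neutral):
    """Net sentiment on a -100..100 scale (an assumed, illustrative formula)."""
    total = positive + negative + neutral
    if total == 0:
        raise ValueError("at least one review is needed to compute a score")
    return round(100 * (positive - negative) / total)
```

For example, 8 positive, 1 negative, and 1 neutral review would give a score of 70 under this definition, while a property with only positive reviews would score 100.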


Brand analysis


From the data, we can see that all products have a positive sentiment score for every property. This means all the Bluetooth headphones we considered have all the necessary properties. But this doesn't mean that customers are equally satisfied with all of them.

Let's examine each brand in detail and then check how it does when a single property is considered.


Remember that when we talk about the brand in the blog, we point to the product that was analyzed and not the brand as a whole.


Sony


  • Sound Quality: Sony received high positive feedback on sound quality, with a sentiment score of 96, indicating high customer satisfaction.

  • Battery: While moderately satisfying, Sony's battery performance received relatively lower ratings than other properties, with a sentiment score of 69.

  • Noise Cancellation: Sony's noise cancellation feature received limited reviews, but it achieved a positive sentiment score of 86. As the number of reviews was limited, the score can change once the number of buyers increases.

  • Comfort: Sony's comfort received a sentiment score of 70. But similar to the case of Noise Cancellation, Comfort also had a limited number of reviews.


Sennheiser


Sennheiser's performance across all properties was highly praised, with sentiment scores of 100 for battery life, noise cancellation, and comfort. Sennheiser's sound quality had a sentiment score of 85.

This shows that all Sennheiser properties have gained high customer satisfaction. However, there is some doubt about the battery, as there are a limited number of reviews available for it.


Skullcandy


  • Sound Quality: Skullcandy had a sentiment score of 52 for its sound quality, as some negative reviews were mixed in with the positive ones.

  • Battery: Skullcandy's battery life gained greater satisfaction, with a sentiment score of 70. 

  • Noise Cancellation: Skullcandy's noise cancellation feature also received positive sentiment, scoring 71. 

  • Comfort: This property has the highest sentiment score among all Skullcandy properties, 89. So, we can say that customers are very comfortable with Skullcandy.

OnePlus


  • Sound quality: OnePlus's sound quality received exceptionally positive feedback, with a sentiment score of 100. Customers were delighted with it.

  • Battery: OnePlus also received high positive reviews for its battery, which earned it a sentiment score of 84.

  • Noise Cancellation: OnePlus's noise cancellation feature also received positive sentiment, scoring 76.

  • Comfort: Comfort also received favorable reviews, with a sentiment score of 82.

JBL


  • Sound quality: JBL's sound quality received positive feedback, with a sentiment score of 88.

  • Battery: JBL's battery performance garnered moderate satisfaction, with a sentiment score of 65.

  • Noise Cancellation: Very few reviews were available that discussed JBL's noise cancellation feature, but JBL achieved a perfect sentiment score of 100 for noise cancellation.

  • Comfort: Comfort also received high positive feedback, with a sentiment score of 80.


Now that we have examined each brand individually, let's examine how the brands compare with one another on each property.


Sound Quality Comparison of Products from Premium Brands


Sound quality is critical when choosing headphones, as it significantly influences the enjoyment and appreciation of whatever we are listening to. For casual listening, professional audio production, or immersive gaming experiences, high-quality sound reproduction can elevate the overall auditory experience and enhance enjoyment.

  • The graph shows that users are generally satisfied with the sound quality of headphones offered by various premium brands. 

  • Sentiment scores for sound quality range from 52 to 100.

  • OnePlus has the highest sentiment score of 100 for its sound quality, and Skullcandy has the lowest sentiment score of 52.


Battery Comparison of Products from Premium Brands


The battery life of headphones refers to the duration for which they can operate on a single charge. It is an essential factor, especially for us, since we have considered wireless headphones for our analysis.


Headphones with longer battery life are ideal for individuals who lead active lifestyles, such as commuters, travelers, or fitness enthusiasts, as they can enjoy hours of entertainment or communication without worrying about running out of battery.


Battery life, therefore, plays a significant role in the overall user experience of headphones, offering convenience, flexibility, and uninterrupted enjoyment of audio content or communication.


  • The graph shows that even though customers are generally satisfied with the battery, there is still room for improvement.

  • Sentiment scores for battery range from 65 to 100.

  • Sennheiser has the highest satisfaction with battery, with a sentiment score of 100, and JBL has the least satisfaction with battery, with a sentiment score of 65.

Noise Cancellation Comparison of Products from Premium Brands


Noise cancellation is a feature in headphones that enhances the listening experience by reducing unwanted noise in the surrounding environment. 


By effectively blocking out the surrounding noise, headphones allow users to focus on their audio content without increasing volume levels, reducing the risk of hearing damage. Noise cancellation thus offers a significant improvement in the listening experience.

  • The graph shows that customers are greatly satisfied with the noise-cancellation feature of these headphones.

  • Sentiment scores for noise cancellation range from 71 to 100.

  • Sennheiser and JBL have the highest satisfaction with noise cancellation, with a sentiment score of 100, and Skullcandy has the least satisfaction with noise cancellation, with a sentiment score of 71.

Comfort Comparison of Products from Premium Brands


Comfortable headphones are essential for providing an enjoyable and fatigue-free listening experience, especially when users must wear headphones for extended periods. Choosing headphones that prioritize comfort ensures that users can focus on tasks or entertainment without distraction or discomfort.

  • From the graph, customers are delighted with the comfort level offered by their headphones.

  • Sentiment scores for comfort range from 70 to 100.

  • Sennheiser has the highest satisfaction with comfort, with a sentiment score of 100, and Sony has the least satisfaction with comfort, with a sentiment score of 70.



From our analysis, customers' overall sentiment towards products of premium brands is positive. Although some products came close to a perfect score on every feature we considered, none reached that level. No product is perfect, and customers have to decide which one to buy by considering what they are looking for and which brand's product satisfies that need.


The above analysis can help users decide which product to buy. If they are looking for overall high quality, Sennheiser is the one. If they care only about sound quality, OnePlus is the way to go. 


In this way, customers can use our analysis to choose their products.
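

The comparisons above can be reproduced from the sentiment scores reported in this blog with a few lines of Python. The helper names are ours, and the unweighted average across properties is only an illustrative summary that we add here; it was not part of the original analysis.

```python
# Sentiment scores as reported in the brand analysis above.
SCORES = {
    "Sony":       {"sound quality": 96,  "battery": 69,  "noise cancellation": 86,  "comfort": 70},
    "Sennheiser": {"sound quality": 85,  "battery": 100, "noise cancellation": 100, "comfort": 100},
    "Skullcandy": {"sound quality": 52,  "battery": 70,  "noise cancellation": 71,  "comfort": 89},
    "OnePlus":    {"sound quality": 100, "battery": 84,  "noise cancellation": 76,  "comfort": 82},
    "JBL":        {"sound quality": 88,  "battery": 65,  "noise cancellation": 100, "comfort": 80},
}


def leaders(scores, prop, pick=max):
    """Return (brands, score) for the best (or, with pick=min, worst) score."""
    target = pick(brand_scores[prop] for brand_scores in scores.values())
    brands = sorted(b for b, s in scores.items() if s[prop] == target)
    return brands, target


def mean_score(scores, brand):
    """Unweighted average of a brand's scores across all properties."""
    values = scores[brand].values()
    return sum(values) / len(values)


best_overall = max(SCORES, key=lambda brand: mean_score(SCORES, brand))
```

Returning a list of brands from `leaders` handles ties, such as Sennheiser and JBL sharing the top noise cancellation score.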


Conclusion


In summary, our exploration into sentiment analysis of headphone brands on Amazon reveals the power of data in understanding consumer preferences and market trends. By harnessing the capabilities of web scraping and sentiment analysis, businesses can gain invaluable insights that drive strategic decisions, enhance product development, and improve customer satisfaction.


Want to better understand your customers and build a stronger brand? Datahut can help! We use special tools to collect data online and analyze what people say about your brand. This gives us valuable insights you can use to improve your strategy and connect with your customers on a deeper level. Ready to get started? Contact us today and see how data can help your business thrive!















