Are you looking for a way to scrape Amazon reviews but do not know where to begin? If so, you may find this blog very useful.
In this blog, we will discuss scraping Amazon reviews using Scrapy in Python. Web scraping is a simple means of collecting data from different websites, and Scrapy is a web crawling framework written in Python. Web scraping lets users gather data for their own needs, for example, online merchandising, price monitoring, and driving marketing decisions.
In case you are wondering whether this process is even legal or not, you can find the answer to this query here.
Before digging into scraping Amazon for product reviews, let us first look at a few reasons for scraping Amazon reviews in the first place.
Why the need for scraping Amazon reviews?
Sentiment analysis over the product reviews: Sentiment analysis can be performed over the reviews scraped from products on Amazon. Such a study helps in identifying users’ emotions towards a particular product. This can help sellers, and even prospective buyers, understand the public sentiment related to the product.
Optimizing dropshipping sales: Dropshipping is a type of business in which a company works without an inventory or a warehouse for storing its products. You can use web scraping to get product pricing, gather user opinions, understand the needs of the customer, and follow trends.
Web scraping for online reputation monitoring: It is difficult for large-scale companies to monitor the reputation of their products. Web scraping can help in extracting relevant review data, which can act as input to different analysis tools that measure users’ sentiment towards the organization.
What is Scrapy?
Scrapy is a web crawling framework in which a developer writes code, called a spider, that defines how a particular site (or a group of websites) will be scraped. Its most significant feature is that it is built on Twisted, an asynchronous networking library, which makes spiders perform very well.
Let us now have a look at the basic pipeline for scraping Amazon reviews.
Scraping Amazon reviews Pipeline
It is essential to have a holistic idea of the work before you start, which in our case is scraping Amazon reviews. Hence, before we begin the implementation with Scrapy, let us take a high-level look at the complete pipeline. In this section, we will look at the different stages involved in scraping Amazon reviews, with a short description of each. This will give you an overall idea of the task that we will carry out in Python in the later section.
Analyzing the HTML structure of the webpage: Scraping is about finding patterns in web pages and extracting them. Before starting to write a scraper, we need to understand the HTML structure of the target web page and identify patterns in it. A pattern can be the repeated use of classes, ids, and other HTML elements.
Scrapy parser implementation in Python: After analyzing the structure of the target web page, we work on the implementation in Python. The Scrapy parser’s responsibility is to visit the targeted web page and extract the information according to the specified rules.
Collection and storage of information: The parser can write the results in whichever format you wish, be it CSV or JSON. This is the final output file in which your scraped data resides.
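The storage stage can be sketched with the standard library alone. The snippet below is only an illustration and is not part of Scrapy: the function name save_reviews, the field names stars and comment, and the file paths passed to it are assumptions chosen for this example.

```python
import csv
import json


def save_reviews(reviews, csv_path, json_path):
    """Write a list of review dicts to both a CSV file and a JSON file."""
    fieldnames = ["stars", "comment"]  # assumed field names

    # CSV: one header row, then one row per review
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(reviews)

    # JSON: the whole list written as a single array
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(reviews, f, ensure_ascii=False, indent=2)
```

In practice Scrapy can take care of this step itself when an output file is passed on the command line, so a helper like this is mainly useful for seeing what the two formats contain.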
Python code implementation for scraping Amazon reviews
We will start by installing Scrapy on our system. There are two cases here. If you are using conda, you can install Scrapy from the conda-forge channel using the following command.
conda install -c conda-forge scrapy
In case you are not using conda, you can use pip and directly install it in your system using the below command.
pip install scrapy
Next, we create a Scrapy project. A Scrapy project lets users collect the different components of their crawlers in a single folder. To create a Scrapy project, use the following command.
scrapy startproject amazon_reviews_scraping
Once you have created the project, you will find two items in it. One is a folder that contains your Scrapy code, and the other is your Scrapy configuration file, scrapy.cfg. The Scrapy configuration file helps in running and deploying the Scrapy project on a server.
Scrapy config file
Once we have the project in place, we need to create a spider. A spider is a chunk of Python code that determines how a web page will be scraped. It is the main component that crawls different web pages and extracts content from them. In our case, this will be the code chunk that visits Amazon and scrapes the reviews. To create a spider, you can use the following command.
scrapy genspider amazon_reviews your-link-here
The spider is created in the spiders folder inside the project directory. Once you go into the Scrapy project, you will see a directory structure like the one below.
Scrapy project directory structure
Scrapy files description
Let us understand the Scrapy project structure and its supporting files in a bit more detail. The main files inside the Scrapy project directory include:
items.py: Items are containers that will be loaded with the scraped data.
middlewares.py: The spider middleware is a framework of hooks into Scrapy’s spider processing mechanism where you can plug in custom functionality to process the responses that are sent to spiders and to handle the requests and items that spiders generate.
pipelines.py: After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially. Each item pipeline component is a Python class.
settings.py: It allows one to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and the spiders themselves.
spiders folder: This directory contains all spiders/crawlers as Python classes. Whenever one runs a spider, Scrapy looks into this directory and tries to find the spider with the name provided by the user. Spiders define how a certain site or a group of sites will be scraped, including how to perform the crawl and how to extract data from their pages.
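To make the pipelines.py description more concrete, here is a minimal sketch of an item pipeline component. The class name CleanReviewPipeline is hypothetical, the item is assumed to be a dict with stars and comment fields, and the rating text is assumed to look like "4.0 out of 5 stars". Only the process_item(self, item, spider) method signature is what Scrapy itself expects from a pipeline component.

```python
import re


class CleanReviewPipeline:
    """Hypothetical item pipeline that tidies each scraped review."""

    def process_item(self, item, spider):
        # Collapse runs of whitespace in the comment text
        item["comment"] = " ".join(str(item.get("comment", "")).split())

        # Assumed rating format: "4.0 out of 5 stars" -> 4.0
        match = re.match(r"\s*(\d+(?:\.\d+)?)", str(item.get("stars", "")))
        item["stars"] = float(match.group(1)) if match else None

        # Return the item so that the next pipeline component receives it
        return item
```

To try it, the class would go in pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py, for example with the key 'amazon_reviews_scraping.pipelines.CleanReviewPipeline' and an order value such as 300. This step is optional, and the rest of this blog works with the raw rating text.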
For more detailed information on Scrapy components, you can refer to this link.
Analyzing HTML structure of the webpage
Before we start writing the spider implementation in Python for scraping Amazon reviews, we need to identify patterns in the target web page. Below is the page we are trying to scrape, which contains reviews of the MacBook Air on Amazon.
Amazon reviews web page
We start by opening the web page using the inspect-element feature in the browser. There you can see the HTML code of the web page. After a little bit of exploration, I found the following HTML structure which renders the reviews on the web page.
HTML code snippet for Amazon reviews
On the reviews page, there is a division with the id cm_cr-review_list. This division contains multiple sub-divisions, within which the review content resides. We plan to extract both the star ratings and the review comments from the web page. We need to go one level deeper into one of these sub-divisions to work out a scheme for fetching both the star rating and the review comment.
Detailed HTML code snippet of reviews
Upon further inspection, we can see that every review sub-division is further divided into multiple blocks. One of these blocks contains the star rating, and another contains the review text. Looking more closely, we can see that the star rating block carries the class “review-rating” and the review text carries the class “review-text”. All we need to do now is pick these patterns up using our Scrapy parser.
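To see this pattern in isolation, the sketch below picks out the two classes from a small piece of markup. It deliberately uses Python’s standard-library HTMLParser rather than Scrapy, and the sample HTML is hypothetical and heavily simplified: only the id cm_cr-review_list and the classes review-rating and review-text come from the page described above, while the tag names and review texts are assumptions.

```python
from html.parser import HTMLParser


class ReviewPatternParser(HTMLParser):
    """Collect the text inside elements whose class list contains a target class."""

    TARGETS = ("review-rating", "review-text")
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.found = {name: [] for name in self.TARGETS}
        self._current = None  # target class currently being captured
        self._depth = 0       # nesting depth inside the captured element
        self._buffer = []     # text fragments of the captured element

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return  # void elements never get a closing tag
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in self.TARGETS:
            if name in classes:
                self._current, self._depth, self._buffer = name, 1, []
                break

    def handle_endtag(self, tag):
        if self._current is None or tag in self.VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join the fragments and normalise whitespace
            text = " ".join(" ".join(self._buffer).split())
            self.found[self._current].append(text)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)


# Hypothetical markup that mirrors the pattern described above
SAMPLE_HTML = """
<div id="cm_cr-review_list">
  <div class="review">
    <i class="a-icon review-rating"><span>5.0 out of 5 stars</span></i>
    <span class="review-text"><span>Great laptop.<br>Battery lasts all day.</span></span>
  </div>
  <div class="review">
    <i class="a-icon review-rating"><span>3.0 out of 5 stars</span></i>
    <span class="review-text"><span>Decent, but pricey.</span></span>
  </div>
</div>
"""

parser = ReviewPatternParser()
parser.feed(SAMPLE_HTML)

# Pair each rating with the comment found at the same position
reviews = [
    {"stars": stars, "comment": comment}
    for stars, comment in zip(parser.found["review-rating"], parser.found["review-text"])
]
```

Scrapy’s CSS selectors express the same idea far more compactly, which is why the spider itself relies on them.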
Defining Scrapy Parser in Python
Now once we have our spider template ready and we have analyzed the pattern on the target web page, we can start writing the logic for the extraction of reviews from Amazon. We begin by extending the Spider class and mentioning the URLs we plan on scraping. Variable start_urls contains the list of the URLs to be crawled by the spider.
Basic Scrapy spider template
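When the reviews span many pages, the list held in start_urls can be generated rather than typed out. The sketch below assumes that the page number is simply appended to a base URL. The helper name build_start_urls and the example.com address are hypothetical and stand in for the real review URL.

```python
def build_start_urls(base_url, first_page, last_page):
    """Return one URL per review page by appending the page number to base_url."""
    return [f"{base_url}{page}" for page in range(first_page, last_page + 1)]


# Hypothetical base URL that ends in a page-number query parameter
EXAMPLE_BASE_URL = "https://www.example.com/product-reviews/ITEM123?pageNumber="
start_urls = build_start_urls(EXAMPLE_BASE_URL, 1, 3)
```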
Then we need to define a parse function, which is called whenever our spider visits a new page. In the parse function, we identify the patterns in the targeted page structure. The spider then looks for these patterns and extracts them from the web page.
Below is a code sample of Scrapy parser for scraping Amazon reviews.
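# Importing the Scrapy library
import scrapy

# Creating a new class to implement the spider
class AmazonReviewsSpider(scrapy.Spider):
    # Spider name
    name = 'amazon_reviews'
    # Domain names to scrape
    allowed_domains = ['amazon.in']
    # Base URL for the MacBook Air reviews (truncated in this listing; the page number is appended to the full URL)
    myBaseUrl = "https://www.amazon.in/Apple-MacBook-Air-13-3-inch-MQD32HN/product-"
    start_urls = []
    # Creating the list of URLs to be scraped by appending the page number at the end of the base URL
    for i in range(1, 121):
        start_urls.append(myBaseUrl + str(i))

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('#cm_cr-review_list')
        # Collecting product star ratings
        star_rating = data.css('.review-rating')
        # Collecting user reviews
        comments = data.css('.review-text')
        count = 0
        # Combining the results: one item per rating, paired with the comment at the same position
        for review in star_rating:
            yield {'stars': ''.join(review.xpath('.//text()').getall()),
                   'comment': ''.join(comments[count].xpath('.//text()').getall())}
            count = count + 1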
Storing Scraped Results
Finally, we have built our spider. The only task left is to run it. We can run the spider with the runspider command, which takes the spider file to run and, through the -o option, the output file in which to store the collected results. In the case below, the spider file is amazon_reviews.py and the output file is reviews.csv.
scrapy runspider amazon_reviews_scraping/amazon_reviews_scraping/spiders/amazon_reviews.py -o reviews.csv
EDA on Amazon reviews
In this section, we will try to do some exploratory data analysis on the data obtained after scraping Amazon reviews. We will be counting the overall rating of the product along with the most common words used for the product. Using pandas, we can read the CSV containing the scraped data.
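import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv("reviews.csv")  # output file written by the spider run above
summarised_results = dataset["stars"].value_counts()  # number of reviews per star rating
plt.bar(summarised_results.index, summarised_results.values)
plt.show()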
The above code summarises all the ratings and finds their total count. After that, it plots a bar chart to visualize the findings. We have used the matplotlib library here to visualize the results.
Distribution of star ratings
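If pandas is not available, the same tally can be approximated with the standard library. This is a sketch under assumptions: the rows are dicts with a stars key, in the shape that csv.DictReader would produce from reviews.csv, and the sample rows below are invented for illustration.

```python
from collections import Counter


def count_ratings(rows):
    """Count how often each distinct value of the "stars" field occurs."""
    return Counter(row["stars"] for row in rows)


# Hypothetical rows in the shape produced by csv.DictReader on reviews.csv
sample_rows = [
    {"stars": "5.0 out of 5 stars", "comment": "Great laptop"},
    {"stars": "5.0 out of 5 stars", "comment": "Awesome"},
    {"stars": "1.0 out of 5 stars", "comment": "Arrived damaged"},
]
rating_counts = count_ratings(sample_rows)

# most_common() lists the ratings from most to least frequent, much like value_counts()
ranked = rating_counts.most_common()
```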
Let us now try to visualize some of the keywords that are present in the scraped reviews. We can visualize these keywords using a word cloud. A word cloud works on the principle that the most frequent words in the text should appear larger and bolder than the rest. The code snippet below can help you make a word cloud in Python.
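from wordcloud import WordCloud
# Concatenating all lowercased comments into one string
words = ""
for msg in dataset["comment"]:
    msg = str(msg).lower()
    words = words + msg + " "
wordcloud = WordCloud(width=3000, height=2500, background_color='white').generate(words)
# Setting the figure size to 14 x 7 inches
plt.rcParams["figure.figsize"] = (14, 7)
# Displaying the word cloud without axes
plt.imshow(wordcloud)
plt.axis("off")
plt.show()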
The image below is a word cloud generated by the above code snippet. Words such as laptop, apple, product, and Amazon appear in larger, bolder type, which indicates that they occur frequently in the reviews. This makes sense, because we scraped user reviews of the MacBook Air from Amazon. You can also see words like amazing, good, awesome, and excellent, which suggests that many users liked the product.
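The principle behind the word cloud, that more frequent words are drawn more prominently, comes down to a word count. The sketch below shows that count with the standard library. The helper name top_words, the short stop-word list, and the sample comments are assumptions made for illustration and are not taken from the scraped data.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real analysis would use a fuller one
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "this", "for", "of", "to", "i"}


def top_words(comments, n=5):
    """Return the n most frequent non-stop-words across all comments."""
    counts = Counter()
    for comment in comments:
        # Lowercase the comment and keep only alphabetic tokens
        tokens = re.findall(r"[a-z']+", str(comment).lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical comments standing in for dataset["comment"]
sample_comments = [
    "Amazing laptop, the battery is excellent",
    "Good product and amazing display",
    "Amazing! This laptop is good for travel",
]
most_frequent = top_words(sample_comments, 3)
```

The words at the top of such a list are the ones a word cloud would render in the largest type.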
Datahut as your reliable scraping partner
There are a lot of tools that can help you scrape data yourself. However, if you need professional assistance, companies like Datahut can help you. We have a well-structured and transparent process for the same. We have helped enterprises across various industrial verticals. From assistance to the recruitment industry to retail solutions, Datahut has designed sophisticated solutions for most of these use-cases.
You should join the bandwagon of using data-scraping in your operations before it is too late. It will help you boost the performance of your organization. Furthermore, it will help you derive insights that you might not know currently. This will enable informed decision-making in your business processes.
Using Scrapy, we were able to devise a method for scraping Amazon reviews using Python.
Additionally, there can be some roadblocks while scraping Amazon reviews, as Amazon tends to block IPs that send it requests too frequently. This can be a hindrance to your work.
In such cases, rotate your IP addresses periodically and send requests to the Amazon server less frequently to avoid being blocked. You can read more about it here.
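Sending requests less frequently can be configured in the project’s settings.py. The fragment below is a sketch that uses Scrapy’s built-in throttling settings. The specific values are illustrative assumptions and not recommendations from this blog, and the fragment does not cover IP rotation, which needs additional tooling.

```python
# settings.py (fragment): throttle the crawl; all values are illustrative assumptions

# Wait about 3 seconds between requests to the same site
DOWNLOAD_DELAY = 3
# Vary each wait randomly around DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True
# Send at most one request at a time to any single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
```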
Moreover, you can use proxy servers, which protect your home IP from being blocked while scraping Amazon reviews. With Datahut as your web-scraping partner, you will never worry about such issues.