By Sandra P

Scraping Amazon Best Seller Data using Python: A Step-by-Step Guide



Amazon is one of the biggest retailers in the world. With millions of products and millions of customers, it's no wonder that retailers everywhere are trying to figure out how to tap into Amazon's market.


One way you can do this is by taking advantage of their bestseller charts. These charts show which products have been selling well on Amazon, so you can use them as a guide for what products to stock in your own store.


The only problem? Amazon doesn't make it easy for you to get the data behind these charts—at least not by yourself. They've got a whole team dedicated to collecting it!

Also, manually collecting and analyzing this data can be time-consuming and tedious.


But with just a little bit of Python knowledge and some elbow grease, we can do it ourselves. By using Python, we can automate the process of scraping Amazon's best-seller data and quickly extract the data we need.


In this tutorial, we'll teach you how to scrape Amazon's bestseller data using Python.

We will walk through scraping and analyzing data on the products in the Computers & Accessories category of the Amazon Best Sellers site; the same process can be used for any other category. Scraping the Amazon Best Sellers list can show you which products are currently in high demand, which features are popular among customers, and what the average price range is for different types of products.



Scraping process

We can use web scraping tools and techniques to gather data on the most popular and highly rated products in this category. The Amazon site is a dynamic website that adapts to each individual user by serving personalized content, so different users visiting the same page may see different content based on their preferences and needs.

To extract information from a dynamic website with Python, one approach is to drive a real browser with Selenium, optionally in headless mode (that is, without a graphical interface). This lets you navigate and interact with a website as a user would, programmatically controlling the browser and automating the scraping of dynamic content.


How to scrape data from Amazon Best Sellers using Python


Using Python packages like BeautifulSoup and Selenium, we can scrape Amazon Best Seller data from the Amazon website for the category Computers & Accessories.


To use Selenium and Python for web scraping, you need to install both the Selenium package and a web driver. The web driver is a software tool that enables Selenium to control and access a web browser. There are different web drivers available for different types of browsers, including Chrome, Firefox, and Safari. Once you have installed the Selenium package and a web driver, you can begin using them to scrape data from websites.


Data Attributes


The attributes or features we intend to gather from the website are:


  • Product URL: The link to the product.

  • Ranking: The rank of the product within the overall list of best-selling products in the Computers & Accessories category on Amazon.

  • Product Name: The name of the product.

  • Brand: The brand name of the product.

  • Price (in Dollars): The price of the product in dollars.

  • Number of Ratings: The number of ratings the product has received.

  • Star Rating: The star rating the product has received.

  • Size: The size of the product.

  • Color: The color of the product.

  • Hardware Interface: The hardware interface of the product.

  • Compatible Devices: Other devices that are compatible with the product.

  • Connectivity Technology: The technology by which the product connects.

  • Connector Type: The type of connector.

  • Data Transfer Rate: The rate at which the product transfers data.

  • Mounting Type: The method used to attach the product.

  • Special Features: Any additional features the product has.

  • Date First Available: The date the product was first made available for purchase on Amazon.


Importing Required Libraries


To start our process of scraping Amazon Best Seller data, we will need to import a number of libraries that will enable us to interact with the website and extract the information we need.


These libraries must be installed before they can be imported. If any of them are missing, you can install them with the pip command. The following code imports all of them so that you can start using them in your script:
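
For reference, the third-party packages used in this tutorial can be installed with pip (the names below are the PyPI package names, which differ slightly from the import names):

```shell
pip install pandas beautifulsoup4 selenium webdriver-manager
```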


# Importing libraries
import time
import random

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Silence pandas' SettingWithCopyWarning for the .iloc assignments below
pd.options.mode.chained_assignment = None

  • The time library: a Python library that provides various time-related functions. It lets you get the current time, convert between time representations, and perform other time-related operations.

  • The random library: a library that provides random number generators and related functions. It lets you generate random numbers, select random elements from a list, shuffle lists, and perform other random operations.

  • The pandas library: a powerful and widely used open-source Python library for data manipulation and analysis. It provides data structures and analysis tools for handling tabular and time series data.

  • The BeautifulSoup module from the bs4 library: a Python library for pulling data out of HTML and XML files. It lets developers parse and navigate HTML and XML documents in a readable and efficient way.

  • The Selenium library: a powerful tool for automating web browsers and performing browser automation. It works with all major browsers and operating systems.

  • The webdriver module from the Selenium library: a package for interacting with a web browser that lets you automate browser actions, such as clicking buttons, filling out forms, and navigating pages.

  • Extensions of the webdriver module, such as Keys and By, which provide additional classes and methods for interacting with web pages in more complex ways.


In order to control a web browser and interact with the Amazon Best Sellers website using Selenium, we must first create an instance of the web driver. The following code can be used to create a Selenium browser instance and specify which browser you want to use:


driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Writing Functions


Defining functions as reusable pieces of code can help make your code more readable and maintainable. By creating functions, you can make your script more organized and easier to understand. It allows you to break down your script into smaller, more manageable pieces and focus on one specific task at a time. Additionally, user-defined functions can be reused throughout your script, helping to reduce the amount of redundant code and make your script more efficient.


Function to introduce delays:


In order to avoid overwhelming the website with too many requests in a short period of time, it can be useful to introduce random delays between requests. One way to achieve this is by using a function that will pause the execution of the next piece of code for a random number of seconds between 3 and 10. This function can be called whenever a delay is needed in the script, allowing us to add variety to our requests and make them less predictable. By adding random delay, we can make sure that our script is less likely to be detected and blocked or throttled by the website. This function can be written as follows:


# Function to delay some process
def delay():
    time.sleep(random.randint(3, 10))
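
Note that random.randint(3, 10) is inclusive on both ends, so each pause lasts anywhere from 3 to 10 whole seconds. A quick sanity check of the bounds (without actually sleeping):

```python
import random

# Sample the same expression delay() uses and confirm it stays in range
for _ in range(100):
    pause = random.randint(3, 10)  # inclusive of both 3 and 10
    assert 3 <= pause <= 10
print('all sampled pauses fall within [3, 10] seconds')
```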

Function to deal with lazy loading:


When scraping data from dynamic websites, we may run into the issue of lazy loading, where additional content is not loaded until it is required by the user.


The Amazon Best Sellers page for the category of Computers & Accessories displays a list of around 100 ranked products that are split into two pages, each showing 50 products. However, when the page first loads, only around 30 products are visible, with the remaining products being loaded later on as the user scrolls through the page. This is known as lazy loading.


To ensure that all of the data on the page can be accessed, we can use the Keys class from the webdriver module to scroll down the page, which forces the page to load the additional content.


One way to do this is to send the page-down key to the browser repeatedly; each key press scrolls the page further down, triggering the loading of new products. We can use the following function:


# Scrolling down the page in order to overcome Lazy Loading
def lazy_loading():
    element = driver.find_element(By.TAG_NAME, 'body')
    count = 0
    while count < 20:
        element.send_keys(Keys.PAGE_DOWN)
        delay()
        count += 1

Function to fetch the links and ranks of products:


The function starts by calling BeautifulSoup on the content variable, which contains the source code of the webpage, and passing 'html.parser' as the parser to use. Then it uses the find method to find the first element that matches the div tag with an attribute "class" and the value "p13n-desktop-grid". This element should contain all the product sections on the page.


Then it uses the find_all method to find all the div elements that have an attribute id with the value "gridItemRoot", each element representing a single product on the page.


For each product section, the function finds all a tags with the attribute tabindex="-1" and checks each link's href. If the href already starts with 'https:', it is appended to the product_links list as-is; otherwise, 'https://www.amazon.com' is prepended to form a valid URL before appending.


The function also appends the rank of each product to the ranking list, using the find method to locate the span tag with the class "zg-bdg-text" and reading its text content via the .text attribute.


# Function to fetch the product links of products
def fetch_product_links_and_ranks():
    content = driver.page_source
    homepage_soup = BeautifulSoup(content, 'html.parser')

    all_products = homepage_soup.find('div', attrs={"class": "p13n-desktop-grid"})
    for product_section in all_products.find_all('div', {'id': 'gridItemRoot'}):
        for product_link in product_section.find_all('a',{'tabindex':'-1'}):
            if product_link['href'].startswith('https:'):
                product_links.append(product_link['href'])
            else:
                product_links.append('https://www.amazon.com' + product_link['href'])
        ranking.append(product_section.find('span',{'class': 'zg-bdg-text'}).text)
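
The href-normalization step in this function can be exercised on its own. A minimal sketch, using hypothetical hrefs of the two shapes the best-seller grid can contain:

```python
# Hypothetical hrefs as they might appear in the best-seller grid
hrefs = [
    'https://www.amazon.com/dp/B0EXAMPLE1',   # already absolute
    '/Some-Product/dp/B0EXAMPLE2/ref=zg_bs',  # site-relative
]

product_links = []
for href in hrefs:
    if href.startswith('https:'):
        product_links.append(href)          # keep absolute links as-is
    else:
        product_links.append('https://www.amazon.com' + href)  # make relative links absolute

print(product_links[1])
# → https://www.amazon.com/Some-Product/dp/B0EXAMPLE2/ref=zg_bs
```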
        

Function to extract the page content:


This function uses Selenium's WebDriver to obtain the HTML source code of the current webpage, and then parses the HTML with BeautifulSoup using the 'html.parser' parser.


# Function to extract content of the page
def extract_content(url):
    driver.get(url)
    page_content = driver.page_source
    product_soup = BeautifulSoup(page_content, 'html.parser')
    return product_soup

Function to extract product name:


This function, extract_product_name(), uses the BeautifulSoup library to extract the name of a product.


The function uses the find method to locate the first div element with the id "titleSection", which should contain the product name. It then reads the element's text attribute and calls strip() to remove leading and trailing whitespace, and assigns the result to the 'product name' column of the data frame at the current product index.


If the try block fails, the product name is not available, and the function assigns the string 'Product name not available' instead.


# Function to extract product name
def extract_product_name(soup):
    try:
        name_of_product = soup.find('div', attrs={"id": "titleSection"}).text.strip()
        data['product name'].iloc[product] = name_of_product

    except:
        name_of_product = 'Product name not available'
        data['product name'].iloc[product] = name_of_product

Function to extract brand name:


This code defines a function called extract_brand() to extract the brand name of a product.


The function first uses find to locate the first a element with the id "bylineInfo", which should contain the brand name. It reads the element's text, splits it on ':' to drop the leading label, and strips the surrounding whitespace before assigning the result to the 'brand' column of the data frame at the current product index.


If the try block fails, the brand name is not at the first location, so the function checks a second location: it uses find_all to collect the tr elements with the class "a-spacing-small po-brand". If any are found, it takes the first one's text, strips it, splits it on spaces, and takes the last element as the brand name, which it assigns to the 'brand' column.


If the brand appears at neither location, the brand data is not available, and the function assigns the string 'Brand data not available' to the 'brand' column of the data frame at the current product index.


# Function to extract brand name
def extract_brand(soup):
    try:
        brand = soup.find('a', attrs={"id": "bylineInfo"}).text.split(':')[1].strip()  #one location where brand data could be found
        data['brand'].iloc[product] = brand

    except:
        if soup.find_all('tr', attrs={'class': 'a-spacing-small po-brand'}):  #other location where brand data could be found
            brand = soup.find_all('tr', attrs={'class': 'a-spacing-small po-brand'})[0].text.strip().split(' ')[-1]
            data['brand'].iloc[product] = brand
        else:
            brand = 'Brand data not available'
            data['brand'].iloc[product] = brand
            

Function to extract price:


The function extract_price() extracts the price of a product.


It uses find to locate the first span element with the class "a-price a-text-price a-size-medium apexPriceToPay", which should contain the price. It reads the element's text, splits it on '$' to separate the amount from the currency symbol, and takes the last element of the result, which contains the price. This value is assigned to the 'price(in dollar)' column of the data frame at the current product index.


If the try block fails, the price is not available, and the string 'Price data not available' is assigned instead.


# Function to extract price
def extract_price(soup):
    try:
        price = soup.find('span', attrs={"class": "a-price a-text-price a-size-medium apexPriceToPay"}).text.split('$')[
            -1]
        data['price(in dollar)'].iloc[product] = price

    except:
        price = 'Price data not available'
        data['price(in dollar)'].iloc[product] = price
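
The split('$') step can be checked in isolation. Assuming the price element's text looks like '$29.99' (a hypothetical value):

```python
# Hypothetical text of the apexPriceToPay span
price_text = '$29.99'

# Splitting on '$' separates the currency symbol from the amount;
# the last element holds the numeric part
price = price_text.split('$')[-1]
print(price)
# → 29.99
```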

Function to extract size:


The function extract_size() extracts the size of a product.


It uses find to locate the first span element with the id "inline-twister-expanded-dimension-text-size_name", which should contain the size. It reads the element's text and calls strip() to remove leading and trailing whitespace, then assigns the result to the 'size' column of the data frame at the current product index.


If the try block fails, the size is not available, and the string 'Size data not available' is assigned instead.


# Function to extract size
def extract_size(soup):
    try:
        size = soup.find('span', attrs={"id": "inline-twister-expanded-dimension-text-size_name"}).text.strip()
        data['size'].iloc[product] = size

    except:
        size = 'Size data not available'
        data['size'].iloc[product] = size
        

Function to extract star rating:


This code defines a function called extract_star_rating() to extract the star rating of a product.


The function initialises the variable star to None and loops over two class names under which the star rating could be found: 'a-icon a-icon-star a-star-4-5' and 'a-icon a-icon-star a-star-5'.


For each class name, it uses find_all to collect the matching i elements into stars. It then loops through stars, takes the first token of each element's text, and, if that token is non-empty, stores it in star and breaks out of the loops. If the first location yields nothing, it checks the next one.


Finally, the function writes the value of star to the 'star rating' column of the data frame at the current product index; if an exception is raised, 'Star rating data not available' is written instead.


# Function to extract star rating
def extract_star_rating(soup):
    try:
        star = None
        # Two class names under which the star icon may appear
        for star_rating_location in ['a-icon a-icon-star a-star-4-5', 'a-icon a-icon-star a-star-5']:
            stars = soup.find_all('i', attrs={"class": star_rating_location})
            for star_element in stars:
                star = star_element.text.split(' ')[0]
                if star:
                    break
            if star:
                break
        if not star:  # no rating found at either location
            star = 'Star rating data not available'

    except:
        star = 'Star rating data not available'

    data['star rating'].iloc[product] = star
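
The text of the star icon typically reads like '4.5 out of 5 stars', so taking the first space-separated token isolates the rating. A standalone check under that assumption:

```python
# Hypothetical text of a star-rating icon element
star_text = '4.5 out of 5 stars'

# The first token of the text is the numeric rating
star = star_text.split(' ')[0]
print(star)
# → 4.5
```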

Function to extract the number of ratings:


The function extract_num_of_ratings() extracts the number of ratings of a product.


It uses find to locate the first span element with the id "acrCustomerReviewText", which should contain the number of ratings. It reads the element's text, splits it on spaces, and takes the first element of the result, which contains the count. This value is assigned to the 'number of ratings' column of the data frame at the current product index.


If the try block fails, the number of ratings is not available, and the string 'Number of rating not available' is assigned instead.


# Function to extract number of ratings
def extract_num_of_ratings(soup):
    try:
        num_of_ratings = soup.find('span', attrs={"id": "acrCustomerReviewText"}).text.split(' ')[0]
        data['number of ratings'].iloc[product] = num_of_ratings

    except:
        num_of_ratings = 'Number of rating not available'
        data['number of ratings'].iloc[product] = num_of_ratings
        

Function to extract color:


The function extract_color() extracts the color of a product.


It uses find to locate the first tr element with the class "a-spacing-small po-color", which should contain the color. It reads the element's text, strips it, splits it on a double space to separate the value from the 'Color' label, and strips the second element of the result to get the color. This value is assigned to the 'color' column of the data frame at the current product index.


If the try block fails, the color is not available, and the string 'Color not available' is assigned instead.


# Function to extract color
def extract_color(soup):
    try:
        color = soup.find('tr', attrs={'class': 'a-spacing-small po-color'}).text.strip().split('  ')[1].strip()
        data['color'].iloc[product] = color

    except:
        color = 'Color not available'
        data['color'].iloc[product] = color
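
This double-space split drives most of the attribute extractors below, since the po-* table rows render as a label and a value separated by a run of whitespace. A quick sketch on hypothetical row text:

```python
# Hypothetical stripped text of a 'po-color' table row: label, then value
row_text = 'Color  Midnight Black'

# Splitting on a double space separates the label from the value
color = row_text.strip().split('  ')[1].strip()
print(color)
# → Midnight Black
```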
        

Function to extract hardware interface:


This code defines a function called extract_hardware_interface() to extract the hardware interface of a product.


The function uses find to locate the first tr element with the class "a-spacing-small po-hardware_interface", which should contain the hardware interface. It reads the element's text, strips it, splits it on a double space to separate the value from its label, and strips the second element of the result to get the hardware interface. This value is assigned to the 'hardware interface' column of the data frame at the current product index.


If the try block fails, the hardware interface is not available, and the string 'Hardware interface data not available' is assigned instead.


# Function to extract hardware interface
def extract_hardware_interface(soup):
    try:
        hardware_interface = \
        soup.find('tr', attrs={"class": "a-spacing-small po-hardware_interface"}).text.strip().split('  ')[1].strip()
        data['hardware interface'].iloc[product] = hardware_interface

    except:
        hardware_interface = 'Hardware interface data not available'
        data['hardware interface'].iloc[product] = hardware_interface

Function to extract compatible devices:


The function extract_compatible_devices() extracts the devices that are compatible with a product.


It follows the same pattern as the other attribute extractors: find the first tr element with the class "a-spacing-small po-compatible_devices", strip its text, split it on a double space to drop the label, and strip the second element of the result. This value is assigned to the 'compatible devices' column of the data frame at the current product index.


If the try block fails, the compatible devices are not available, and the string 'Compatible devices data not available' is assigned instead.


# Function to extract compatible devices
def extract_compatible_devices(soup):
    try:
        compatible_devices = \
        soup.find('tr', attrs={"class": "a-spacing-small po-compatible_devices"}).text.strip().split('  ')[1].strip()
        data['compatible devices'].iloc[product] = compatible_devices

    except:
        compatible_devices = 'Compatible devices data not available'
        data['compatible devices'].iloc[product] = compatible_devices
        

Function to extract data transfer rate:


The function extract_data_transfer_rate() extracts the data transfer rate of a product.


It uses find to locate the first tr element with the class "a-spacing-small po-data_transfer_rate", strips its text, splits it on a double space to drop the label, and strips the second element of the result to get the data transfer rate. This value is assigned to the 'data transfer rate' column of the data frame at the current product index.


If the try block fails, the data transfer rate is not available, and the string 'Data transfer rate data not available' is assigned instead.


# Function to extract data transfer rate
def extract_data_transfer_rate(soup):
    try:
        data_transfer_rate = \
        soup.find('tr', attrs={"class": "a-spacing-small po-data_transfer_rate"}).text.strip().split('  ')[1].strip()
        data['data transfer rate'].iloc[product] = data_transfer_rate

    except:
        data_transfer_rate = 'Data transfer rate data not available'
        data['data transfer rate'].iloc[product] = data_transfer_rate
        

Function to extract mounting type:


This code defines a function called extract_mounting_type() to extract the mounting type of a product.


The function uses find to locate the first tr element with the class "a-spacing-small po-mounting_type", strips its text, splits it on a double space to drop the label, and strips the second element of the result to get the mounting type. This value is assigned to the 'mounting type' column of the data frame at the current product index.


If the try block fails, the mounting type is not available, and the string 'Mounting type data not available' is assigned instead.


# Function to extract mounting type
def extract_mounting_type(soup):
    try:
        mounting_type = soup.find('tr', attrs={"class": "a-spacing-small po-mounting_type"}).text.strip().split('  ')[
            1].strip()
        data['mounting type'].iloc[product] = mounting_type

    except:
        mounting_type = 'Mounting type data not available'
        data['mounting type'].iloc[product] = mounting_type
        

Function to extract special features:


The function extract_special_features() extracts the special features of a product.


It uses find to locate the first tr element with the class "a-spacing-small po-special_feature", strips its text, splits it on a double space to drop the label, and strips the second element of the result to get the special features. This value is assigned to the 'special features' column of the data frame at the current product index.


If the try block fails, the special features are not available, and the string 'Special features data not available' is assigned instead.


# Function to extract special features
def extract_special_features(soup):
    try:
        special_feature = \
        soup.find('tr', attrs={"class": "a-spacing-small po-special_feature"}).text.strip().split('  ')[1].strip()
        data['special features'].iloc[product] = special_feature

    except:
        special_feature = 'Special features data not available'
        data['special features'].iloc[product] = special_feature
        

Function to extract connectivity technology:


The function extract_connectivity_technology() extracts the connectivity technology of a product.


It uses find to locate the first tr element with the class "a-spacing-small po-connectivity_technology", strips its text, splits it on a double space to drop the label, and strips the second element of the result to get the connectivity technology. This value is assigned to the 'connectivity technology' column of the data frame at the current product index.


If the try block fails, the connectivity technology is not available, and the string 'Connectivity technology data not available' is assigned instead.


# Function to extract connectivity technology
def extract_connectivity_technology(soup):
    try:
        connectivity_technology = \
        soup.find('tr', attrs={"class": "a-spacing-small po-connectivity_technology"}).text.strip().split('  ')[
            1].strip()
        data['connectivity technology'].iloc[product] = connectivity_technology

    except:
        connectivity_technology = 'Connectivity technology data not available'
        data['connectivity technology'].iloc[product] = connectivity_technology
        

Function to extract connector type:


This code defines a function called extract_connector_type() to extract the connector type of a product.


The function starts by using the find method to find the first element that matches the tr tag with an attribute "class" and the value "a-spacing-small po-connector_type". This element should contain the connector type of the product. Then it uses the text attribute to get the text in the element, calls strip() to remove any leading and trailing whitespaces, calls split(' ') on the result to separate the connector type from the text before it, and finally takes the second element from the result and calls strip() to remove any leading and trailing whitespaces which will contain the connector type of the product. Then it assigns the connector type of the product to the 'connector type' column of data frame in the product index.


If the try block fails, the connector type is not available, and the string 'Connector type data not available' is written to the 'connector type' column of the data frame at the current product index instead.


# Function to extract connector type
def extract_connector_type(soup):
    try:
        # The row text looks like "Connector Type  USB"; split on the
        # double space between label and value
        connector_type = soup.find('tr', attrs={"class": "a-spacing-small po-connector_type"}).text.strip().split('  ')[
            1].strip()
        data.loc[product, 'connector type'] = connector_type

    except Exception:
        # Row not found on this product page: record a placeholder instead
        data.loc[product, 'connector type'] = 'Connector type data not available'
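The per-attribute extractors above differ only in the tr class and the destination column, so they could be folded into one generic helper. The sketch below is a hypothetical refactor, not code from the article; it returns the extracted value so the caller decides where to store it:

```python
def extract_po_row(soup, css_class, fallback):
    # Generic extractor for Amazon's product-overview rows.
    # css_class -- full class string of the <tr>, e.g.
    #              "a-spacing-small po-connector_type"
    # fallback  -- value returned when the row is missing or malformed
    try:
        # Row text looks like "Connector Type  USB"; split on the
        # double space between label and value
        return soup.find('tr', attrs={"class": css_class}).text.strip().split('  ')[1].strip()
    except Exception:
        return fallback
```

With this helper, each extractor collapses to a single assignment, e.g. `data.loc[product, 'connector type'] = extract_po_row(product_content, "a-spacing-small po-connector_type", 'Connector type data not available')`.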
        

Function to extract date first available:


This function, extract_date_first_available(), extracts the date on which a product first became available on Amazon.


The function uses the find_all method to collect all the th elements with class "a-color-secondary a-size-base prodDetSectionEntry" and all the td elements with class "a-size-base prodDetAttrValue". Together these hold the key-value pairs of the product details table. The function then loops through product_details_keys, checking whether each key's text contains 'Date First Available'.

When 'Date First Available' is found, the corresponding value is read into the variable date_first_available, which is then written to the 'date first available' column of the data frame at the current product index.


If the try block fails, the date is not available, and the string 'Date first available data not available' is written to the 'date first available' column of the data frame at the current product index instead.


# Function to extract date first available
def extract_date_first_available(soup):
    try:
        product_details_keys = soup.find_all('th', attrs={"class": "a-color-secondary a-size-base prodDetSectionEntry"})
        product_details_values = soup.find_all('td', attrs={"class": "a-size-base prodDetAttrValue"})
        for detail_key in range(len(product_details_keys)):
            if 'Date First Available' in product_details_keys[detail_key].text:
                # The key and value lists are not always aligned on Amazon's
                # detail pages; try an offset of -2 first, then fall back to
                # the same index if the result does not look like a year
                date_first_available = product_details_values[detail_key - 2].text
                if '20' not in date_first_available:
                    date_first_available = product_details_values[detail_key].text
        data.loc[product, 'date first available'] = date_first_available

    except Exception:
        data.loc[product, 'date first available'] = 'Date first available data not available'
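The -2 offset above is a workaround for the key and value lists drifting out of alignment on some pages. An alternative (a hypothetical sketch, not the article's code) is a small lookup helper that pairs keys and values index-for-index, assuming the two lists line up:

```python
def find_detail(keys, values, label):
    # Return the value whose key text contains `label`, or None.
    # `keys` and `values` are parallel lists of elements exposing a .text
    # attribute (e.g. BeautifulSoup tags); this assumes they line up
    # index-for-index, which is not guaranteed on every product page.
    for i, key in enumerate(keys):
        if label in key.text and i < len(values):
            return values[i].text.strip()
    return None
```

A call such as `find_detail(product_details_keys, product_details_values, 'Date First Available')` would then return either the date string or None when the row is absent.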
        

Fetching the product URLs


Now that all the necessary functions have been defined, it is time to execute them in sequence to complete the task.


The code starts by initialising two empty lists, product_links and ranking, which will later be used to store the links to the products and their ranking respectively.


It then uses a for loop to iterate over the two result pages across which the products are split. Inside the loop, it builds a variable start_url, a string containing the URL of the page to scrape, and navigates to that URL with Selenium's get() method.


After that, it calls the lazy_loading() function which is used to overcome lazy loading and to load all products before extracting links. Finally, it calls the function fetch_product_links_and_ranks() which extracts the links of the products and their ranking from the HTML source code of the page.


The product links and rankings are thereby appended to their respective lists: the links will be used to navigate to the individual product pages for data extraction, and the rankings to record each product's position on the best-seller list.


# Fetching the product links of all items
product_links = []
ranking = []
for page in range(1, 3):               # iterate over the 2 pages the products are split across
    start_url = f'https://www.amazon.com/Best-Sellers-Computers-Accessories/zgbs/pc/ref=zg_bs_pg_{page}?_encoding=UTF8&pg={page}'
    driver.get(start_url)
    lazy_loading()                     # scroll to overcome lazy loading
    fetch_product_links_and_ranks()    # fetch the links and ranks of the products
     

Data frame initialisation


We can create a dictionary whose keys are the column names, each initialised to an empty list, and use it to create a Pandas data frame named "data". Once the data frame is created, we populate it with the information already collected by assigning the product_links list to the 'product url' column and the ranking list to the 'ranking' column.


# Creating a dictionary of the required columns
data_dic = {'product url': [],'ranking': [], 'brand': [], 'product name': [],
            'number of ratings': [], 'size': [], 'star rating': [], 'price(in dollar)': [], 'color': [],
            'hardware interface': [], 'compatible devices': [], 'connectivity technology': [], 'connector type': [], 'data transfer rate':[], 'mounting type': [], 'special features':[], 'date first available':[]}


# Creating a data frame with those columns
data = pd.DataFrame(data_dic)


# Assigning the scraped links and rankings to the columns 'product url' and 'ranking'
data['product url'] = product_links
data['ranking'] = ranking

Extraction of required features


For each link in the 'product url' column, the content of the page is obtained by calling the extract_content function. The relevant extraction functions are then called one by one to pull out each required field from the page.


for product in range(len(data)):
    product_url = data['product url'].iloc[product]
    product_content = extract_content(product_url)

    # brands
    extract_brand(product_content)

    # product_name
    extract_product_name(product_content)

    # price
    extract_price(product_content)

    # size
    extract_size(product_content)

    # star rating
    extract_star_rating(product_content)

    # number of ratings
    extract_num_of_ratings(product_content)

    # color
    extract_color(product_content)

    # hardware interface
    extract_hardware_interface(product_content)

    # compatible devices
    extract_compatible_devices(product_content)

    # data transfer rate
    extract_data_transfer_rate(product_content)

    # mounting type
    extract_mounting_type(product_content)

    # special features
    extract_special_features(product_content)

    # connectivity technology
    extract_connectivity_technology(product_content)

    # connector type
    extract_connector_type(product_content)

    # date first available
    extract_date_first_available(product_content)
    

Saving the data into a CSV file


The data frame that has been generated is then saved as a CSV file for later use or further analysis.


# saving the resultant data frame as a csv file
# index=False keeps the data frame's row numbers out of the file
data.to_csv('amazon_best_sellers.csv', index=False)

Insights from the scraped data


Now that we have our data ready, we can use it to perform various types of analyses, and extract meaningful insights from it. Some insights that I could infer are:


  • The top 100 best-selling products are priced between $5 and $900. Grouping them into price bands shows that the majority fall into the "Budget-friendly" band, priced under $180, so we can assume affordability is a major factor driving popularity. Most of the remaining products sit in the "Premium" band, priced above $720, and are largely from well-known brands such as Apple and Acer, suggesting their popularity comes from brand value. No products fall into the "Expensive" band at all.


  • The products on the bestseller list have star ratings between 4.3 and 4.8, indicating that none of them have a rating below average. The majority of the products have ratings in the range of 4.4 to 4.6.


  • The number of ratings the products have received ranges from 255 to 999,497.
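Such a price grouping can be reproduced with a few lines of pandas. In the sketch below, the $180 and $720 cut-offs come from the analysis above, while the "Mid-range" label and the summarise helper are illustrative assumptions:

```python
import pandas as pd

def price_category(price):
    # Bucket a price using the thresholds discussed above; the
    # "Mid-range" name for the middle band is an assumption
    if price < 180:
        return 'Budget-friendly'
    elif price <= 720:
        return 'Mid-range'
    else:
        return 'Premium'

def summarise(csv_path='amazon_best_sellers.csv'):
    # Load the scraped data, then summarise price bands and star ratings
    data = pd.read_csv(csv_path)
    data['price category'] = data['price(in dollar)'].apply(price_category)
    return data['price category'].value_counts(), data['star rating'].describe()
```

Calling `summarise()` on the saved CSV returns the count of products per price band alongside descriptive statistics for the star ratings.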



Conclusion


In this blog, we have demonstrated how to use Selenium and BeautifulSoup to scrape Amazon Best Seller data from the Amazon web page for the category of Computers and Accessories. We have collected various features of the products in the best-sellers list, such as ranking, product name, brand name, star rating, price, connector type, and date first available. This data can be used to gain insights into market trends, pricing strategies, and customer preferences.


Additionally, by automating the process of data collection, we can easily monitor these trends over time. This is a great way to stay informed about your competition and make data-driven decisions for your own business. With the help of web scraping and data analysis, you can now make more informed decisions with the data.


Ready to boost your business with accurate Amazon data? Look no further! Contact Datahut, your trusted web scraping experts, today and let us help you extract the data you need to make informed decisions.


