
How to Scrape Product Data from Amazon US?

  • Writer: Shahana Farvin

Introduction


Ever tried shopping for vlogging equipment on Amazon? It's overwhelming. You've got thousands of microphones, cameras, and tripods to choose from, and manually comparing them all would take forever. That's exactly why I built this web scraping system - to automatically collect and organize all that product data so you can actually make informed decisions.


This project shows you how to build a complete two-phase scraping system that systematically extracts vlogging equipment data from Amazon. We're talking about transforming scattered product information into a clean, structured database that you can actually analyze. The first phase quickly collects all the product URLs we need, while the second phase dives deep into each product page to extract detailed information like prices, ratings, and specifications.


I chose this two-phase approach because it's more reliable and efficient than trying to do everything at once. If something goes wrong during the detailed extraction, you haven't lost all your URL collection work. Plus, it's much easier to handle Amazon's complex JavaScript-heavy pages when you break the process into focused chunks.


We'll be using modern Python tools that can handle real web applications - Playwright for browser automation, BeautifulSoup for parsing HTML, and SQLite for data storage. The end result is a professional-grade scraping system that respects Amazon's servers while giving you the data you need. Whether you're researching equipment purchases, analyzing market trends, or just learning advanced scraping techniques, this guide will show you exactly how to build something that actually works in the real world.


Phase 1: Collecting Amazon Product URLs for Vlogging Gadgets


Welcome to the first phase of our Amazon web scraping adventure! This script is all about gathering product URLs from Amazon search results, specifically targeting vlogging equipment like microphones, cameras, and tripods.


Think of this as the preliminary mission before the main event. We're systematically collecting every product URL we can find across multiple categories of vlogging gear. It's like walking through Amazon's virtual aisles and writing down the location of every interesting product we see.


The beauty of this approach is that once we have all the URLs stored in our database, we can take our time with the detailed scraping in phase two. We're building a solid foundation that will make the next phase much more efficient and organized.


Setting Up Our Tools


Let's start with the imports and basic setup. Every web scraping project needs its essential tools, and ours is no different.

import sqlite3
import random
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from playwright.sync_api import sync_playwright

# Categories and search URLs
categories = {
    "microphone": "https://www.amazon.com/s?k=microphones+for+vlogging",
    "camera": "https://www.amazon.com/s?k=camera+for+vlogging",
    "tripod": "https://www.amazon.com/s?k=tripod+for+vlogging"
}

We're using several libraries here, each with a specific purpose. SQLite handles our database storage - it's like having a filing cabinet where we can organize all our collected URLs. BeautifulSoup is our HTML parser, helping us extract information from web pages. The urllib.parse module helps us work with URLs properly.


Playwright is our browser automation tool. Unlike simple HTTP requests, Playwright controls an actual browser, which means it can handle JavaScript and behave more like a real person browsing Amazon. This is crucial because Amazon's pages rely heavily on JavaScript.


The categories dictionary defines what we're looking for. Each category has a search URL that takes us directly to Amazon's search results for that type of vlogging equipment.


Loading User Agents for Stealth


Web scraping requires being respectful and avoiding detection. One way to do this is by rotating user agents - the identifier that tells websites what browser and device you're using.

def load_user_agents(file_path="user_agents.txt"):
    """
    Load user agent strings from a text file for browser rotation.
    
    This function reads a text file containing user agent strings (one per line)
    and returns them as a list. User agent rotation helps avoid detection by
    making requests appear to come from different browsers/devices.
    
    Args:
        file_path (str, optional): Path to the text file containing user agents.
                                 Defaults to "user_agents.txt".
    
    Returns:
        list: A list of user agent strings with empty lines filtered out.
    
    Raises:
        FileNotFoundError: If the specified file doesn't exist.
        IOError: If there's an error reading the file.
    
    """
    with open(file_path, "r") as f:
        return [line.strip() for line in f if line.strip()]

user_agents = load_user_agents()

This function reads a text file containing different user agent strings. Each line represents a different browser or device. When we make requests, we randomly pick one of these identifiers, making our scraper appear to come from different sources. It's like changing your disguise each time you visit a store.


The function filters out empty lines and returns a clean list of user agents. We load these once at the start of our script and use them throughout the scraping process.
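
If you don't already have a user_agents.txt file, here's the kind of content it expects - one complete user agent string per line. These particular strings are only illustrative; grab current ones from your own browser or a public list and the loader above will pick them up unchanged.

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0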


Database Setup


Before we start collecting URLs, we need somewhere to store them. We're using SQLite, which is perfect for this kind of project because it doesn't require a separate server.

def setup_database():
    """
    Initialize SQLite database and create the products_url table.
    
    Creates a SQLite database file named "amazon_vlogging.db" and sets up
    the products_url table to store scraped product information. The table
    has an auto-incrementing ID, category field, and URL field with unique
    constraint to prevent duplicates.
    
    Returns:
        tuple: A tuple containing (connection, cursor) objects for database operations.
               - connection (sqlite3.Connection): Database connection object
               - cursor (sqlite3.Cursor): Database cursor for executing queries
    
    Table Schema:
        - id: INTEGER PRIMARY KEY AUTOINCREMENT
        - category: TEXT (product category like 'microphone', 'camera', etc.)
        - url: TEXT UNIQUE (product page URL, duplicates ignored)
  
    """
    conn = sqlite3.connect("amazon_vlogging.db")
    cur = conn.cursor()

    cur.execute('''
        CREATE TABLE IF NOT EXISTS products_url (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            category TEXT,
            url TEXT UNIQUE
        )
    ''')

    conn.commit()
    return conn, cur

This function creates our database file and sets up a table called products_url. Think of this table as a spreadsheet with three columns: an ID number that increases automatically, the category (like "microphone" or "camera"), and the actual URL.


The UNIQUE constraint on the URL column is important - it prevents us from storing the same product URL twice. If we try to insert a duplicate URL, SQLite will simply ignore it, which saves us from having to check for duplicates manually.
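
To see that constraint in action, here's a minimal sketch - assuming the products_url table above already exists and this (hypothetical) URL isn't stored yet - that inserts the same URL twice. The second insert is silently skipped:

import sqlite3

conn = sqlite3.connect("amazon_vlogging.db")
cur = conn.cursor()

url = "https://www.amazon.com/dp/B000EXAMPLE"  # hypothetical product URL
cur.execute("INSERT OR IGNORE INTO products_url (category, url) VALUES (?, ?)", ("microphone", url))
cur.execute("INSERT OR IGNORE INTO products_url (category, url) VALUES (?, ?)", ("microphone", url))
conn.commit()

cur.execute("SELECT COUNT(*) FROM products_url WHERE url = ?", (url,))
print(cur.fetchone()[0])  # prints 1 - the duplicate was ignored
conn.close()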


Extracting Product URLs from Search Results


Now comes the heart of our URL collection process. Amazon's search results pages contain links to individual products, and we need to extract all of them.

def extract_product_urls(html):
    """
    Extract product page URLs from Amazon search results HTML.
    
    Parses the HTML content of an Amazon search results page and extracts
    all product page URLs. Uses BeautifulSoup to find product image links
    which contain the product page URLs as href attributes.
    
    Args:
        html (str): Raw HTML content of an Amazon search results page.
    
    Returns:
        list: A list of unique, fully-qualified product page URLs.
              Duplicates are automatically removed.
    
    Technical Details:
        - Targets: span[data-component-type="s-product-image"] > a elements
        - Converts relative URLs to absolute URLs using Amazon's base URL
        - Removes duplicate URLs using set() conversion
    
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.select('span[data-component-type="s-product-image"] > a'):
        partial = tag.get("href")
        if partial:
            full_url = urljoin("https://www.amazon.com", partial)
            links.append(full_url)
    return list(set(links))  # remove duplicates

This function takes the raw HTML of a search results page and finds all the product links. We're specifically looking for links inside product image spans - these are the clickable product images that take you to individual product pages.


Amazon often uses relative URLs (like "/dp/B08ABC123"), so we use urljoin to convert them into complete URLs. The function returns a list of unique URLs by converting to a set and back to a list, which automatically removes any duplicates we might have found on the same page.
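
Here's a quick illustration of what urljoin does with one of those relative links (the ASIN is made up):

from urllib.parse import urljoin

partial = "/dp/B08ABC123/ref=sr_1_1"  # hypothetical href pulled from a search result
print(urljoin("https://www.amazon.com", partial))
# https://www.amazon.com/dp/B08ABC123/ref=sr_1_1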


Setting ZIP Code for Consistent Results


Amazon shows different prices and availability based on your location. To get consistent results, we need to set a specific ZIP code before scraping.

def apply_zip(page, zip_code="56901"):
    """
    Set delivery ZIP code on Amazon using browser automation.
    
    Navigates to Amazon's homepage and programmatically sets the delivery
    ZIP code through the location selector interface. This affects product
    availability, pricing, and shipping options in search results.
    
    Args:
        page (playwright.sync_api.Page): Playwright page object for browser automation.
        zip_code (str, optional): ZIP code to set for delivery location.
                                Defaults to "56901".
    
    Returns:
        None
    
    Raises:
        Exception: Catches and prints any errors during ZIP code setting process.
                  Script continues execution even if ZIP setting fails.
    
    Process Flow:
        1. Navigate to Amazon homepage
        2. Click on location selector (glow-ingress-line2)
        3. Fill ZIP code input field
        4. Submit ZIP code update
        5. Handle optional confirmation dialog
    
    """
    try:
        page.goto("https://www.amazon.com", timeout=60000)
        page.wait_for_selector("span#glow-ingress-line2", timeout=10000)
        page.click("span#glow-ingress-line2")

        page.wait_for_selector("input#GLUXZipUpdateInput", timeout=10000)
        page.fill("input#GLUXZipUpdateInput", zip_code)
        page.click("#GLUXZipUpdate > span > input")
        page.wait_for_timeout(3000)

        try:
            page.click("span.a-button-inner > input[name='glowDoneButton']")
        except:
            pass

        print(f"📍 ZIP code set to {zip_code}")
    except Exception as e:
        print("❌ Failed to set ZIP:", e)

This function navigates to Amazon's homepage and simulates clicking on the location selector. It then fills in the ZIP code field and submits the form. The process mimics what you'd do manually when changing your delivery location on Amazon.


We use try-except blocks because the ZIP code setting process can vary slightly depending on your account status or Amazon's interface changes. If something goes wrong, we print an error but continue with the scraping process.


The Main Scraping Function


Now we bring everything together in our main scraping function. This is where the magic happens - we systematically go through each category and collect all the product URLs.

def scrape_urls_with_zip():
    """
    Main scraping function that collects product URLs from Amazon search results.
    
    This is the primary orchestration function that coordinates the entire scraping
    process. It sets up the database, configures the browser with ZIP code settings,
    and systematically scrapes product URLs from multiple categories across all
    available pages.
    
    Returns:
        None: Results are saved directly to the SQLite database.
    
    Process Overview:
        1. Initialize database connection and cursor
        2. Launch Playwright browser with random user agent
        3. Set ZIP code for consistent pricing/availability
        4. Iterate through each product category
        5. For each category, scrape all pages of search results
        6. Extract and store product URLs in database
        7. Handle pagination automatically
        8. Clean up browser resources
    
    Features:
        - Automatic pagination handling
        - Random delays between requests (3-5 seconds)
        - User agent rotation for anti-detection
        - Cookie persistence across category scraping
        - Duplicate URL prevention via database constraints
        - Comprehensive error handling and logging
    
    Database Operations:
        - Creates products_url table if not exists
        - Inserts URLs with category labels
        - Uses INSERT OR IGNORE to prevent duplicates
        - Commits after each page to prevent data loss
    
    Anti-Detection Measures:
        - Random user agent selection
        - Variable delays between requests
        - Cookie preservation
        - Realistic browsing patterns

    Raises:
        Exception: Various exceptions may occur during scraping (network issues,
                  page structure changes, etc.). Most are caught and logged
                  without stopping the entire process.
    """
    conn, cur = setup_database()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        user_agent = random.choice(user_agents)
        context = browser.new_context(user_agent=user_agent, locale="en-US")
        page = context.new_page()

        apply_zip(page, zip_code="56901")
        cookies = context.cookies()
        page.close()
        context.close()

We start by setting up our database connection and launching a browser. The headless=False parameter means we can see the browser window as it works - helpful for debugging and understanding what's happening.


After setting the ZIP code, we save the browser cookies. These cookies contain our location preference and session information, which we'll reuse for each category to maintain consistency.

        for category, url in categories.items():
            print(f"\n🔍 Scraping category: {category}")
            context = browser.new_context(user_agent=user_agent, locale="en-US")
            context.add_cookies(cookies)
            page = context.new_page()

            page.goto(url, timeout=60000)
            page.wait_for_timeout(random.uniform(3000, 5000))

            all_urls = set()

For each category, we create a fresh browser context but add our saved cookies. This gives us a clean slate while maintaining our location settings. We navigate to the search URL and wait a random amount of time between 3-5 seconds. This random delay makes our scraper behave more like a human user.

            while True:
                html = page.content()
                product_urls = extract_product_urls(html)
                print(f"✅ Found {len(product_urls)} product URLs on this page")
                all_urls.update(product_urls)

                for link in product_urls:
                    try:
                        cur.execute("INSERT OR IGNORE INTO products_url (category, url) VALUES (?, ?)", (category, link))
                    except Exception as e:
                        print("⚠️ DB Insert Error:", e)
                conn.commit()

The main scraping loop processes each page of search results. We extract the product URLs, add them to our running set of URLs, and save them to the database. The INSERT OR IGNORE statement means duplicate URLs won't cause errors - they'll simply be skipped.

                try:
                    next_button = page.query_selector("a.s-pagination-next")
                    if next_button:
                        next_href = next_button.get_attribute("href")
                        if next_href:
                            next_url = urljoin("https://www.amazon.com", next_href)
                            print("➡️ Going to next page...")
                            page.goto(next_url)
                            page.wait_for_timeout(random.uniform(3000, 5000))
                            continue
                except:
                    pass

                print(f"🔚 No more pages in category: {category}")
                break

            print(f"📦 Total unique URLs collected for {category}: {len(all_urls)}")
            page.close()
            context.close()

        browser.close()
    conn.close()
    print("✅ Done! Product URLs saved to products_url table.")

To handle pagination, we look for the "Next" button on each page. If we find it, we extract its URL and navigate to the next page. If there's no next button or we encounter an error, we assume we've reached the end of the search results for that category.


Running the Scraper


The final piece ties everything together with a simple entry point:

if __name__ == "__main__":
    """
    Entry point for the Amazon product URL scraper.
    
    Executes the main scraping function when the script is run directly.
    This allows the script to be imported as a module without automatically
    starting the scraping process.
    
    Usage:
        python amazon_scraper.py
    
    Prerequisites:
        - user_agents.txt file with user agent strings
        - Required packages: playwright, beautifulsoup4 (sqlite3 ships with Python)
        - Playwright browser binaries installed
    """
    scrape_urls_with_zip()

This is a Python convention that ensures our scraping function only runs when the script is executed directly, not when it's imported as a module. It's like having a main function that kicks off the entire process.


What We've Built


This script creates a systematic approach to collecting product URLs from Amazon. It handles the complexity of browser automation, manages database storage, and respects Amazon's systems by using realistic delays and user agent rotation.


The end result is a SQLite database filled with product URLs, organized by category. Each URL represents a potential vlogging gadget that we can analyze further in the second phase of our project.


The scraper is designed to be robust, handling errors gracefully and providing clear feedback about its progress. It's also respectful of Amazon's servers, using appropriate delays and behaving like a real user would.


This foundation sets us up perfectly for the next phase, where we'll visit each collected URL and extract detailed product information like prices, ratings, and descriptions.


Phase 2: Extracting Detailed Product Information


Now that we have all our product URLs safely stored in the database, it's time for the main event - extracting detailed information from each product page. This phase takes our collection of URLs and transforms them into a rich dataset of product details.


Think of phase one as creating a map of all the stores we want to visit, and phase two as actually going into each store and carefully examining the products. We'll gather prices, ratings, product specifications, and everything else that makes each product unique.


Setting Up for Detail Extraction


Our second script starts with familiar territory - the same imports and user agent loading we used before, but now we're focusing on data extraction rather than URL collection.

import sqlite3
import time
import random
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# Load user agents
def load_user_agents(file_path="user_agents.txt"):
    """
    Load user agent strings from a text file for browser rotation.
    
    This function reads a text file containing user agent strings (one per line)
    and returns them as a list. User agent rotation helps avoid detection by
    making requests appear to come from different browsers/devices during scraping.
    
    Args:
        file_path (str, optional): Path to the text file containing user agents.
                                 Defaults to "user_agents.txt".
    
    Returns:
        list: A list of user agent strings with empty lines filtered out.
    
    Raises:
        FileNotFoundError: If the specified file doesn't exist.
        IOError: If there's an error reading the file.
    
    """
    with open(file_path, "r") as f:
        return [line.strip() for line in f if line.strip()]

user_agents = load_user_agents()

Everything remains the same - we still need BeautifulSoup for parsing HTML, Playwright for browser automation, and our user agents for stealth.


Expanding Our Database Schema


This phase requires a more advanced database setup. We need to track our scraping progress and store much more detailed information about each product.

def setup_database():
    """
    Initialize SQLite database and create/modify tables for product detail scraping.
    
    Sets up the database schema for storing detailed product information scraped
    from Amazon product pages. Creates new tables and modifies existing ones as needed.
    Adds a 'scraped' column to the existing products_url table to track scraping progress.
    
    Returns:
        tuple: A tuple containing (connection, cursor) objects for database operations.
               - connection (sqlite3.Connection): Database connection object
               - cursor (sqlite3.Cursor): Database cursor for executing queries
    
    Database Schema Created/Modified:
        products_url table:
            - Adds 'scraped' column (INTEGER DEFAULT 0) to track processing status
        
        product_details table:
            - id: INTEGER PRIMARY KEY AUTOINCREMENT
            - category: TEXT (product category)
            - url: TEXT (product page URL)
            - title: TEXT (product title/name)
            - price: TEXT (current price)
            - original_price: TEXT (original/list price before discount)
            - discount: TEXT (discount percentage or amount)
            - details: TEXT (JSON string of product specifications)
            - rating: TEXT (customer rating)
        
        error_urls table:
            - id: INTEGER PRIMARY KEY AUTOINCREMENT
            - url: TEXT UNIQUE (URLs that failed to scrape)
            - error: TEXT (error message description)
    """
    conn = sqlite3.connect("amazon_vlogging.db")
    cur = conn.cursor()

    cur.execute("PRAGMA table_info(products_url)")
    if "scraped" not in [col[1] for col in cur.fetchall()]:
        print("🔧 Adding 'scraped' column...")
        cur.execute("ALTER TABLE products_url ADD COLUMN scraped INTEGER DEFAULT 0")

First, we add a "scraped" column to our existing products_url table. This acts like a checklist - we can mark each URL as processed so we don't waste time scraping the same product twice. The PRAGMA command lets us check what columns already exist before trying to add new ones.

    cur.execute('''
        CREATE TABLE IF NOT EXISTS product_details (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            category TEXT,
            url TEXT,
            title TEXT,
            price TEXT,
            original_price TEXT,
            discount TEXT,
            details TEXT,
            rating TEXT
        )
    ''')

    cur.execute('''
        CREATE TABLE IF NOT EXISTS error_urls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE,
            error TEXT
        )
    ''')

We create two new tables. The product_details table stores all the valuable information we'll extract from each product page - titles, prices, ratings, and specifications. The error_urls table keeps track of any URLs that fail to scrape, along with the error messages. This helps us debug problems and retry failed URLs later.
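
When you're ready to retry the failures, a small sketch like this - assuming the schema above - re-queues every URL recorded in error_urls by resetting its scraped flag, so the detail scraper will pick it up on the next run:

import sqlite3

conn = sqlite3.connect("amazon_vlogging.db")
cur = conn.cursor()

# Re-queue every URL that previously failed, then clear its error record
cur.execute("""
    UPDATE products_url
    SET scraped = 0
    WHERE url IN (SELECT url FROM error_urls)
""")
cur.execute("DELETE FROM error_urls")
conn.commit()
conn.close()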


Building Our HTML Parsing Arsenal


The heart of phase two is a collection of specialized parsing functions. Each function knows how to extract one specific piece of information from Amazon's product pages.

def parse_title(soup):
    """
    Extract product title from Amazon product page HTML.
    
    Searches for the main product title element using Amazon's standard
    product title selector. The title is typically displayed prominently
    at the top of the product page.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML.
    
    Returns:
        str or None: Product title text with whitespace stripped, or None if not found.
    
    Technical Details:
        - Target selector: "span#productTitle"
        - Uses get_text(strip=True) to clean whitespace
    
    """
    tag = soup.select_one("span#productTitle")
    return tag.get_text(strip=True) if tag else None

def parse_price(soup):
    """
    Extract current price from Amazon product page HTML.
    
    Searches for the current/sale price element in Amazon's pricing display.
    This is typically the prominently displayed price that customers see.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML.
    
    Returns:
        str or None: Current price text (e.g., "$99.99"), or None if not found.
    
    Technical Details:
        - Target selector: Complex selector for core price display
        - Looks for screen reader accessible price text
        - May include currency symbols and formatting
    
    """
    tag = soup.select_one("#corePriceDisplay_desktop_feature_div > div.a-section.a-spacing-none.aok-align-center.aok-relative > span.aok-offscreen")
    return tag.get_text(strip=True) if tag else None

def parse_original_price(soup):
    """
    Extract original/list price from Amazon product page HTML.
    
    Searches for the original price (list price) before any discounts.
    This price is typically crossed out or shown smaller when there's a sale.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML.
    
    Returns:
        str or None: Original price text (e.g., "$129.99"), or None if not found.
    
    Technical Details:
        - Target selector: Complex nested selector for basis price
        - Often displayed as strikethrough text
        - Only visible when product is on sale
    
    """
    tag = soup.select_one("#corePriceDisplay_desktop_feature_div > div.a-section.a-spacing-small.aok-align-center > span > span.aok-relative > span.a-size-small.a-color-secondary.aok-align-center.basisPrice > span > span.a-offscreen")
    return tag.get_text(strip=True) if tag else None

These functions use CSS selectors to pinpoint exactly where Amazon displays different pieces of information. The selectors look complex, but they're just very specific addresses that tell us exactly where to find each piece of data on the page.


Amazon often hides the actual price text in elements with the "aok-offscreen" class - these are invisible to users but accessible to screen readers and our scrapers. This is why we target these specific elements rather than the visible price displays.
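
Those long selectors are inherently brittle - Amazon reshuffles its markup fairly often. If parse_price starts coming back with None, one fallback you might try (this looser selector is an assumption, not part of the original script) is to grab any offscreen price inside the desktop price block:

def parse_price_with_fallback(soup):
    # Try the precise offscreen selector first, then fall back to any
    # offscreen price inside the desktop price block. Both selectors are
    # assumptions and may need adjusting if Amazon changes its markup.
    tag = soup.select_one(
        "#corePriceDisplay_desktop_feature_div span.aok-offscreen"
    ) or soup.select_one("#corePriceDisplay_desktop_feature_div span.a-offscreen")
    return tag.get_text(strip=True) if tag else None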

def parse_discount(soup):
    """
    Extract discount percentage from Amazon product page HTML.
    
    Searches for the discount percentage or savings amount displayed
    when a product is on sale. Usually shown as a percentage off.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML.
    
    Returns:
        str or None: Discount text (e.g., "-31%"), or None if not found.
    
    Technical Details:
        - Target selector: Price savings percentage element
        - Typically displays percentage or dollar amount saved
        - Only visible when product has active discount
    
    """
    tag = soup.select_one("#corePriceDisplay_desktop_feature_div > div.a-section.a-spacing-none.aok-align-center.aok-relative > span.a-size-large.a-color-price.savingPriceOverride.aok-align-center.reinventPriceSavingsPercentageMargin.savingsPercentage")
    return tag.get_text(strip=True) if tag else None

def parse_rating(soup):
    """
    Extract customer rating from Amazon product page HTML.
    
    Searches for the average customer rating typically displayed
    as stars or numerical rating near the product title.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML.
    
    Returns:
        str or None: Rating text (e.g., "4.5 out of 5 stars"), or None if not found.
    
    Technical Details:
        - Target selector: "#acrPopover > span.a-declarative > a > span"
        - May include star rating and text description
        - Located in the product overview section
    
    """
    tag = soup.select_one("#acrPopover > span.a-declarative > a > span")
    return tag.get_text(strip=True) if tag else None

The discount and rating functions work the same way - they look for specific elements that Amazon uses to display this information. Not every product has a discount or rating, so we return None when these elements don't exist.


Extracting Product Specifications


One of the most valuable parts of product data is the specifications table that Amazon displays for most products. This contains detailed information like brand, model, dimensions, and features.

def extract_table_data(html):
    """
    Extract structured product specifications from Amazon's product info table.
    
    Parses Amazon's standard product information table to extract key-value pairs
    of product specifications such as brand, model, dimensions, features, etc.
    Handles truncated values by preferring full text when available.
    
    Args:
        html (str): Raw HTML content containing the product specifications table.
    
    Returns:
        List[Dict[str, str]]: A list of dictionaries where each dictionary contains
                             a single key-value pair representing a product specification.
    
    Technical Details:
        - Targets table with class 'a-normal a-spacing-micro'
        - Extracts key from 'a-span3' class cells
        - Extracts value from 'a-span9' class cells
        - Prioritizes full text from 'a-truncate-full' spans over truncated text
        - Handles cases where full value is hidden behind "Show more" functionality
    
    Table Structure:
        - Keys are typically: Brand, Model, Color, Dimensions, Weight, etc.
        - Values can be text, measurements, or feature descriptions
        - Some values may be truncated with expandable "Show more" options
 
    Returns:
        Empty list if no table found or parsing fails.
    """
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='a-normal a-spacing-micro')
    data = []

    if not table:
        return data

    rows = table.find_all('tr')
    for row in rows:
        key_td = row.find('td', class_='a-span3')
        value_td = row.find('td', class_='a-span9')

        if key_td and value_td:
            key = key_td.get_text(strip=True)

            # Prefer full hidden value if available
            full_value_span = value_td.find('span', class_='a-truncate-full')
            if full_value_span:
                value = full_value_span.get_text(strip=True)
            else:
                value = value_td.get_text(strip=True)

            data.append({key: value})

    return data

This function tackles Amazon's product specifications table, which has a standard structure but some tricky aspects. Amazon sometimes truncates long specification values with a "Show more" button. We handle this by looking for the hidden full text first, then falling back to the visible truncated text if needed.


The function returns a list of dictionaries, where each dictionary contains one specification. For example, we might get [{"Brand": "Sony"}, {"Model": "XYZ-123"}, {"Weight": "2.5 pounds"}].
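
If you'd rather work with a single dictionary than a list of one-key dictionaries, flattening it is a one-liner - a small sketch using the example values above:

specs = [{"Brand": "Sony"}, {"Model": "XYZ-123"}, {"Weight": "2.5 pounds"}]

flat = {key: value for item in specs for key, value in item.items()}
print(flat)
# {'Brand': 'Sony', 'Model': 'XYZ-123', 'Weight': '2.5 pounds'}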


Orchestrating the Data Extraction


The parse_product_info function brings all our individual parsing functions together into one comprehensive extraction process.

def parse_product_info(html):
    """
    Comprehensive product information parser for Amazon product pages.
    
    Orchestrates the extraction of all relevant product information from
    an Amazon product page HTML by calling individual parsing functions
    and combining the results into a structured dictionary.
    
    Args:
        html (str): Complete HTML content of an Amazon product page.
    
    Returns:
        dict: Dictionary containing all extracted product information with keys:
            - title (str or None): Product title
            - price (str or None): Current price
            - original_price (str or None): Original/list price before discount
            - discount (str or None): Discount percentage or amount
            - rating (str or None): Customer rating
            - details (List[Dict]): List of product specification dictionaries
    
    Data Processing:
        - Creates BeautifulSoup object for HTML parsing
        - Calls individual parsing functions for each data element
        - Combines results into unified data structure
        - Handles missing elements gracefully (returns None for missing data)
    
    """
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": parse_title(soup),
        "price": parse_price(soup),
        "original_price": parse_original_price(soup),
        "discount": parse_discount(soup),
        "rating": parse_rating(soup),
        "details" : extract_table_data(html)
    }

This function takes the raw HTML of a product page and returns a dictionary containing all the information we could extract. It creates one BeautifulSoup object and passes it to all the parsing functions, making the process efficient and organized.
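
To make that structure concrete, here's the kind of dictionary you'd get back for a product that's on sale (the title is invented; the other values mirror the examples from the docstrings above):

{
    "title": "Wireless Lavalier Microphone for Vlogging",
    "price": "$99.99",
    "original_price": "$129.99",
    "discount": "-31%",
    "rating": "4.5 out of 5 stars",
    "details": [{"Brand": "Sony"}, {"Model": "XYZ-123"}]
}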


Setting ZIP Code for Consistent Results


Amazon shows different prices and availability based on your location. To get consistent results, we need to set a specific ZIP code before scraping, so we reuse the same apply_zip function from Phase 1.

def apply_zip(page, zip_code="56901"):
    """
    Set delivery ZIP code on Amazon using browser automation.
    
    Navigates to Amazon's homepage and programmatically sets the delivery
    ZIP code through the location selector interface. This affects product
    availability, pricing, and shipping options throughout the session.
    The ZIP code setting persists through cookies for subsequent requests.
    
    Args:
        page (playwright.sync_api.Page): Playwright page object for browser automation.
        zip_code (str, optional): ZIP code to set for delivery location.
                                Defaults to "56901".
    
    Returns:
        None
    
    Raises:
        Exception: Catches and prints any errors during ZIP code setting process.
                  Script continues execution even if ZIP setting fails.
    
    Process Flow:
        1. Navigate to Amazon homepage with 60-second timeout
        2. Wait for and click location selector (glow-ingress-line2)
        3. Wait for ZIP code input field to appear
        4. Fill ZIP code input field with provided value
        5. Submit ZIP code update form
        6. Wait 3 seconds for processing
        7. Handle optional "Continue" confirmation dialog
        8. Print success/failure message
    
    Side Effects:
        - Sets location cookies that affect pricing and availability
        - Changes default shipping location for the browser session
        - May trigger location-based content personalization
    
    Note:
        The ZIP code affects product availability, pricing, tax calculations,
        and shipping options. Using a consistent ZIP code across scraping
        sessions ensures data consistency.
    """
    try:
        page.goto("https://www.amazon.com", timeout=60000)
        page.wait_for_selector("span#glow-ingress-line2", timeout=10000)
        page.click("span#glow-ingress-line2")

        page.wait_for_selector("input#GLUXZipUpdateInput", timeout=10000)
        page.fill("input#GLUXZipUpdateInput", zip_code)
        page.click("#GLUXZipUpdate > span > input")
        page.wait_for_timeout(3000)

        # If "Continue" button appears after ZIP, click it
        try:
            page.click("span.a-button-inner > input[name='glowDoneButton']")
        except:
            pass

        print(f"📍 ZIP code set to {zip_code}")
    except Exception as e:
        print("❌ Failed to set ZIP:", e)

The Main Scraping Orchestration


Now we reach the conductor of our data extraction orchestra - the main scraping function that coordinates everything.

def scrape_with_zip_zipcode():
    """
    Main orchestration function for scraping detailed product information from Amazon.
    
    This is the primary function that coordinates the entire product detail scraping
    process. It retrieves unscraped product URLs from the database, sets up browser
    automation with location-specific settings, and systematically scrapes detailed
    product information from each URL.
    
    Returns:
        None: Results are saved directly to the SQLite database tables.
    
    Process Overview:
        1. Initialize database connection and check for unscraped URLs
        2. Launch Playwright browser with random user agent
        3. Set ZIP code (56901) for consistent location-based pricing
        4. Save cookies for reuse across product page visits
        5. Iterate through each unscraped product URL
        6. Extract comprehensive product information
        7. Store results in product_details table
        8. Mark URLs as scraped to prevent reprocessing
        9. Handle and log errors for failed URLs
        10. Clean up browser resources
    
    Database Operations:
        - Queries products_url table for unscraped entries (scraped = 0)
        - Inserts detailed product data into product_details table
        - Logs failed URLs and error messages in error_urls table
        - Updates scraped flag to 1 for processed URLs
        - Commits after each product to prevent data loss
    
    Error Handling:
        - Catches exceptions during page loading and parsing
        - Logs errors with URL and exception details
        - Continues processing remaining URLs after failures
        - Stores failed URLs for later investigation
    
    Anti-Detection Measures:
        - Random user agent selection for each session
        - Cookie persistence to maintain session state
        - Random delays (3-5 seconds) between requests
        - Consistent ZIP code for location-based consistency
        - Realistic browsing patterns with proper timeouts
    
    Data Extracted Per Product:
        - Product title and description
        - Current price and original price
        - Discount information
        - Customer ratings
        - Detailed product specifications table
        - Category classification
        - Product page URL
    
    Performance Considerations:
        - Processes URLs sequentially to avoid overwhelming Amazon's servers
        - Uses browser context reuse with cookie persistence
        - Implements proper timeouts for page loading
        - Includes random delays for realistic browsing simulation
    
    
    Prerequisites:
        - Existing products_url table with URLs to scrape
        - user_agents.txt file with user agent strings
        - Playwright browser binaries installed
        - Stable internet connection for Amazon access
    
    Raises:
        Various exceptions may occur during scraping:
        - Network timeouts and connection errors
        - Page structure changes breaking selectors
        - Database operation errors
        - Browser automation failures
        
        Most exceptions are caught and logged without stopping the process.
    """
    conn, cur = setup_database()
    cur.execute("SELECT id, category, url FROM products_url WHERE scraped = 0")
    products = cur.fetchall()
    print(f"🔄 Found {len(products)} unscraped URLs")

We start by connecting to our database and finding all the URLs that haven't been scraped yet. This is where our "scraped" column comes in handy - we can easily resume scraping from where we left off if the process gets interrupted.

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        user_agent = random.choice(user_agents)
        context = browser.new_context(
            user_agent=user_agent,
            locale="en-US"
        )

        page = context.new_page()
        apply_zip(page, zip_code="56901")

        # Save cookies to reuse for each product
        cookies = context.cookies()
        page.close()
        context.close()

Just like in phase one, we set up our browser with a random user agent and configure the ZIP code. The key difference is that we save the cookies after setting the ZIP code and then close the initial browser context. We'll reuse these cookies for each product page, maintaining our location settings without having to reset the ZIP code every time.

        for pid, category, url in products:
            print(f"🔍 Scraping: {url}")
            context = browser.new_context(
                user_agent=user_agent,
                locale="en-US"
            )
            context.add_cookies(cookies)
            page = context.new_page()

            try:
                page.goto(url, timeout=60000)
                page.wait_for_timeout(3000)
                html = page.content()
                product = parse_product_info(html)

                if not product["title"]:
                    raise Exception("Missing title")

For each product URL, we create a fresh browser context but add our saved cookies. This gives us the benefits of a clean slate while maintaining our ZIP code settings. We navigate to the product page, wait a moment for everything to load, and then extract all the product information.


The check for a missing title is our quality control - if we can't even find a product title, something is probably wrong with the page, and we should treat it as an error.

                cur.execute('''
                    INSERT INTO product_details (category, url, title, price, original_price, discount, details, rating)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                ''', (category, url, product["title"], product["price"], product["original_price"], product["discount"], str(product["details"]), product["rating"]))
                print(f"✅ {product['title'][:60]}")
            except Exception as e:
                print(f"❌ Error scraping {url}: {e}")
                cur.execute("INSERT OR IGNORE INTO error_urls (url, error) VALUES (?, ?)", (url, str(e)))

            cur.execute("UPDATE products_url SET scraped = 1 WHERE id = ?", (pid,))
            conn.commit()

            page.close()
            context.close()
            time.sleep(random.uniform(3, 5))

        browser.close()
    conn.close()
    print("✅ Finished scraping all with ZIP 56901")

When extraction succeeds, we save all the product details to our database. The specifications are stored as a string representation of the list of dictionaries - we can parse this back into structured data when we need to analyze it later.
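
When you want those specifications back as structured data, ast.literal_eval does the job - json.loads won't, because str() produces single-quoted Python syntax rather than JSON. A minimal sketch, assuming the product_details table above:

import ast
import sqlite3

conn = sqlite3.connect("amazon_vlogging.db")
cur = conn.cursor()

cur.execute("SELECT title, details FROM product_details LIMIT 1")
row = cur.fetchone()
if row:
    title, details_str = row
    specs = ast.literal_eval(details_str)  # back to a list of dicts
    print(title, specs)
conn.close()

If you'd rather avoid that round trip entirely, storing json.dumps(product["details"]) instead of str(...) is the more portable choice.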


If something goes wrong, we log the error and the URL that failed. Either way, we mark the URL as scraped so we don't try to process it again. We commit the database changes immediately to avoid losing progress if the script crashes.


Finally, we close the browser context and wait a random amount of time before moving to the next product. This random delay makes our scraping pattern less detectable and more respectful of Amazon's servers.


Running the Complete System


The entry point ties everything together with a simple execution guard:

if __name__ == "__main__":
    """
    Entry point for the Amazon product detail scraper.
    
    Executes the main scraping function when the script is run directly.
    This allows the script to be imported as a module without automatically
    starting the scraping process.
    
    Usage:
        python amazon_detail_scraper.py
    
    Prerequisites:
        - Existing amazon_vlogging.db with products_url table populated
        - user_agents.txt file with user agent strings  
        - Required packages: playwright, beautifulsoup4 (sqlite3 ships with Python)
        - Playwright browser binaries installed (playwright install)
    
    Expected Workflow:
        1. Run URL collection script first to populate products_url table
        2. Run this script to scrape detailed product information
        3. Check product_details table for results
        4. Review error_urls table for any failed scraping attempts
    """
    scrape_with_zip_zipcode()

This ensures that our scraping function only runs when we execute the script directly, not when it's imported as a module.


The Complete Picture


By the end of this phase, we have transformed our collection of product URLs into a comprehensive database of product information. Each product now has detailed pricing, ratings, specifications, and categorization - everything we need for meaningful analysis of the vlogging equipment market.


The two-phase approach gives us flexibility and robustness. We can collect URLs in bulk when Amazon's servers are responsive, then take our time with the detailed extraction. If something goes wrong during detail extraction, we haven't lost our URL collection work.


Our database now contains a rich dataset ready for analysis, visualization, or any other insights we want to extract from Amazon's vlogging equipment marketplace. The structured approach makes it easy to extend the system with additional data points or to adapt it for different product categories.
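
As a quick taste of that analysis, here's a small sketch that counts how many products landed in each category, straight from the database we just built:

import sqlite3

conn = sqlite3.connect("amazon_vlogging.db")
cur = conn.cursor()

cur.execute("""
    SELECT category, COUNT(*) AS products
    FROM product_details
    GROUP BY category
    ORDER BY products DESC
""")
for category, count in cur.fetchall():
    print(f"{category}: {count} products")
conn.close()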


Conclusion


We've built something pretty impressive here. What started as a simple idea to make vlogging equipment research easier turned into a complete data extraction system that can handle thousands of products across multiple categories. Our two-phase approach proved its worth - we can collect URLs quickly and then take our time with the detailed extraction, handling errors gracefully along the way.


The database we created contains everything you'd want to know about vlogging equipment: prices, discounts, ratings, detailed specifications, and proper categorization. But the real value here isn't just the data - it's the system we built to get it. This architecture can be adapted for any kind of product research, price monitoring, or market analysis you might need.


Throughout this project, we maintained ethical scraping practices with proper delays and respectful request patterns. The code is modular and well-documented, making it easy to extend or modify for different use cases. We've demonstrated how to handle complex e-commerce sites, manage large-scale data extraction, and build fault-tolerant systems that can recover from interruptions.


The techniques you've learned here - browser automation, robust error handling, database design for scraping, and anti-detection strategies - are valuable skills that apply to countless other data extraction projects. Whether you're tracking competitor prices, monitoring product availability, or building recommendation systems, this foundation gives you everything you need to succeed.


What's next? You could set up scheduled runs to track price changes over time, build visualization dashboards to spot trends, or even expand the system to scrape multiple e-commerce sites for comprehensive market coverage. The structured data you now have opens up endless possibilities for analysis and insights.


The best part about this project is that it solves a real problem while teaching professional-grade techniques. You're not just extracting data - you're building sustainable systems that provide lasting value. That's the difference between amateur scraping and the kind of work that actually matters in the real world.


AUTHOR


I’m Shahana, a Data Engineer at Datahut, where I specialize in building smart, scalable data pipelines that transform messy web data into structured, usable formats—especially in domains like retail, e-commerce, and competitive intelligence.


At Datahut, we help businesses across industries gather valuable insights by automating data collection from websites, even those that rely on JavaScript and complex navigation. In this blog, I’ve walked you through a real-world project where we created a robust web scraping workflow to collect product information efficiently using Playwright, BeautifulSoup, and SQLite. Our goal was to design a system that handles dynamic pages, pagination, and data storage—while staying lightweight, reliable, and beginner-friendly.


If your team is exploring ways to extract structured product or pricing data at scale—or if you're just curious how web scraping can support smarter decisions—feel free to connect with us using the chat widget on the right. We’re always excited to share ideas and build custom solutions around your data needs.


FAQs


1. What is Amazon product data scraping?

Amazon product data scraping is the process of automatically extracting product-related information such as titles, prices, reviews, ratings, sellers, and ASINs from Amazon’s product listings. It helps businesses analyze trends, monitor competitors, and make data-driven decisions.


2. Is it legal to scrape product data from Amazon US?

Scraping Amazon data for personal or research purposes is generally acceptable, but using automated bots to access data without Amazon’s permission may violate their terms of service. It’s best to use ethical and compliant scraping practices, such as scraping publicly available data responsibly or using Amazon’s official APIs.


3. What tools can I use to scrape Amazon product data?

Popular tools and libraries include Python’s BeautifulSoup, Scrapy, Selenium, and Playwright. These tools can automate data extraction from web pages and handle dynamic content effectively.


4. What kind of data can I extract from Amazon US?

You can extract data points such as:

  • Product title and description

  • Price and discounts

  • Ratings and reviews

  • Seller information

  • ASIN and category details

  • Availability and shipping options


5. Why should businesses scrape Amazon product data?

Businesses use Amazon product data scraping to:

  • Track competitor pricing and discounts

  • Identify top-performing products

  • Monitor customer sentiment

  • Analyze market demand

  • Optimize product listings and pricing strategies

Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
