
Scraping Amazon’s Menstrual Cup Data Using Playwright and curl-cffi: A Beginner-Friendly Guide to E-Commerce Product Analysis

  • Writer: Anusha P O
  • 48 min read

Menstrual cups are more than just a reusable alternative to pads or tampons—they represent convenience, sustainability, and personal health. On Amazon, one of the largest online marketplaces in the world, a wide range of menstrual cups is available, catering to different sizes, materials, and preferences. By scraping menstrual cup data from Amazon’s website, including product titles, brands, prices, and reviews, it is possible to uncover insights about which products are most popular, how buyers respond to different designs, and which features matter most to users. Much like exploring a curated store, this data reveals patterns in consumer choices and helps identify trends in the growing market for menstrual hygiene products. In the sections ahead, this blog walks through how the Amazon data was collected, what key details were extracted, and what the numbers reveal about menstrual cup usage and preferences across different buyers, giving a clear picture of this essential segment of women’s wellness products.


From URLs to Insights: The Two-Stage Scraping Process for Amazon Data


Before diving into analysis, the first step in working with Amazon’s menstrual cup dataset was to build a system that could reliably collect product links and then extract meaningful information from each page. Instead of treating scraping as a single step, I approached it as a small two-stage journey—first gathering every valid URL from the search results, and then visiting those pages to extract clean, structured data. This helped create a steady, well-organized workflow where each part supported the next, much like laying a strong foundation before building the rest of the structure.


  1. URL Collection From Amazon’s Menstrual Cup Section


When working with Amazon’s menstrual cup listings, the first challenge is not collecting product details—it’s simply finding all the product URLs hidden across the pages. Although Amazon doesn’t use endless scrolling in the same way some sites do, it still loads content dynamically, displays pagination that changes based on filters, and sometimes even reorders results during repeated visits. To handle this variability, an automated URL-scraping system was built using Python tools like Playwright, asyncio, and Playwright Stealth. These tools work together to open the Amazon results page, behave like a real visitor, and move through the listings one page at a time without raising suspicion. Since Playwright allows browser-level interactions, the script could scroll naturally, wait for elements to load, and extract the product link from each listing in a clean, structured manner.

Because real-world websites often introduce delays or network hiccups, the script was designed with gentle pauses using the random module to imitate human browsing. Meanwhile, logging quietly kept track of each step so I could revisit the process later if needed. Every link collected was stored inside a SQLite database, rather than a simple text file, making retrieval more reliable and reducing duplication. If a product appeared under multiple URLs—something that happens frequently on Amazon—the script checked the database first before saving anything new. By the end of this phase, I had a clear, well-organized set of product URLs ready for deeper analysis, almost like collecting the raw ingredients before beginning the actual recipe.


  2. Structured Data Extraction From Each Product Page


With a complete list of URLs ready, the next stage was to visit each product link individually and extract meaningful information from the page. This part required a slightly different approach because product pages on Amazon often load parts of their content asynchronously, and some sections appear only after certain scripts finish running. To ensure nothing important was missed, I used a combination of Playwright and curl_cffi, allowing the scraper to switch between a full browser and a lightweight request method depending on the situation. Playwright handled pages that needed dynamic loading, while curl_cffi provided speed on simpler pages—making the workflow flexible and efficient.
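
To make that switching logic a little more concrete, here is a minimal sketch of the idea. It assumes the curl_cffi fetcher shown later in this post (fetch_using_curl) and a hypothetical Playwright-based helper named fetch_using_playwright; the real script organizes its fetchers in its own way.

# Hedged sketch: try the lightweight HTTP path first, fall back to a full browser
from typing import Optional

async def fetch_html(url: str, ua: str) -> Optional[str]:
    # Fast path: curl_cffi request that impersonates Chrome (defined later in this post)
    html = fetch_using_curl(url, ua)          # returns None when it fails
    if html:
        return html
    # Slow path: render the page in Playwright for JavaScript-heavy content
    return await fetch_using_playwright(url, ua)   # hypothetical async helper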


Once the HTML was fetched, BeautifulSoup took charge of parsing the page, reading through titles, brand names, prices, ratings, reviews, descriptions, and the small but meaningful aspects such as quality, comfort, and ease of use. To keep the process clean, each product’s extracted data was stored as structured JSON, and also saved into SQLite so that no information was lost even if the program stopped unexpectedly. Tools like urllib.parse and urlparse helped decode and clean Amazon’s often complex URLs, while modules like re and json ensured that the scraped text could be shaped into clean, readable data ready for analysis.
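
As a rough illustration of that parsing step, the snippet below pulls a few core fields out of a fetched product page with BeautifulSoup. The CSS selectors (#productTitle, span.a-offscreen, #bylineInfo) are common on Amazon product pages but should be treated as assumptions: Amazon's markup changes often, and the actual script may target different elements.

# Hedged sketch of parsing a product page with BeautifulSoup
from typing import Dict, Optional
from bs4 import BeautifulSoup

def parse_basic_fields(html: str) -> Dict[str, Optional[str]]:
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("#productTitle")                   # product title block
    price = soup.select_one("span.a-price span.a-offscreen")   # displayed price
    brand = soup.select_one("#bylineInfo")                     # "Visit the ... Store" byline
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
        "brand": brand.get_text(strip=True) if brand else None,
    }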


This second phase felt like opening each product page one by one and carefully noting the details in a notebook, but done with the precision and consistency of automation. By the end, every menstrual cup product—filtered, cleaned, and structured—was ready for insights, comparisons, and further storytelling through data. Together, these steps form a complete cycle of intelligent scraping: discovering product URLs first, then harvesting the information they hold in a way that is reliable, human-like, and well-organized.


  3. Turning Messy Amazon Data into Meaningful Information with OpenRefine


When you first scrape data from a website like Amazon India (in this case, the menstrual cup search results), it feels rewarding—you’ve just gathered a large collection of product details, complete with titles, brands, prices, ratings, and customer impressions. But just like most real-world datasets, the raw output is rarely clean. It often arrives mixed with unrelated products, repeated entries, symbols that computers can’t interpret easily, and columns that need more structure before analysis becomes meaningful.


A good way to think about this is to imagine returning from a grocery store with a big basket of fruits. They look bright and colorful, but before you can actually eat them, you still need to wash, sort, and cut them. Data cleaning in OpenRefine works exactly like that—slowly transforming raw, uneven information into something neat, organized, and ready to work with. OpenRefine’s interface makes this process almost conversational, helping you clean step by step.


While cleaning the Amazon menstrual cup dataset, the first thing I noticed was that not every product in the results was actually a menstrual cup. Some listings were only for washers, some for sterilizers, and some even for pouches. These items may be related, but they would introduce noise into the analysis, so I filtered out those URLs right at the start. Another issue was duplication—the same product often appeared under different URLs, which is common on large marketplaces. OpenRefine’s clustering options helped identify and remove these duplicates so that only unique products remained.


Price cleaning was another important step. Amazon usually displays prices with the Indian Rupee symbol (₹) and commas, which look fine to us but make calculations complicated for machines. By removing symbols and formatting the values into plain numbers, the price column became clean, consistent, and ready for comparison. A similar approach was used for the brand data. Many product titles included brand names mixed with extra words, so I used OpenRefine to separate these into a structured brand column with clear, meaningful values.
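
For readers who prefer scripts to a GUI, the same idea can be expressed in a few lines of Python. This is only a rough scripted equivalent of the OpenRefine transform described above, not the exact expression used during cleaning.

# Hedged sketch: strip the ₹ symbol, commas, and stray spaces from a price string
import re
from typing import Optional

def clean_price(raw: str) -> Optional[float]:
    if not raw:
        return None
    digits = re.sub(r"[^\d.]", "", raw)   # keep only digits and the decimal point
    return float(digits) if digits else None

print(clean_price("₹1,299.00"))   # 1299.0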


One of the more interesting parts of the cleaning process involved the aspects column, which contains phrases like “quality: positive” or “ease of use: negative.” In their raw form, these look like scattered text, but by flattening them into a tidy structure—where each aspect becomes a clear field—it becomes much easier to analyze customer sentiment later. This step almost felt like unfolding a crumpled sheet of paper: the information was always there, just waiting to be organized properly.
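
A small sketch makes the flattening idea easier to picture. The semicolon-separated format below is an assumption about how the scraped aspects text might look; the real column may need a slightly different split.

# Hedged sketch: turn "aspect: sentiment" phrases into a tidy dictionary
from typing import Dict

def flatten_aspects(raw: str) -> Dict[str, str]:
    result = {}
    for part in raw.split(";"):
        if ":" in part:
            aspect, sentiment = part.split(":", 1)
            result[aspect.strip().lower()] = sentiment.strip().lower()
    return result

print(flatten_aspects("quality: positive; ease of use: negative"))
# {'quality': 'positive', 'ease of use': 'negative'}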


Data cleaning may not feel as exciting as scraping or visualizing trends, but it forms the foundation for trustworthy insights. Once the menstrual cup dataset was cleaned, the entire picture became clearer—now the analysis can focus on real patterns, genuine comparisons, and reliable conclusions, instead of being distracted by noise. In the end, OpenRefine doesn’t just clean the data; it brings out the story hidden inside it.



Essential Tools Behind the Scraper: The Python Libraries That Make Everything Work


When you look at a working scraper from the outside, it often seems simple—just run a script and data appears. But behind that small command lies a group of tools quietly doing their part, almost like a backstage team that keeps a theater production running smoothly. In this project, the scraper relied on a set of Python libraries that worked together in harmony, each one handling a specific responsibility. At the heart of the workflow was asyncio. It allowed the program to perform multiple tasks at once, so while one Amazon product page was loading, another could already be parsed. This created a natural sense of flow, especially when working with a large list of URLs. Paired with asyncio was Playwright, which acted as the browser window of our script. It opened pages, scrolled through dynamic content, and captured HTML—even on sections that would normally load only when a user interacts with the site. To make Playwright behave more like a human visitor, Playwright Stealth blended in small browser adjustments, reducing the chances of triggering Amazon’s anti-bot systems.


Once the HTML was fetched, BeautifulSoup stepped in as the gentle cleaner, reading through the raw markup and helping extract only the pieces we needed—titles, brands, prices, and aspects. But sometimes even Playwright can be too heavy for smaller requests, which is where curl_cffi offered speed, sending fast, lightweight HTTP calls whenever possible. Behind the scenes, tools like urllib.parse, urlparse, and unquote quietly ensured that every URL was decoded and formatted correctly. Data also needs a safe place to live, and sqlite3 provided exactly that—a simple, file-based database that works without any additional setup. It became our storage shelf where URLs and product records were saved in an organized, query-friendly manner. Throughout the process, logging acted like a diary, noting each success and failure, while random introduced small, natural-looking pauses to make the scraper feel less robotic. Even modules like re, Path, and contextmanager played their part, helping with text cleaning, file handling, and resource management in a clean and predictable way.


Together, these libraries created a balanced workflow: fast, steady, and resilient. None of them works alone; instead, they fit together like pieces of a puzzle, turning a simple script into a dependable scraper that can handle thousands of Amazon URLs without losing track. By understanding how each tool contributes, anyone—from an intern to a beginner—can appreciate the invisible structure that holds a real-world scraping project together.


STEP 1: Gathering All Product Links from Amazon’s Menstrual Cup Category


Importing Libraries

# IMPORTS

import asyncio
import sqlite3
import random
import logging
import re
import urllib.parse
from pathlib import Path
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

When you’re just starting out with web scraping, the first steps can feel a little overwhelming. There are tools, libraries, environments, and a whole lot of new words. But once you slow it down and look at each part one piece at a time, the picture becomes much clearer. At first glance, this block might look like a list of random ingredients. But just like a recipe, each one has a purpose. And the magic happens when they work together.


We begin by bringing in modules like asyncio, which helps Python handle tasks without waiting around, almost like letting your program multitask gracefully, while sqlite3 gives you a built-in way to store scraped data locally, similar to keeping your notes organized in a single notebook rather than scattered files. You’ll also notice we import random and logging, which may sound simple, but they quietly play big roles: randomness helps you rotate things like user agents to appear more natural online, and logging keeps a transparent record of what your script is doing behind the scenes—much like a diary you can check later when something breaks. The re module steps in to help with pattern matching when you need to clean or extract text, and urllib.parse helps safely manage and format URLs so your scraper doesn’t stumble on messy links. Then comes Path from pathlib, which offers an easy way to handle file paths without worrying about operating-system differences, making your project feel tidier and more predictable. Finally, we have imports from Playwright—async_playwright and the stealth module—both of which allow the script to open and browse websites the way a human would, loading pages smoothly while trying to avoid detection from strict websites. Together, these imports set the foundation for a scraper that is simple to understand, friendly for beginners, and strong enough to grow into a more advanced project as your skills improve.


Setting Up the Scraper Configuration

# CONFIGURATION 

START_URL = "https://www.amazon.in/s?k=menstrual+cup"

"""str: The main Amazon search page URL from where the scraper starts collecting product links."""

DB_PATH = "/home/anusha/Desktop/DATAHUT/Amazon_cup/DATA/amazon_menstrual_cups.db"

"""str: File path of the SQLite database where all scraped product URLs and product details will be stored."""

USER_AGENT_FILE = "/home/anusha/Desktop/DATAHUT/chewy-and-petco/user_agents.txt"

"""str: Path to the text file containing a list of user agents. The scraper randomly picks one user agent to avoid detection."""

LOG_FILE = "/home/anusha/Desktop/DATAHUT/Amazon_cup/LOG/scraper_log.log"

"""str: File path for saving all log messages (errors, success messages, warnings). Helpful for debugging and tracking scraper activity."""

Setting up the configuration for a web-scraping script often feels like arranging the starting points of a small adventure, and the variables in this section give the scraper clear instructions on where to begin, where to save information, and how to disguise itself while exploring pages. The START_URL points directly to an Amazon search page for menstrual cups, acting as the entry gate from which product links are discovered. The DB_PATH then guides the script toward a specific SQLite database stored on the local system, allowing all collected product details to be organized safely in one place, much like labeling a box so it can be found later. The USER_AGENT_FILE plays an equally important role because it stores different browser identities, and the scraper quietly picks one at random each time to appear more natural while visiting Amazon—similar to how different people have different browsing patterns. Meanwhile, the LOG_FILE creates a dedicated home for all log messages, collecting errors, warnings, and activity notes so the script’s progress stays transparent and easy to troubleshoot. Together, these simple configuration lines act like a roadmap, storage room, disguise kit, and diary for the scraper, giving the entire process a sense of order and continuity before the actual crawling work begins.


Adding Logging to Monitor the Scraper’s Performance

# LOGGING

logging.basicConfig(
    filename=LOG_FILE,
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
)
logger = logging.getLogger(__name__)

"""Sets up the logging system for the scraper.This helps track what the script is doing, including errors, warnings, and progress.All logs will be saved to the file defined in LOG_FILE.Logger object used throughout the script to record messages in the log file."""

Setting up logging in a scraping script often feels like placing a quiet observer in the background, someone who carefully notes what happens at every stage, and the configuration here does exactly that by directing Python’s logging system to write all activity into the file defined by LOG_FILE. The logging.basicConfig call shapes this observer’s behavior by telling it where to save messages, what level of detail to record, and how each line should look once written; beginners often find it helpful to think of this as keeping a diary for the script, one that records important moments like errors, warnings, and general progress updates in a neatly formatted style. The line logger = logging.getLogger(__name__) simply retrieves a logger that the rest of the script can use, allowing every function to add notes to the same diary without needing any extra setup. Once the logger is in place, the script gains a sense of transparency, making it easier to revisit what happened during long scraping sessions and offering clear clues when something behaves unexpectedly, much like a trail of footprints left behind after a long walk through a forest of web pages.


Setting Up the SQLite Database for URL Storage

# DB SETUP
 
def init_db():

    """
    Create (or connect to) the SQLite database and set up the required table.

    This function:
    - Connects to the database file at DB_PATH.
    - Creates the 'product_urls' table if it does not already exist.
      This table stores every product URL scraped from Amazon.
    - Returns the database connection object so the rest of the script can use it.

    Returns:
        sqlite3.Connection: A connection to the SQLite database.
    """
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS product_urls(
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE
        );
    """)
    conn.commit()
    return conn

Creating the database setup function in a scraping project often feels like preparing a dedicated storage shelf before collecting anything, and the init_db function serves that purpose by opening a connection to the SQLite file defined in DB_PATH and ensuring the necessary table is ready to hold product URLs. When the function calls sqlite3.connect(DB_PATH), it either opens the existing database or quietly creates a new one if it is not already present, and this simple behavior is one reason beginners often find SQLite approachable. Once the connection is established, the script prepares a cursor and executes a CREATE TABLE IF NOT EXISTS command, which guarantees that a table named product_urls is available without worrying about errors if it already exists. This table holds two fields—an automatically generated ID and a unique URL—forming a clean structure for storing links gathered from Amazon as the scraper moves through different pages. After committing the change, the function returns the connection so the rest of the script can continue using it, allowing later parts of the code to insert, update, or read data without setting up the database again. In many ways, this small function lays the foundation for the entire scraping process, similar to setting up a notebook before beginning research, ensuring that everything collected has a proper and organized place to stay.


Preparing Clean Amazon URLs Before Scraping

# CLEAN AMAZON URL

def clean_url(raw):
    """
    Clean and normalize Amazon product URLs.

    This function checks if the given URL is a sponsored redirect
    (Amazon uses links like /sspa/click for ads). If yes, it extracts
    the real product URL from the "url=" parameter.
    If the URL is already a normal Amazon product link, it simply returns it as-is.

    Args:
        raw (str): The original URL collected from the page.

    Returns:
        str: A clean, direct product URL without tracking or redirect parameters.
    """

    if not raw:
        return raw

    # Sponsored redirect:
    # /sspa/click?...&url=%2Fbrand-product%2Fdp%2FASIN%2Fref...
    if "sspa/click" in raw and "url=" in raw:
        parsed = urllib.parse.urlparse(raw)
        qs = urllib.parse.parse_qs(parsed.query)

        if "url" in qs:
            true_url = qs["url"][0]
            decoded = urllib.parse.unquote(true_url)
            return decoded

    # Normal product link → keep EXACTLY
    return raw

Cleaning Amazon URLs can feel a bit like removing extra stickers from a package before storing it, and the clean_url function is designed to do exactly that by taking the raw link collected from the page and checking whether it is a straightforward product link or a sponsored redirect that Amazon often uses for advertisements. When the function receives a URL, it first makes sure the value is not empty and then looks for signs of a redirect, especially the pattern often seen in links beginning with /sspa/click, which usually contains the “real” product link hidden inside the url= parameter. Using tools from Python’s urllib.parse module, the function breaks the redirected link into its parts, extracts the original product URL, and decodes it so it becomes readable again, much like opening a folded note to reveal the actual message inside. If the URL doesn’t contain any sponsored indicators, it is returned exactly as it is, since many Amazon product pages follow a predictable structure that works without any adjustments. By the time the function finishes, each link becomes clean and trustworthy, making the scraping process smoother and the stored URLs easier to handle later, creating a natural flow in the pipeline where messy inputs are quietly converted into neat and usable product links.
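
A quick, hypothetical example shows the effect (the ASIN and slug below are made up for illustration):

# What clean_url does to a sponsored redirect versus a normal link
sponsored = (
    "https://www.amazon.in/sspa/click?ie=UTF8&spc=xyz"
    "&url=%2FSample-Menstrual-Cup%2Fdp%2FB0XXXXXXXX%2Fref%3Dsr_1_1"
)
print(clean_url(sponsored))
# -> /Sample-Menstrual-Cup/dp/B0XXXXXXXX/ref=sr_1_1  (relative; urljoin adds the domain later)

normal = "https://www.amazon.in/Sample-Menstrual-Cup/dp/B0XXXXXXXX/"
print(clean_url(normal))   # returned unchanged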


Saving Product URLs into the Database

# SAVE

def save_url(conn, url):

    """
    Save a product URL into the database.

    This function tries to insert the given URL into the 'product_urls' table.
    - If the URL is new, it gets inserted and a log message is created.
    - If the URL already exists, it is skipped (because of INSERT OR IGNORE).
    - Any errors during saving are logged for debugging.

    Args:
        conn (sqlite3.Connection): Active connection to the SQLite database.
        url (str): The product URL to save.
    """

    try:
        cur = conn.cursor()
        cur.execute("INSERT OR IGNORE INTO product_urls(url) VALUES (?)", (url,))
        conn.commit()

        if cur.rowcount > 0:
            logger.info(f"Saved URL: {url}")
        else:
            logger.info(f"Duplicate skipped: {url}")

    except Exception as e:
        logger.exception(f"Error saving URL: {url} | {e}")

Saving each product link into a database may seem like a small step in the scraping process, yet the save_url function shows how important it is to store information carefully and avoid unnecessary duplicates while collecting data from a large site like Amazon. When this function receives a database connection and a URL, it prepares a simple SQL command—INSERT OR IGNORE—which gently attempts to add the link into the product_urls table without causing an error if the same link was already stored earlier, much like placing an item on a shelf only if it is not already there. After running the command and committing the change, the function checks whether a new row was actually added, and based on that, a message is written through the logger to indicate whether the link was successfully saved or skipped because it already existed; this makes it easy to trace how the scraper progresses over time. In case something unexpected happens—perhaps because of a malformed link or a temporary database issue—the except block catches the error and records a detailed message for debugging, helping identify and solve problems without interrupting the entire scraping workflow. Through this small function, the script gains a reliable and organized way to store every valuable URL it discovers, creating a continuous sense of structure as the scraper moves from collecting links to processing them later.

Loading User Agents to Keep the Scraper Undetected

# USER AGENT LOADER 

def load_user_agents():
    """
    Load user agents from the text file.

    This function reads the USER_AGENT_FILE and returns a list of user agent
    strings. These user agents help the scraper mimic different browsers so
    Amazon is less likely to block or detect the scraping activity.

    Returns:
        list[str]: A list of cleaned user agent strings.
    """
    with open(USER_AGENT_FILE, "r") as f:
        agents = [ua.strip() for ua in f.readlines() if ua.strip()]
    return agents

Loading user agents might sound like a technical detail, but the load_user_agents function gives the scraper an important ability: the chance to appear like different browsers each time it visits Amazon, reducing the chances of getting blocked and making the entire scraping process run more smoothly. The function simply opens the file specified by USER_AGENT_FILE, reads each line, removes any extra spaces, and returns a clean list of user agent strings; this list acts almost like a collection of digital disguises, allowing the scraper to rotate between identities the way a person might try different doorways to avoid drawing attention. By returning a neatly prepared list, the function creates a smooth hand-off to the part of the script responsible for selecting a random user agent during requests, forming a natural connection between this early setup step and the later stages where pages are actually fetched. This simple routine, though small in appearance, becomes an essential part of the scraper’s overall strategy, giving it the flexibility to blend in and continue gathering data without interruptions.


How the Scraper Collects Product URLs from Amazon Search Results

#  SCRAPER 

async def scrape_page(page, conn):

    """
    Scrape all product URLs from the current Amazon search results page.

    This function:
    - Finds all product link elements on the page using a CSS selector.
    - Extracts the "href" value of each link.
    - Cleans sponsored redirect URLs using clean_url().
    - Converts relative URLs into full Amazon product URLs.
    - Saves each URL to the database using save_url().
    - Adds a short random delay between link processing to mimic human behavior.

    Args:
        page (playwright.async_api.Page): The current Playwright browser page.
        conn (sqlite3.Connection): Database connection used to save product URLs.
    """
     
    logger.info("Scraping current page...")

    product_links = page.locator('a.a-link-normal.s-line-clamp-3.s-link-style.a-text-normal')

    count = await product_links.count()
    logger.info(f"Found {count} product links")

    for i in range(count):
        try:
            href = await product_links.nth(i).get_attribute("href")
            if not href:
                continue

            cleaned = clean_url(href)

            # Ensure full URL
            full_url = urllib.parse.urljoin("https://www.amazon.in", cleaned)

            save_url(conn, full_url)

            await asyncio.sleep(random.uniform(0.5, 0.8))

        except Exception as e:
            logger.exception(f"Error processing link {i}: {e}")

The scrape_page function works like a careful collector moving through an Amazon search results page, gathering product links one by one and preparing each of them so they can be stored properly for later use, and its flow becomes easier to understand once the steps are seen as part of a single smooth process rather than separate technical tasks. When the scraper arrives on a page, it starts by locating all elements that match Amazon’s product link pattern, using a CSS selector that points specifically to the title links shown in the search results; this selector may look complex at first glance, but it simply tells Playwright which pieces of the page represent actual product entries. After counting how many such links exist, the function loops through each one, extracting the href attribute, which is the raw link Amazon provides. Some of these links may be redirected or cluttered with tracking information, so the clean_url function is called to tidy them up. Once cleaned, the link is turned into a full Amazon URL using urllib.parse.urljoin, ensuring it always starts with the proper domain instead of being left in a relative form. The function then hands this prepared link to save_url, which stores it safely in the database. A short, random pause is added between each processed link to mimic natural browsing speed, which reduces suspicion when scraping large websites. If anything unexpected happens, the error is logged through the existing logging setup, allowing issues to be understood later without breaking the overall scraping session. Each step here flows into the next, creating a quiet rhythm where the script observes the page, gathers links, cleans them, stores them, and moves on, much like someone working steadily through a list without losing focus.


Handling Pagination Across Amazon Search Results

# PAGINATION 
async def pagination_loop(page, conn):

    """
    Loop through all Amazon search result pages and scrape each one.

    This function:
    - Starts from page 1 and keeps scraping until there are no more pages.
    - Calls scrape_page() on every page to extract product URLs.
    - Detects the “Next” button and clicks it to move forward.
    - Stops when the Next button is missing or disabled.
    - Adds small delays to act like a human user and avoid detection.

    Args:
        page (playwright.async_api.Page): Active Playwright page used for navigation.
        conn (sqlite3.Connection): Database connection used for saving URLs.
    """

    page_number = 1

    while True:
        logger.info(f" Processing page {page_number}")

        await scrape_page(page, conn)

        next_btn = page.locator("a.s-pagination-next")

        if await next_btn.count() == 0:
            logger.info("❌ No next button found — stopping pagination")
            break

        disabled = await next_btn.get_attribute("aria-disabled")
        if disabled == "true":
            logger.info("❌ Next button disabled — scraping finished")
            break

        logger.info("➡ Clicking next page...")
        await next_btn.click()

        await asyncio.sleep(random.uniform(1, 2))

        await page.wait_for_load_state("load")
        page_number += 1

The pagination_loop function guides the scraper through Amazon’s multi-page search results in a steady, predictable manner, almost like turning the pages of a long catalog one by one, making sure nothing is missed along the way. It begins on the first page and immediately calls scrape_page, which collects all the product links from that section, and once that page has been fully processed, the function looks for Amazon’s familiar “Next” button to determine whether there is another set of results waiting to be explored. If the button is not present or marked as disabled, it becomes clear that the last page has been reached, and the loop stops naturally without forcing the script to continue. But if the button is active, the function clicks it, waits briefly for the new page to load—much like a user pausing while a site refreshes—and then proceeds to scrape the next page in the exact same way. Each cycle includes a small, intentional delay so the scraper behaves more like a real person browsing through search results, which reduces the chances of triggering Amazon’s automated checks. Over time, this loop builds a quiet rhythm: scrape, check for the next page, move forward, and repeat until the entire trail of results has been followed from start to finish, forming a continuous flow that keeps the scraping process both organized and reliable.


Controlling the Scraping Workflow with the Main Function

# MAIN 

async def main():

    """
    Main entry point of the Amazon scraper.

    This function:
    - Initializes the database and loads user agents.
    - Launches Playwright with a random user agent (to avoid detection).
    - Opens the Amazon search URL defined in START_URL.
    - Applies Playwright Stealth to look more like a real browser.
    - Starts the pagination loop to scrape all product URLs from every page.
    - Closes the browser and database after finishing.

    It controls the full scraping workflow from start to finish.
    """
    
    conn = init_db()
    user_agents = load_user_agents()
    
    logger.info(" Scraper started")

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            args=["--disable-blink-features=AutomationControlled"]
        )

        ua = random.choice(user_agents)
        context = await browser.new_context(
            user_agent=ua,
            viewport={"width": 1280, "height": 800}
        )
        logger.info(f"Using User-Agent: {ua}")

        page = await context.new_page()
        await stealth_async(page)

        logger.info(f"Opening: {START_URL}")
        await page.goto(START_URL, wait_until="load", timeout=60000)

        await asyncio.sleep(random.uniform(2, 4))

        await pagination_loop(page, conn)

        await browser.close()
        conn.close()

    logger.info(" Scraping completed!")

The main function acts as the central controller of the entire scraping process, guiding each step in a smooth sequence so everything works together without confusion, much like how a well-organized workflow begins with preparation, moves through the task itself, and ends by cleaning up properly. It first sets up the database that will store all collected URLs and then reads the list of user agents from the local file, which helps the scraper behave more like a regular browser. Once the preparation is complete, the function launches Playwright and assigns a random user agent so the scraper blends in more naturally while visiting the Amazon search results page defined by the START_URL, and a short pause gives the browser time to load fully, similar to how a person waits for a page to settle before scrolling. After that, the scraper enters the pagination loop, which patiently moves through the full set of search result pages, gathering URLs from each one through the earlier functions; this creates a connected chain of steps rather than isolated pieces of work. When the loop finishes, the browser and database connection are both closed to avoid leaving loose ends, and a final log message marks the completion of the task. By moving through each phase in a steady and predictable rhythm—setup, browsing, scraping, and closure—the main function keeps the workflow simple and approachable, even for someone just beginning to understand how automated scraping scripts operate.


Entry Point

# ENTRY POINT 

"""
Runs the scraper when this file is executed directly.
This triggers the main() function and starts the entire scraping process.
"""

if __name__ == "__main__":
    asyncio.run(main())

When a Python script reaches the if __name__ == "__main__": line, it simply means the file has been opened intentionally to run the program, and this tiny section quietly becomes the gateway that starts everything; calling asyncio.run(main()) is like turning the key in an engine, allowing the main scraper function to take over, prepare the browser, and begin collecting data step by step. This entry point keeps the project organized by making sure the scraping workflow runs only when the file itself is executed, not when it is imported somewhere else, similar to how a door opens only when the right key is used. This small entry block may look simple, but it plays an important role in keeping the scraper predictable, readable, and easy to extend as the project grows.



Step 2: From Links to Comprehensive Product Information


Importing Libraries

# IMPORTS

import asyncio
import random
import sqlite3
import json
import logging
import re
from contextlib import contextmanager
from typing import Optional, List, Tuple, Dict
from bs4 import BeautifulSoup
from urllib.parse import urlparse, unquote
from playwright.async_api import async_playwright, Page
from playwright_stealth import stealth_async
from curl_cffi import requests

In many scraping projects, different libraries handle different parts of the job, and seeing names like curl_cffi for the first time can feel a little unfamiliar, but it helps to think of it as a faster, more flexible version of the usual request tools, designed to mimic real browser traffic more closely so that websites treat the scraper like a normal visitor. Alongside it, modules such as urllib.parse and BeautifulSoup quietly take care of tasks like cleaning messy URLs and interpreting raw HTML, while type-hints like Optional or Dict bring clarity to how data flows through the script. The moment these imports appear at the top of a file, the code silently prepares itself with all the tools needed for the steps ahead, similar to how a workspace is set up before starting a task. Even though these imports may look like a simple list, they lay the foundation for the entire scraper, allowing the later functions to focus on the actual logic without repeatedly rewriting these essential features.


Defining Paths, User Agents, and Log Settings

# CONFIG / CONSTANTS

"""Holds all file paths and important settings used by the scraper.
These constants make the script easier to configure and reuse.
"""
DB_PATH = "/home/anusha/Desktop/DATAHUT/Amazon_cup/DATA/amazon_menstrual_cups.db"

"""str: Path to the SQLite database where scraped product URLs will be stored."""

USER_AGENT_FILE = "/home/anusha/Desktop/DATAHUT/chewy-and-petco/user_agents.txt"

"""str: Path to a text file containing multiple user-agent strings.
The scraper picks one randomly to reduce blocking by Amazon."""

LOG_FILE = "/home/anusha/Desktop/DATAHUT/Amazon_cup/LOG/amazon_scraper_merged.log"

"""str: Path to the log file where the scraper will record all errors and progress."""


Setting up a scraper often begins with gathering a few core details in one place, and that is what these configuration constants aim to do by keeping paths and settings neatly organized so the rest of the script stays clean and easy to read. The database path simply tells the program where to store collected product information, much like pointing someone to the correct drawer before filing documents. The user-agent file plays an equally helpful role by holding different browser identities that the scraper can rotate through, a small trick that makes requests appear more natural. The log file path takes care of another practical need by giving the scraper a place to record errors and progress, making it easier to revisit what happened during long runs. Although these lines might seem simple at first glance, they quietly set the foundation for the entire workflow, allowing the rest of the code to focus on the actual scraping rather than repeatedly hunting for paths or settings.


How Simple Request Headers Help a Scraper Communicate Better

# HEADERS

"""Default HTTP headers used when sending requests.
These help make the scraper look more like a normal browser request.
"""

HEADERS = {
    "content-type": "text/html",
    "Connection": "keep-alive",
    "cache-control": "no-transform",
}

When a scraper sends a request to a website, the server often expects it to behave like a typical browser, and that’s where these default headers come in, acting like a small introduction that tells the site what kind of content is being exchanged and how the connection should be handled. The content-type header describes the format of the content being exchanged, while Connection: keep-alive helps maintain a stable link so the scraper doesn’t reconnect repeatedly, which can slow things down. The cache-control: no-transform setting adds another layer of predictability by asking intermediaries not to modify the response on its way back. Even though these lines may look tiny, they quietly shape how the scraper communicates with the server, making the interaction smoother and more reliable as the rest of the script runs.


Understanding the Core Settings That Guide the Scraper’s Behavior

# RETRIES & SCRAPER CONSTANTS

"""General settings that control scraper behavior.

RETRIES: How many times to retry a failed request.
PLAYWRIGHT_TIMEOUT: Maximum wait time (in milliseconds) for Playwright page loads.
MIN_HTML_LEN: Minimum HTML size required to consider a response valid.
SLEEP_RANGE: Random delay range (in seconds) between actions to mimic human behavior.
TABLE_NAME: Name of the main database table that stores product details.
URLS_TABLE: Name of the table that stores all collected product URLs.
"""

RETRIES = 3
PLAYWRIGHT_TIMEOUT = 120_000  # ms
MIN_HTML_LEN = 5_000
SLEEP_RANGE = (3.0, 4.0)
TABLE_NAME = "product_cffi_3"
URLS_TABLE = "product_urls"

A scraper often needs a few guiding rules to handle the unpredictable nature of websites, and these constants help set that foundation by defining how many times a request should be attempted again when something goes wrong, how long Playwright should wait for a page to load, and even the minimum amount of HTML needed to consider a response useful. The retry count keeps the script from giving up too quickly when a site responds slowly, while the timeout prevents it from waiting forever on a frozen page. A small random pause, controlled by the sleep range, adds a touch of natural behavior so the scraper doesn’t look too mechanical, and the table names simply point the script to where product details and collected URLs should be stored inside the database.


Keeping Track of the Scraper’s Journey with Simple Logging

# LOGGING SETUP

"""Configures logging for the scraper.

This setup:
- Saves all logs to the file defined in LOG_FILE.
- Uses 'append' mode so logs are added instead of overwritten.
- Records timestamps, log levels, and messages.
Logging helps track progress and debug errors while scraping.
"""

logging.basicConfig(
    filename=LOG_FILE,
    filemode="a",
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s"
)
logging.info("🚀 Merged Scraper Started")

A scraper benefits greatly from a clear record of what happens during its run, and this logging setup creates that trail by writing every important message into a dedicated log file without deleting previous entries. Each line is stored with a timestamp and a label that describes the type of message, making it easier to understand when something succeeded or when an error needs attention. This simple structure turns the log file into a quiet companion that keeps track of each step, helping beginners trace issues without feeling overwhelmed.


Why a Scraper Needs User Agents and How This Utility Helps

# UTILITIES

def load_user_agents(path: str = USER_AGENT_FILE) -> List[str]:

    """
    Load user-agent strings from a text file.

    This function:
    - Reads each line from the user-agent file.
    - Cleans empty lines or spaces.
    - Returns a list of valid user-agent strings.
    - If the file is missing or empty, it logs a warning and returns a safe fallback user agent.

    Args:
        path (str): Path to the user-agent file.

    Returns:
        List[str]: A list of user-agent strings for rotating during scraping.
    """
      
    try:
        with open(path, "r") as f:
            ualist = [ua.strip() for ua in f if ua.strip()]
            if not ualist:
                raise ValueError("User agent file is empty")
            return ualist
    except Exception as e:
        logging.warning(f"Could not load user agents ({e}), using fallback UA.")
        return [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ]

USER_AGENTS = load_user_agents()

"""List[str]: Loaded user agents used for rotating requests and avoiding detection."""

This small utility focuses on gathering user-agent strings from a simple text file, almost like collecting different “browser masks” that help a scraper blend in while requesting pages online. The function reads each line carefully, ignores empty entries, and returns a clean list that can be rotated later during scraping. If something goes wrong—maybe the file is missing or contains no usable data—the code gently switches to a safe fallback user agent and records the issue through logging, keeping the flow of the script steady instead of stopping unexpectedly.


Cleaning and Standardizing Amazon Product URLs

# URL CLEANING

def clean_amazon_url(raw_url: str, asin: Optional[str]) -> str:
    """
    Generate clean canonical URL:
    https://amazon.in/<slug>/dp/ASIN/

    If slug is missing → return:
    https://amazon.in/dp/ASIN/
    Create a clean, canonical Amazon product URL using the product ASIN.

    This function:
    - Extracts the base domain from the original URL.
    - Attempts to find a readable slug (product title section) from the URL path.
    - Builds a clean URL in one of these formats:
        • https://amazon.in/<slug>/dp/ASIN/
        • https://amazon.in/dp/ASIN/   (if slug cannot be found)
    - If ASIN is missing, the original URL is returned unchanged.

    Args:
        raw_url (str): The original messy or redirected Amazon product URL.
        asin (str or None): The extracted ASIN value of the product.

    Returns:
        str: A clean, standardized Amazon product URL.
    
    """
    parsed = urlparse(raw_url)
    domain = f"{parsed.scheme}://{parsed.netloc}"

    # If ASIN not found → return raw
    if not asin:
        return raw_url

    parts = parsed.path.split("/")
    slug = None

    # Try to extract <title-slug> from the URL path, skipping structural
    # segments such as "dp", "gp", the ASIN itself, and "ref=..." suffixes
    for p in parts:
        if p and p not in ["dp", "gp", "product", asin] and not p.startswith("ref=") and len(p) > 3:
            slug = p
            break

    if slug:
        return f"{domain}/{slug}/dp/{asin}/"

    return f"{domain}/dp/{asin}/"

This URL-cleaning function helps transform long, cluttered Amazon links into simple, consistent versions by breaking the original address into parts and keeping only the essential pieces, such as the domain and the product’s ASIN. The logic looks through the URL path for a readable slug—usually a hint of the product name—and if one is found, the function rebuilds the link in a tidy format that is easier to store or reuse later; if no slug is available, the code still creates a clean fallback using only the ASIN. Concepts such as URL parsing and path segments appear throughout this script, and seeing them here makes it easier to understand how this function turns a messy link into a predictable, standardized one without interrupting the natural flow of the scraping process.
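
A short, hypothetical example (the ASIN is invented) shows what the function produces:

# A long search-result link becomes a short canonical one once the ASIN is known
messy = "https://www.amazon.in/Sample-Menstrual-Cup/dp/B0XXXXXXXX/ref=sr_1_1?keywords=menstrual+cup"
print(clean_amazon_url(messy, "B0XXXXXXXX"))
# -> https://www.amazon.in/Sample-Menstrual-Cup/dp/B0XXXXXXXX/

print(clean_amazon_url(messy, None))   # no ASIN -> original URL returned unchanged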


A Simple and Safe Way to Handle Database Connections

# DATABASE CONTEXT MANAGER

@contextmanager
def db_conn(path: str = DB_PATH):
    """
    A simple context manager for opening and closing the database connection.

    This helper:
    - Opens a SQLite connection to the given DB path.
    - Lets you run any database operations inside the 'with' block.
    - Automatically commits changes when finished.
    - Ensures the connection is always closed, even if an error occurs.

    Args:
        path (str): Path to the SQLite database file.

    Yields:
        sqlite3.Connection: An active database connection.
    """
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()

This small database helper creates a safe workspace for interacting with a SQLite file, allowing the code to open a connection, run queries, and close everything cleanly without extra effort. The idea is similar to borrowing a tool for a moment and returning it once the job is done, ensuring nothing is left half-open or forgotten. By placing database actions inside a with block, the function quietly handles tasks like committing changes or closing the connection, even if an unexpected error appears along the way.
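
A typical use looks like this; the query simply counts stored URLs as an example:

# Borrow a connection, run a query, and let the context manager commit and close
with db_conn() as conn:
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {URLS_TABLE}")
    print("URLs stored so far:", cur.fetchone()[0])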


Setting Up the SQLite Database for Storing Product Data

# DATABASE: init + helpers

def init_db():
   
    """
    Initialize the SQLite database used for scraping.

    This function creates two tables if they do not already exist:

    1. product_urls (URLS_TABLE)
       - Stores all product URLs.
       - Ensures a 'scraped' column exists to track scraping status.

    2. product_cffi_3 (TABLE_NAME)
       - Stores scraped product data.
       - Keeps only selected fields: url, title, brand, price, and mrp.

    It safely checks for missing columns, creates them when needed, 
    and logs each setup step for easier debugging.
    """

    with db_conn() as conn:
        cur = conn.cursor()

        # product_url
        cur.execute(f"""
            CREATE TABLE IF NOT EXISTS {URLS_TABLE}(
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT UNIQUE
            )
        """)
        # ensure scraped column exists
        cur.execute(f"PRAGMA table_info({URLS_TABLE})")
        existing = {r[1] for r in cur.fetchall()}
        if "scraped" not in existing:
            try:
                cur.execute(f"ALTER TABLE {URLS_TABLE} ADD COLUMN scraped INTEGER DEFAULT 0")
                logging.info("Added 'scraped' column to product_url")
            except Exception as e:
                logging.warning(f"Could not add scraped column: {e}")

        # combined product table - ONLY KEEPING SPECIFIED FIELDS
        cur.execute(f"""
            CREATE TABLE IF NOT EXISTS {TABLE_NAME} (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT UNIQUE,
                title TEXT,
                brand TEXT,
                price TEXT,
                mrp TEXT
            )
        """)

        logging.info(f"DB initialized, tables: {URLS_TABLE}, {TABLE_NAME}")

Setting up the database begins with a small helper that quietly creates the tables needed for storing product links and the final scraped details, making sure everything is ready before the main scraper starts collecting information. The function opens a connection through the context manager defined earlier, creates the URL table if it is missing, and adds a “scraped” column when the database does not already have it, which helps the script remember what has been processed. It then prepares another table to hold the cleaned product data—only essential fields like the title, brand, and pricing—so the final dataset stays tidy and easy to handle. Each action is logged for clarity, giving a clear trail in case something needs to be checked later.


Fetching Only the URLs That Are Still Waiting to Be Scraped

# DATABASE: fetchers & savers

def fetch_pending_urls(limit: Optional[int] = None) -> List[Tuple[int, str]]:

    """
    Fetch product URLs that are not yet scraped.

    This function reads the `product_urls` table and returns all rows
    where the `scraped` column is 0, meaning the URL still needs to be processed.

    Args:
        limit (int, optional): Maximum number of URLs to return.
            - If a positive number is provided, only that many URLs are fetched.
            - If None or invalid, all pending URLs are returned.

    Returns:
        List[Tuple[int, str]]: A list of (id, url) pairs representing URLs
        waiting to be scraped.
    """

    with db_conn() as conn:
        cur = conn.cursor()
        q = f"SELECT id, url FROM {URLS_TABLE} WHERE scraped = 0"
        if isinstance(limit, int) and limit > 0:
            q += f" LIMIT {limit}"
        cur.execute(q)
        return cur.fetchall()

A beginner often wonders how a scraper knows which links still need attention, and this small function offers a simple way to manage that flow by quietly checking the database and pulling out only the URLs that are yet to be processed, much like picking unread messages from an inbox. The idea is straightforward: the database keeps a table named product_urls, and inside it sits a column called scraped, which marks each link as either done or pending. Whenever the function runs, it opens a safe connection using db_conn()—a helper explained earlier in the project—and gently reads only the rows where the scraped value is 0, meaning they are untouched and ready for processing. If a limit is provided, such as when testing or handling a small batch, the query simply adds a cap; otherwise, it gathers everything remaining. The result is returned as a clean list of pairs containing each ID and its corresponding URL, making the next parts of the pipeline easier to handle. This kind of step forms the backbone of many scraping workflows, because managing progress reliably is just as important as fetching the data itself.


How Product Details Are Stored Using a Smart Upsert Method

# SAVE PRODUCT DATA WITH UPSERT LOGIC  

def save_product(data: Dict[str, Optional[str]]):
    """
    Save product information into the database with an UPSERT (insert-or-update) logic.

    This function is designed to store scraped product data efficiently while
    preventing accidental overwriting of already existing non-null values.

    How it works:
    1. The function first checks whether the `url` key exists in `data`, because
       the `url` is used as the unique identifier for each product.
       If missing, the function logs an error and stops.

    2. It ensures the database has a row for this URL:
         - If the URL is not present, INSERT a new row (using `INSERT OR IGNORE`)
         - If it already exists, the insert is ignored (no error)

    3. It updates selected columns (`title`, `brand`, `price`, `mrp`) using
       SQLite's `COALESCE()`:
         - `COALESCE(new_value, existing_value)` means:
             • If new value is NOT NULL → update the column  
             • If new value is NULL → keep the existing non-null value  
       This protects existing good data from being overwritten by incomplete
       scraped data.

    Behavior
    -----------
    - Inserts a new row if the URL does not exist.
    - Updates only non-null fields.
    - Skips overwriting existing values with NULL.
    - Logs each save/update action for debugging and tracking.

    """
    url = data.get("url")
    if not url:
        logging.error("save_product called without url")
        return

    # ONLY KEEPING SPECIFIED FIELDS
    cols = ["title", "brand", "price", "mrp"]

    with db_conn() as conn:
        cur = conn.cursor()
        # ensure row exists
        cur.execute(f"INSERT OR IGNORE INTO {TABLE_NAME} (url) VALUES (?)", (url,))
        # build update statement with COALESCE
        set_parts = []
        params = []
        for c in cols:
            set_parts.append(f"{c} = COALESCE(?, {c})")
            params.append(data.get(c))
        params.append(url)
        sql = f"UPDATE {TABLE_NAME} SET {', '.join(set_parts)} WHERE url = ?"
        cur.execute(sql, params)
        logging.info(f"Saved/updated product: {url}")

Storing scraped product details becomes much easier when a function gently handles both new entries and updates, and this save_product function does exactly that by treating the product’s URL as its identity and building the rest of the process around it. The moment the function receives a dictionary of product information, it first checks whether a URL is present, because that single value decides where the data belongs in the database; without it, the function simply logs an error and steps back. Once a URL is confirmed, the database connection created through db_conn() helps ensure that a row exists for that URL by using an INSERT OR IGNORE statement, and this approach quietly avoids duplicate-row errors. After the row is secured, the function prepares an update that relies on SQLite’s COALESCE() feature, allowing each new piece of data—title, brand, price, or mrp—to be added only when it is not missing; if a value is None, the database keeps the older non-empty value instead of replacing it. This small detail becomes very helpful when some pages fail to load fully or when partial data is all that is available during a scrape. Each update runs smoothly without disturbing existing information, and the process fits naturally with earlier steps like fetching pending URLs, forming a clear workflow that moves from discovering product links to storing their details safely.
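
A hypothetical pair of calls (the URL and values are invented) shows the upsert behaviour in practice:

# First pass stores the basics for a product
save_product({"url": "https://www.amazon.in/dp/B0XXXXXXXX/",
              "title": "Sample Menstrual Cup", "brand": "SampleBrand",
              "price": "399", "mrp": None})

# A later, partial scrape does not wipe the earlier values:
# COALESCE keeps the existing column wherever the new value is None
save_product({"url": "https://www.amazon.in/dp/B0XXXXXXXX/",
              "title": None, "brand": None, "price": "379", "mrp": "599"})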


Marking URLs as Completed in the Scraping Process

# MARK URL AS SCRAPED

def mark_scraped(row_id: int):

    """
    Mark a URL record as scraped.

    This function updates the `scraped` column in the URLs table for the given ID.
    It is used to track scraping progress so that the scraper does not reprocess
    URLs that were successfully completed.

    Parameters
    ----------
    row_id : int
        The primary key (ID) of the URL row to update.
        """

    with db_conn() as conn:
        cur = conn.cursor()
        cur.execute(f"UPDATE {URLS_TABLE} SET scraped = 1 WHERE id = ?", (row_id,))
        logging.info(f"Marked scraped → ID {row_id}")

Tracking progress becomes much easier when each URL clearly shows whether it has already been processed, and the mark_scraped function handles this step in a simple, predictable way that even beginners can follow without confusion. The moment this function receives an ID, it opens a database connection through the same db_conn() context manager that earlier parts of the system rely on, creating a smooth flow from fetching pending URLs to saving product details and finally marking them as completed. Inside the connection, a small SQL update statement sets the scraped column to 1 for the matching row, which acts like flipping a switch that tells the scraper not to revisit that link again. This idea is similar to maintaining a checklist during a long task: once an item is marked as done, it never needs attention again unless something goes wrong. By fitting naturally with earlier helper functions—like the one that fetches pending URLs and the one that stores scraped product details—the mark_scraped function becomes part of a continuous story: URLs are discovered, processed, safely stored, and finally checked off, creating a clear and predictable cycle that keeps the scraping process organized from start to finish.
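For context, the scraped flag lives in the same URLs table that the URL-collection stage filled earlier. The real schema is created inside init_db(), which is not reproduced here, but a minimal sketch of a product_urls table that would satisfy both fetch_pending_urls and mark_scraped could look like this (illustrative only; the database file name is hypothetical):

import sqlite3

# Illustrative schema only; the actual table is created inside init_db().
conn = sqlite3.connect("amazon_menstrual_cups.db")  # hypothetical file name
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS product_urls (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- the row_id passed to mark_scraped()
        url     TEXT UNIQUE NOT NULL,               -- uniqueness prevents duplicate links
        scraped INTEGER DEFAULT 0                   -- 0 = pending, 1 = done
    )
    """
)
conn.commit()
conn.close()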


Handling Reliable Fetching with curl-cffi

# FETCHERS

def fetch_using_curl(url: str, ua: str, retries: int = RETRIES) -> Optional[str]:

    """
    Fetch HTML content using curl-cffi (requests with browser impersonation).

    This function sends an HTTP GET request to the given URL using a specified
    user agent string. It attempts multiple retries to handle temporary failures
    such as network issues, short responses, or non-200 status codes.


    What This Function Does
    ------------------------
    - Builds custom headers including the provided user agent.
    - Sends GET requests using curl-cffi’s impersonation mode (Chrome-like).
    - Logs success or failure for debugging.
    - Ensures the HTML is long enough (anti-bot protection often returns short pages).
    - Tries again when:
        * The status code is not 200  
        * HTML length is too small  
        * A network or request exception occurs  

    """    
    headers = {"User-Agent": ua, **HEADERS}
    for attempt in range(1, retries + 1):
        try:
            r = requests.get(url, impersonate="chrome", headers=headers, timeout=30)
            status = getattr(r, "status_code", None)
            if status == 200 and getattr(r, "text", None) and len(r.text) >= MIN_HTML_LEN:
                logging.info(f"cURL success [{len(r.text)} bytes]: {url}")
                return r.text
            logging.debug(f"cURL short/non-200 (status={status}) attempt {attempt}: {url}")
        except Exception as e:
            logging.warning(f"cURL attempt {attempt} error for {url}: {e}")
    return None

Fetching web pages reliably can be tricky, especially when sites try to block automated access, and the fetch_using_curl function provides a beginner-friendly way to handle this by using curl-cffi, a Python library that acts like a real browser. This function starts by taking a URL and a user-agent string, which tells the website which browser is being simulated—this is important because sites often serve different content or block requests that don’t look like they come from a real user. Inside the function, custom headers are built to resemble a standard browser request, and a loop tries multiple times to fetch the page in case of temporary network issues or bot protection that can return very short HTML pages. Each attempt sends a GET request using curl-cffi’s impersonation mode, which mimics Chrome, and the response is checked to ensure the status code is 200 and the HTML length is above a minimum threshold, which avoids saving incomplete or error pages. If the page is successfully fetched, it logs the success with the size of the response, and if it fails, warnings or debug messages record the status and attempt number for easier troubleshooting. This approach fits naturally with other parts of a scraper, such as database helpers like fetch_pending_urls and save_product, creating a smooth workflow from discovering URLs to safely storing product data.
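The fetcher also leans on a few module-level names that are defined elsewhere in the script: the curl-cffi requests module, plus the HEADERS, RETRIES, and MIN_HTML_LEN constants. The sketch below shows one plausible way to define them, with illustrative values only, followed by a call using a hypothetical URL and user agent:

# Plausible definitions for the names fetch_using_curl relies on; the values are
# illustrative, not the project's actual settings.
from curl_cffi import requests  # requests-compatible client with browser impersonation

RETRIES = 3             # assumed number of attempts per URL
MIN_HTML_LEN = 50_000   # assumed minimum size for a "real" product page
HEADERS = {             # assumed browser-like headers merged with the User-Agent
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-IN,en;q=0.9",
}

# Hypothetical call
html = fetch_using_curl(
    "https://www.amazon.in/dp/EXAMPLE0001",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
)
print(len(html) if html else "fetch failed")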


How Playwright Helps Load Dynamic Amazon Pages

# PLAYWRIGHT FETCHER

async def playwright_fetch(page: Page, url: str, retries: int = RETRIES) -> Optional[str]:
    """
    Fetch full HTML content using Playwright with retries.

    This function loads a webpage using Playwright and waits for important
    Amazon elements to appear. It helps bypass anti-bot blocks, dynamic loading,
    and JavaScript-rendered content. If loading fails, it retries several times.

    What This Function Does
    ------------------------
    - Opens the URL using Playwright’s Chromium engine.
    - Waits until the network is idle (no pending requests).
    - Adds extra wait time for Amazon’s dynamic content to finish loading.
    - Waits for several key Amazon selectors (title, bullets, details).
    - Validates the HTML length to ensure the page is fully loaded.
    - Retries when:
        * Page load timeout occurs
        * HTML is too short (blocked page)
        * Any unexpected error happens

    """

    for attempt in range(1, retries + 1):
        try:
            resp = await page.goto(
                url,
                timeout=PLAYWRIGHT_TIMEOUT,
                wait_until="networkidle"       #  Wait until no network requests
            )

            # Additional waits for Amazon dynamic elements
            await page.wait_for_load_state("networkidle")
            await page.wait_for_timeout(3000)  # Wait 3 seconds extra

            # Wait for essential selectors
            selectors = [
                "#productTitle",
                "#feature-bullets",
                "#bylineInfo",
                "#detailBulletsWrapper_feature_div",
                "#prodDetails",
            ]

            for sel in selectors:
                try:
                    await page.wait_for_selector(sel, timeout=5000)
                except Exception:
                    pass  # ignore if not found (not mandatory fields)

            html = await page.content()
            if html and len(html) >= MIN_HTML_LEN:
                logging.info(f"Playwright success [{len(html)} bytes]: {url}")
                return html

            logging.warning(f"Playwright short HTML attempt {attempt} for {url}")
            await asyncio.sleep(2.0)

        except Exception as e:
            logging.warning(f"Playwright attempt {attempt} error for {url}: {e}")
            await asyncio.sleep(2.0)

    return None

Using playwright_fetch provides a modern way to fetch web pages that load content dynamically with JavaScript, which is common on sites like Amazon, and this function is designed to handle those challenges in a beginner-friendly way. It starts by taking a Playwright Page object and a URL, then tries multiple times to load the page while waiting until network activity has settled, ensuring that all dynamic elements are loaded. After the main page load, the function pauses a few extra seconds and checks for essential elements like the product title, features, and details, but it doesn’t fail if some of these selectors are missing, which makes it flexible across different product pages. Once the page is fully loaded and the HTML content meets a minimum length requirement, it logs the success with the size of the page, which helps track scraping progress. If an attempt fails due to network issues or incomplete content, it waits a short period and retries, providing a robust way to handle transient errors. This approach works smoothly alongside other tools in the scraper, like fetch_using_curl for simpler requests and database helpers such as save_product to store clean data.
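Because playwright_fetch expects a live Page object, it is normally called from inside an async Playwright session like the one the main runner sets up. A standalone sketch of that usage, assuming the module constants (PLAYWRIGHT_TIMEOUT, MIN_HTML_LEN, RETRIES) are already defined and using a hypothetical URL, might look like this:

import asyncio
from playwright.async_api import async_playwright

async def demo():
    # Standalone usage sketch; the real project drives this from main()
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        html = await playwright_fetch(page, "https://www.amazon.in/dp/EXAMPLE0001")  # hypothetical URL
        print(len(html) if html else "fetch failed")
        await browser.close()

asyncio.run(demo())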


A Simple Helper for Safely Extracting Text

# PARSING HELPERS

def safe_text(soup: BeautifulSoup, selector: str) -> Optional[str]:
    """
    Safely extract text from a BeautifulSoup selector.

    This helper function looks for the first element matching the given CSS
    selector and returns its cleaned (strip=True) text. If the element does not
    exist, it returns None instead of raising an error.

    Why This Is Useful
    ------------------
    - Prevents crashes when selectors are missing.
    - Makes parsing more reliable for sites like Amazon with inconsistent HTML.
    - Keeps code clean by avoiding repeated try/except blocks.
    """
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else None

The safe_text function is a simple yet powerful helper for extracting text from HTML using BeautifulSoup, which is especially useful when scraping pages that may or may not have certain elements, like product descriptions or titles. It takes a BeautifulSoup object representing the parsed HTML and a CSS selector string, then looks for the first matching element. If the element exists, it returns the cleaned text without extra spaces; if it’s missing, it safely returns None instead of causing an error, which helps prevent the scraper from crashing on inconsistent pages. This approach fits seamlessly with other parts of the scraper, like playwright_fetch or fetch_using_curl, because it allows the dynamic or static HTML content to be parsed reliably, and the results can then be stored in the database using functions like save_product.
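A quick illustration with a tiny, made-up HTML fragment shows both outcomes: a matching selector returns clean text, and a missing one quietly returns None:

from bs4 import BeautifulSoup

snippet = '<div><span id="productTitle">  Example Menstrual Cup  </span></div>'
soup = BeautifulSoup(snippet, "html.parser")

print(safe_text(soup, "#productTitle"))  # "Example Menstrual Cup"
print(safe_text(soup, "#bylineInfo"))    # None, because the selector is absent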


How Product Information Is Pulled from <tr> Rows

# PARSE KEY-VALUE ROWS FROM <TR>

def parse_key_value_rows(soup: BeautifulSoup) -> dict:
    
     """
    Extract key–value information from table rows (<tr>).

    This function scans all <tr> elements in the HTML and looks for <th> (key)
    and <td> (value) pairs. It cleans both the key and value and converts the
    keys into a consistent lowercase format. The result is returned as a
    dictionary.

    This is commonly used for Amazon product detail tables where information
    like “Brand”, “Manufacturer”, “Item Weight”, etc., appears in table rows.

    What This Function Does
    -----------------------
    - Loops through every <tr> element in the document.
    - For each row, extracts:
        * <th> → key  
        * <td> → value
    - Skips rows that don’t have both <th> and <td>.
    - Cleans key/value text (removes extra spaces, newlines).
    - Normalizes keys by:
        * Lowercasing them
        * Removing trailing colons (e.g., “Brand:” → “brand”)
    - Stores result in a dictionary.

    Why This Helps
    --------------
    Amazon has multiple variations of detail tables.  
    This function creates a consistent structured output regardless of layout.
    """
    
    out = {}
    for tr in soup.select("tr"):
        th = tr.find("th")
        td = tr.find("td")
        if not th or not td:
            continue
        key = th.get_text(separator=" ", strip=True)
        val = td.get_text(separator=" ", strip=True)
        if not key:
            continue
        k = re.sub(r"\s*:\s*$", "", key.strip().lower())
        out[k] = val
    return out

The parse_key_value_rows function offers a gentle way to turn messy product detail tables into clean, structured data by scanning through each <tr> element in the HTML and picking out the <th> and <td> pairs that usually hold important product facts like brand names, item weights, or manufacturer details. It checks every row safely, skips incomplete ones, and carefully cleans both the key and value so the final result feels consistent and easy to use. The key text is converted to lowercase, extra spaces are trimmed, and trailing colons are removed, which means “Brand:” and “Brand” are treated the same. Each cleaned key and value is then stored in a simple dictionary, allowing the rest of the scraper—whether it uses playwright_fetch, fetch_using_curl, or helper functions like safe_text—to rely on predictable data instead of dealing with different table formats. This becomes particularly helpful on pages like Amazon’s, where the layout can shift but the basic idea remains the same. The overall flow of this function makes the data extraction step feel like a natural continuation of the scraping process, turning raw table rows into readable information without adding unnecessary complexity.
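The behaviour is easiest to see with a small, made-up detail table in the style Amazon uses; the rows and values below are purely illustrative:

from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><th>Brand</th><td>ExampleBrand</td></tr>
  <tr><th>Item Weight :</th><td>30 g</td></tr>
  <tr><td>A row without a th is skipped</td></tr>
</table>
"""
soup = BeautifulSoup(table_html, "html.parser")
print(parse_key_value_rows(soup))
# {'brand': 'ExampleBrand', 'item weight': '30 g'}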


How List Items Are Parsed Into Structured Product Details

# PARSE LIST ITEMS FOR KEY-VALUE PAIRS

def parse_list_items(soup: BeautifulSoup) -> dict:

    """
    Extract key–value data from <li> list items.

    This function looks for Amazon-style list entries where the key is usually
    inside a bold <span> (often using the class "a-text-bold"), followed by the
    value in the remaining text of the <li>. It cleans the key, removes the
    bold element, and then extracts the value that follows.

    How It Works
    ------------
    - Iterates over all <li> elements in the document.
    - Looks for a bold <span> inside the <li> to treat as the “key”.
        * First tries `.a-text-bold`
        * If not found, searches ANY class containing "a-text-bold"
    - Extracts and cleans the key (removes trailing colon, lowercases text).
    - Removes the bold <span> from the <li> so only the value remains.
    - Extracts the remaining text as the “value”.
    - Stores the key → value pair in the output dictionary.

    Why This Is Useful
    ------------------
    Amazon frequently embeds product details in bullet-list formats.
    Keys are bold, values follow after.  
    This function normalizes that into a clean dictionary for easy storage.
    """

    out = {}
    for li in soup.select("li"):
        # find bold span commonly used on Amazon: <span class="a-text-bold">Key :</span>
        bold = li.select_one(".a-text-bold")
        if not bold:
            # try finding a span with class containing 'a-text-bold'
            bold = li.find(lambda t: t.name == "span" and t.get("class") and any("a-text-bold" in cls for cls in t.get("class")))
        if not bold:
            continue
        key = bold.get_text(" ", strip=True)
        key = re.sub(r"\s*:\s*$", "", key).strip().lower()
        try:
            bold.extract()
        except Exception:
            pass
        val = li.get_text(" ", strip=True)
        if val:
            out[key] = val
    return out

The parse_list_items function plays a helpful role when product details appear inside bullet lists, especially on pages like Amazon where a bold label is often followed by the actual information, and the goal of this function is to gently separate these two parts and turn them into clean key–value pairs that a scraper can easily understand. It moves through each <li> element one by one, looks for the bold span that usually marks the key, cleans it by trimming extra spaces and removing trailing colons, and then removes that bold span so the remaining text becomes the value. This approach works well with Amazon-style layouts where bold text such as “Manufacturer:” or “Item Weight:” is placed at the start, followed by the details. The outcome is stored in a simple dictionary, making it easy for other parts of the scraper—whether they rely on playwright_fetch, fetch_using_curl, or helpers like safe_text and parse_key_value_rows—to work with consistent data. This creates a smooth flow where the raw HTML first gets fetched, then cleaned, and finally organized into structured information, helping beginners see how every step naturally leads into the next without adding unnecessary technical difficulty.
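A short, made-up bullet list shows how the bold key and the remaining text get split apart; the labels and values are illustrative only:

from bs4 import BeautifulSoup

list_html = """
<ul>
  <li><span class="a-text-bold">Manufacturer :</span> Example Wellness Pvt Ltd</li>
  <li><span class="a-text-bold">Item Weight :</span> 30 g</li>
  <li>A plain bullet without a bold key is ignored</li>
</ul>
"""
soup = BeautifulSoup(list_html, "html.parser")
print(parse_list_items(soup))
# {'manufacturer': 'Example Wellness Pvt Ltd', 'item weight': '30 g'}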


Finding the ASIN: How the Code Locates Amazon’s Product ID

# EXTRACT ASIN

def extract_asin(soup: BeautifulSoup, html: str) -> Optional[str]:

    """
    Extract the ASIN (Amazon Standard Identification Number) from a product page.

    This function attempts multiple reliable methods to locate the ASIN, since
    Amazon places it in different locations depending on the product layout.

    Extraction steps:
      1. Check product detail tables (<tr> rows) using parsed key–value pairs.
      2. Look for hidden input fields such as:
            <input id="ASIN" value="B09ABC1234"> 
            <input name="ASIN" ...>
      3. Use a regex fallback to search directly in the raw HTML.


    Why this is helpful
    -------------------
    Amazon pages use multiple layouts, so relying on a single selector often fails.
    This function increases extraction accuracy by checking all common ASIN locations.
    """

    # 1) from product details
    kv = parse_key_value_rows(soup)
    for kname, v in kv.items():
        if "asin" in kname:
            m = re.search(r"([A-Z0-9]{10})", v)
            if m:
                return m.group(1)
            return v
    # 2) input fields
    el = soup.select_one("#ASIN") or soup.select_one("input[name='ASIN']")
    if el:
        return el.get("value") or el.get_text(strip=True)
    # 3) regex
    m = re.search(r"ASIN[^A-Z0-9]*([A-Z0-9]{10})", html, re.I)
    if m:
        return m.group(1)
    return None

Understanding how the extract_asin function works becomes much easier when thinking of an Amazon product page as a large room full of scattered clues, and the ASIN is the small but important label that identifies the product, similar to a unique ID on a warehouse box; the function patiently walks through different corners of this room to locate that label, starting with the product-detail section where Amazon often places key information inside neat table rows, and this part is decoded using a helper like parse_key_value_rows, which gathers structured data from HTML as explained earlier; if the ASIN hides somewhere else, the function then checks input fields inside the page—tags such as <input id="ASIN" value="...">—because Amazon sometimes stores important data in hidden fields meant for internal site use; and when both of these methods come up empty, the function uses a regex search directly on the raw HTML string, which acts like scanning the entire page for any text that matches the ASIN pattern, ensuring nothing is missed; this multi-step approach works well because Amazon frequently changes layouts, and relying on one fixed selector is rarely enough, so the function quietly adapts by checking several likely spots before giving up. The heart of this function is simply about being thorough—reading the page carefully, searching step by step, and returning the ASIN only when a confident match is found, making the overall scraper more reliable even when Amazon updates page designs.
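The three fallbacks are easy to see with small, made-up fragments; the ASIN value below is hypothetical:

from bs4 import BeautifulSoup

# 1) Detail-table route
html = "<table><tr><th>ASIN</th><td>B0EXAMPLE1</td></tr></table>"
print(extract_asin(BeautifulSoup(html, "html.parser"), html))  # 'B0EXAMPLE1'

# 2) Hidden-input route
html = '<input id="ASIN" value="B0EXAMPLE1">'
print(extract_asin(BeautifulSoup(html, "html.parser"), html))  # 'B0EXAMPLE1'

# 3) Raw-HTML regex fallback
html = '<script>var data = {"ASIN":"B0EXAMPLE1"};</script>'
print(extract_asin(BeautifulSoup(html, "html.parser"), html))  # 'B0EXAMPLE1'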


Step-by-Step Logic Behind Amazon Product Parsing

# GATHER DETAILS FROM SOUP

def gather_details_from_soup(soup: BeautifulSoup, html: str) -> dict:

    """
    Extract key product details from an Amazon product page.

    This function combines multiple parsing helpers to collect important
    fields such as title, brand, price, MRP, and ASIN. It reads data from
    different parts of the page (tables, lists, and direct selectors), ensuring
    that values are captured even when Amazon changes layouts.

    What this function does:
      - Builds a dictionary of key–value pairs from <tr> rows.
      - Extracts list-based key–value data from <li> elements.
      - Defines a helper (`get_field`) to search for values using multiple
        possible field names (e.g., “brand”, “manufacturer”).
      - Grabs specific product fields using CSS selectors and fallbacks.
      - Extracts ASIN using multiple techniques.

    """

    kv = parse_key_value_rows(soup)
    li_map = parse_list_items(soup)

    def get_field(possible: List[str]) -> Optional[str]:
        for name in possible:
            ln = name.lower()
            if ln in kv:
                return kv[ln]
            if ln in li_map:
                return li_map[ln]
        return None

    # Only extracting specified fields
    title = safe_text(soup, "#productTitle")
    brand = safe_text(soup, "#bylineInfo") or get_field(["brand", "manufacturer"])
    price = safe_text(soup, "span.a-price-whole") or safe_text(soup, ".a-price .a-offscreen")
    mrp = safe_text(soup, "span.a-text-price span[aria-hidden='true']") or get_field(["mrp", "list price", "price"])

    # Extract ASIN only for URL cleaning
    asin = extract_asin(soup, html) or get_field(["asin"])

    return {
        "title": title,
        "brand": brand,
        "price": price,
        "mrp": mrp,
        "asin": asin  # Only used for URL cleaning
    }

The gather_details_from_soup function works like a careful reader that goes through an Amazon product page and slowly picks up the details that matter, using both structured sections and scattered elements in the HTML to build a clear set of values; it begins by calling parse_key_value_rows and parse_list_items, which act like small helpers that organize information found inside table rows and list items, making it easier to look up fields later without searching the whole page again, and then defines a tiny helper named get_field that patiently checks different possible names for the same detail—because one product may label the brand as “Brand” while another uses “Manufacturer”—so the function remains flexible even when layouts vary; once these foundations are ready, the code turns its attention to visible elements using safe_text, reading the product title from #productTitle, the brand from #bylineInfo, and the price from selectors such as .a-price .a-offscreen, with the function gently switching to fallback options when something is missing; the MRP is gathered the same way, sometimes through a selector and sometimes from the earlier dictionaries, depending on the page structure; finally, the ASIN is extracted with extract_asin, which checks multiple places inside the HTML before giving up, making it useful for tasks like URL cleaning where this identifier is needed. By the time the function finishes, it returns a simple dictionary holding the title, brand, price, MRP, and ASIN, bringing together all the pieces it collected in a way that feels like watching a puzzle come together naturally, without abrupt jumps or isolated steps.
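Putting the pieces together, a tiny end-to-end illustration with a made-up fragment of a product page shows the dictionary this function returns; every value here is invented for demonstration:

from bs4 import BeautifulSoup

page_html = """
<span id="productTitle"> Example Menstrual Cup, Medium </span>
<a id="bylineInfo">Visit the ExampleBrand Store</a>
<span class="a-price"><span class="a-offscreen">₹349</span></span>
<table><tr><th>ASIN</th><td>B0EXAMPLE1</td></tr></table>
"""
soup = BeautifulSoup(page_html, "html.parser")
print(gather_details_from_soup(soup, page_html))
# {'title': 'Example Menstrual Cup, Medium',
#  'brand': 'Visit the ExampleBrand Store',
#  'price': '₹349', 'mrp': None, 'asin': 'B0EXAMPLE1'}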


Scraping a Single Product Page with Dual-Fetch Strategy

# Scrape single product (dual fetch)

async def scrape_product(page: Page, url: str, ua: str) -> Optional[dict]:

    """
    Scrape a single Amazon product page using a dual-fetch strategy.

    This function tries two different methods to fetch the HTML:
      1. Playwright (primary, handles dynamic content)
      2. curl-cffi/requests (fallback if Playwright fails)

    After fetching the HTML, it parses the page using BeautifulSoup and extracts
    key product fields such as title, brand, price, and MRP. It also extracts the
    ASIN for generating a clean canonical Amazon URL.

    Steps performed:
      - Attempt HTML fetch via Playwright.
      - If Playwright fails, retry using curl-based request.
      - Parse the page with BeautifulSoup.
      - Extract product details using `gather_details_from_soup`.
      - Generate a cleaned Amazon URL using ASIN (if available).
      - Remove ASIN from the final returned dictionary (not stored in DB).
    """
        
    html = await playwright_fetch(page, url)
    if html is None:
        html = fetch_using_curl(url, ua)


    if html is None:
        logging.error(f"Both curl & Playwright failed for {url}")
        return None

    soup = BeautifulSoup(html, "html.parser")
    details = gather_details_from_soup(soup, html)
    asin = details.get("asin")
    details["url"] = clean_amazon_url(url, asin)
    
    # Remove asin from final data since we don't store it for sample data
    if "asin" in details:
        del details["asin"]

    return details

The scrape_product function works like a careful two-step safety net for loading an Amazon product page, beginning with Playwright to handle pages that rely on dynamic content and then quietly switching to a curl-based request if the first attempt does not return any HTML, giving the script a dependable way to keep moving without stopping at the first failure; once the HTML is available, the function hands the page to BeautifulSoup, which works like a gentle reader that turns the raw markup into something easier to navigate, allowing gather_details_from_soup to pull out the title, brand, price, MRP, and the ASIN with the help of several parsing helpers placed throughout the codebase; after collecting the data, the function creates a clean Amazon URL through clean_amazon_url, using the ASIN to remove long tracking parameters and give a neater link, similar to tidying a long string into something short and readable, and as the final touch, the ASIN is removed from the output because it is needed only for URL cleaning and not for storing in the final dataset, keeping the returned dictionary focused on the essential fields; with this flow, the entire process feels like watching someone methodically fetch the page, fall back when necessary, parse the content, extract meaningful information, and hand back an organized result without abrupt jumps or technical clutter, making the logic easy to follow even for someone new to scraping.
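The clean_amazon_url helper referenced here is defined elsewhere in the project and is not reproduced in this section; a plausible sketch of the idea (a short canonical /dp/ASIN link when the ASIN is known, otherwise the same URL with tracking parameters stripped) could look like this:

from typing import Optional
from urllib.parse import urlsplit, urlunsplit

def clean_amazon_url(url: str, asin: Optional[str]) -> str:
    # Sketch only; the project's real helper may differ in detail.
    parts = urlsplit(url)
    if asin:
        # Rebuild a short canonical product link on the same domain
        return f"{parts.scheme}://{parts.netloc}/dp/{asin}"
    # Otherwise drop query strings and fragments (ref tags, tracking ids)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))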


Coordinating the Scraping Workflow with the Main Function

# MAIN RUNNER

async def main(limit: Optional[int] = None):
    """
    Main controller function for scraping Amazon product pages.

    This function manages the entire scraping workflow:
    1. Loads pending product URLs from the SQLite database.
    2. Opens a Playwright browser session (Chromium).
    3. For each URL:
        - Selects a random User-Agent.
        - Creates a fresh browser context and page.
        - Applies stealth mode to reduce detection.
        - Attempts to scrape product data using Playwright first,
          and falls back to curl-cffi if needed.
        - Cleans and normalizes extracted data.
        - Saves the scraped data to the database.
        - Marks the URL as "scraped" in the product_urls table.

    Returns
    -------
    None
        This function does not return data. All results are stored directly
        in SQLite and logged through the logger.
    """

    init_db()
    pending = fetch_pending_urls(limit)

    if not pending:
        print("No URLs left.")
        logging.info("No URLs left to process.")
        return

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=False)

        try:
            for row_id, url in pending:

                # Pick user agent once for this URL
                ua = random.choice(USER_AGENTS)

                # Create context/page per URL
                context = await browser.new_context(user_agent=ua)
                page = await context.new_page()

                # Apply stealth BEFORE scraping
                try:
                    await stealth_async(page)
                except Exception as e:
                    logging.debug(f"stealth_async warning: {e}")

                # Scrape product using dual method (Playwright → curl)
                try:
                    data = await scrape_product(page, url, ua)
                except Exception as e:
                    logging.exception(f"Unexpected scrape error for {url}: {e}")
                    data = None

                # Save only if successful
                if data:
                    save_product(data)
                    mark_scraped(row_id)
                    logging.info(f"Saved & marked scraped: {url}")
                else:
                    logging.error(f"Failed to scrape (kept pending): {url}")

                await context.close()
                await asyncio.sleep(random.uniform(*SLEEP_RANGE))

        finally:
            await browser.close()

The main function acts like the central coordinator that quietly guides the entire scraping process from start to finish, beginning with a simple step of pulling the list of product links stored in a SQLite database and then preparing a Playwright browser session so each page can be visited in a controlled environment; once the pending URLs are ready, the function moves through them one by one, choosing a random User-Agent for each request so the scraper behaves more like a regular visitor, and resources such as new browser contexts are created fresh for every link to avoid letting any previous page leave behind clues that automation is being used, which is especially important for websites with detection systems; before opening each page, stealth mode is enabled to reduce signs of automation, and after that, the scraper tries to collect details through scrape_product, where Playwright handles most pages and curl-cffi becomes a fallback if dynamic loading fails; successful results are saved in the database using helper functions like save_product, and the URL is marked as scraped so the system knows not to repeat the work the next time the script runs, allowing the whole workflow to feel smooth and dependable even if some pages fail and remain pending for later attempts; the function closes each context carefully, adds a small random delay to mimic normal browsing habits, and eventually shuts down the browser once every link has been processed, bringing the scraping cycle to a clean finish without returning data directly because everything is already written into SQLite and captured through logs for review.
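The runner also depends on two module-level settings defined elsewhere in the script: the USER_AGENTS pool and the SLEEP_RANGE pause window. The values below are illustrative placeholders, not the project's actual configuration:

# Illustrative values only; the real list and range may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
]
SLEEP_RANGE = (3.0, 8.0)  # seconds to pause between product pages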


Endpoint

# ENDPOINT

""" Runs the main function when the script is executed directly."""

if __name__ == "__main__":
    # optional: run with a limit
    asyncio.run(main(limit=None))

This small endpoint acts like the front door of the script, making sure the main scraping routine starts only when the file is run directly rather than being imported somewhere else. Think of it as a simple trigger that tells Python to launch the main function, which then carries the entire workflow forward. Adding the optional limit parameter gives extra control during testing, helping the script run safely without processing more pages than needed while still keeping the flow of execution clean and predictable.
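For a quick test run, the same entry point can be capped to a handful of URLs, for example:

if __name__ == "__main__":
    asyncio.run(main(limit=5))  # process only the first 5 pending URLs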


Conclusion


Looking back at the full journey, what began as a simple list of menstrual cup URLs gradually became a clean, analysis-ready dataset, and every part of the pipeline played its role in getting there: the URL-collection stage gathered each product link into SQLite, the dual-fetch strategy with Playwright and curl-cffi kept pages loading reliably even when Amazon pushed back, and the parsing helpers quietly turned inconsistent tables and bullet lists into tidy titles, brands, prices, and MRPs. With the data organized this way, the menstrual cup segment becomes much easier to read: which brands appear most often, how listed prices compare with their MRPs, and which features, such as medical-grade silicone and reusable designs, keep showing up in the products buyers gravitate toward. For sellers, researchers, or anyone curious about this growing corner of women’s wellness products, a carefully built scraping workflow turns scattered product pages into clear, trustworthy insights, and the same approach can be extended to reviews, ratings, and other categories whenever a deeper analysis is needed.


Libraries and Versions Used


asyncio: Built-in Python module

random: Built-in Python module

sqlite3: Built-in Python module

json: Built-in Python module

logging: Built-in Python module

re: Built-in Python module

contextlib: Built-in Python module

urllib.parse: Built-in Python module

pathlib: Built-in Python module

BeautifulSoup (bs4): 4.12.3

playwright: 1.48.0

playwright-stealth: 1.0.6

curl-cffi (requests module): 0.6.2


AUTHOR


I’m Anusha P O, a Data Science Intern at Datahut, with hands-on experience in building intelligent, scalable web-scraping systems. In this blog, I break down how we extracted structured product information for Menstrual Cup listings from Amazon using a hybrid scraping pipeline powered by Playwright, curl-cffi, SQLite, JSON, and an async dual-fetch workflow. From handling anti-bot challenges to cleaning and normalizing product attributes, this project shows how messy Amazon product pages can be transformed into clean, reliable, analysis-ready datasets.


At Datahut, we help businesses unlock the power of web data by designing robust scraping architectures for price monitoring, product research, competitive intelligence, and large-scale e-commerce analytics. If you’re working on data-driven strategies for online retail or want to organize large product datasets efficiently, feel free to reach out through the chat widget on the right. Let’s turn raw web data into clear, actionable insights.


FAQ SECTION


1. What is the purpose of scraping Amazon’s menstrual cup data?

Scraping menstrual cup data helps analyze pricing, reviews, ratings, product features, and competitor positioning. It enables ecommerce sellers and analysts to make informed decisions based on real-time market trends.


2. Why use Playwright for scraping Amazon product pages?

Playwright is ideal for handling dynamic content, JavaScript rendering, and tough anti-bot measures. It mimics human browsing behavior, making it more reliable for scraping Amazon product detail and listing pages.


3. How does curl-cffi help in scraping Amazon?

curl-cffi bypasses strict bot detection by using TLS fingerprinting identical to real browsers. This makes it effective for fetching Amazon HTML pages without being blocked.


4. Is it legal to scrape Amazon?

Web scraping Amazon for personal research, price monitoring, or competitive insights is generally allowed when done ethically—without breaching login walls, violating robots.txt, or harming servers. Always follow Amazon’s terms and respect data privacy laws like GDPR.


5. What insights can be extracted from menstrual cup product data?

You can extract price variations, best-selling brands, rating distribution, keyword-rich descriptions, feature comparisons, and customer sentiment insights to support ecommerce product analysis.

Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
