How to Scrape Amazon Dog-food Using Python Libraries
- Anusha P O
 - Sep 17
 - 30 min read
 
Updated: 7 days ago

Have you ever wondered how major e-commerce platforms manage thousands of product listings across categories like electronics, fashion, or even pet food? One of the biggest names in this space, Amazon, holds an enormous inventory that spans nearly every product imaginable—earning it the title “The Everything Store.” Founded in 1994 by Jeff Bezos, Amazon began as a humble online bookstore and grew into one of the world’s most influential tech giants. Beyond e-commerce, its reach extends into cloud computing (AWS), digital streaming (Prime Video, Audible), consumer electronics (Kindle, Fire TV), and AI innovations. With over 4000+ products listed in just one category—dog food—it becomes clear why many businesses turn to web scraping to collect and analyze such vast information efficiently.
This blog walks through a beginner-friendly journey into web scraping, focusing on the dog food section of Amazon US. If you're a Data Analyst, or Product Researcher, and you're curious about how to gather large-scale data to improve decision-making or stay ahead of the competition, you're in the right place. Whether you're just exploring or actively planning a data-driven strategy, you’ll learn how structured scraping methods can transform unmanageable volumes of online data into usable, insightful formats for data analysis, pricing strategies, trend monitoring, and much more.
Curious how data can sharpen your edge in the market? Let’s dive into the scraping process and unlock the real value behind e-commerce product data. Want to supercharge your business decisions with real-time insights? Reach out today and let’s make data work for you.
Smart and Automated Data Gathering
What’s the best way to gather information when a website has thousands of product listings spread across multiple pages? That’s where web scraping becomes a game-changer. The process begins by collecting all the product URLs from a chosen category—like dog food on Amazon US—and then visiting each of those links to extract useful details such as product names, prices, ratings, and more. This structured data collection method helps turn scattered online content into organized datasets ready for data analysis, empowering e-commerce teams, product managers, and analysts to make smarter business decisions.
Step 1: Collecting Product URLs
Have you ever noticed how product listings on Amazon don’t appear all at once? Instead, they’re spread across multiple pages, especially in large categories. To collect all the product links, the first step was to change the delivery location from India to a U.S. zip code—10001, New York—so that only relevant U.S. listings would appear. Once the location was set, the scraper navigated through each paginated page of the dog food section, capturing every product URL displayed.
A browser automation tool like Playwright played a key role here. It behaved like a real user—opening pages, waiting for content to load, and clicking the “Next” button to move through each page. Along the way, it handled pop-ups and ensured that only clean, working links were gathered. These product URLs—over 4,000 in total—were saved neatly into a SQLite database, making it easier to manage, avoid duplicates, and prepare for the next step: collecting detailed product information from each link. This structured setup ensures the data remains organized and ready for smooth analysis later.
Step 2: Collecting Information from Product URLs
So what happens after collecting thousands of product URLs? The next step is to visit each link and carefully extract detailed information about every item listed. For accurate results, the delivery location on Amazon was first set to New York, zip code 10001, so that only U.S.-relevant product details would appear. Then, each page was opened one by one using Playwright, which behaves like a real user—waiting for content to load, handling dynamic elements, and scrolling where needed. This approach helped gather key product attributes like the title, price, brand, number of reviews, availability, description, and more.
All of this information was directly saved into the same database used earlier for the URLs, keeping everything neatly organized in one place. This setup not only simplifies data management but also ensures there’s a clear link between each product and its details. With structured and reliable data in hand, the foundation is now set for deeper data analysis, such as comparing prices, identifying trends, or tracking brand performance.
Step 3: Cleaning the Extracted Amazon Dog Food Data
Have you ever opened a spreadsheet full of messy, inconsistent data and wondered how anyone makes sense of it? That’s a common situation after collecting product information from large websites like Amazon. Even when every page is visited carefully and data is stored correctly, the result can include unwanted duplicates, inconsistent formatting, and symbols like the dollar sign cluttering up price fields. For example, the same product might appear under slightly different URLs, leading to multiple identical entries that need to be removed.
To clean and prepare the data, tools like OpenRefine are incredibly useful. It’s like a smarter, more powerful version of Excel, letting you spot duplicates, fill in missing values with “N/A,” and fix typos or inconsistent brand names with just a few clicks. And for more advanced cleanup, especially when dealing with large datasets or hidden formatting issues, Python’s pandas library is an excellent choice. It works behind the scenes to strip out extra symbols, tidy up text, and format numbers so they’re easy to analyze. Cleaning might not feel as exciting as data collection, but it’s a critical step—because well-prepared data is what turns raw information into reliable insights.
Comprehensive Tool-kits for Efficient Data Extraction
What’s the secret to collecting data from thousands of product pages without losing speed or structure? The answer lies in combining the right tools and libraries. When it comes to web scraping at scale, efficiency and reliability go hand in hand—and that’s only possible with a well-chosen tech stack. This setup uses a group of powerful Python libraries that automate browsing, extract content, store data, and handle background tasks—all working together like parts of a well-oiled machine.
At the heart of this system is asyncio, which allows multiple operations to run at the same time. Instead of waiting for each product page to load and finish one by one, asyncio keeps things moving—helping to process hundreds of pages without slowing down. Alongside this is Playwright’s async API, a modern browser automation library that opens web pages, scrolls, clicks, and extracts content as if a human were doing it. Combined with Playwright Stealth, the tool becomes even more powerful by masking automation patterns to avoid detection on websites with anti-bot mechanisms.
Once the product data is collected, it needs to be stored in a clean and organized way. For this, sqlite3 is used to manage URLs and structured information in a lightweight database. It’s simple, fast, and doesn’t need a separate server—making it ideal for scraping workflows. For storing detailed product data with flexible structure, MongoDB is a better fit, and can easily handle complex, unstructured content like descriptions, reviews, and specifications using pymongo.
Supporting tools like BeautifulSoup help extract text from HTML, while modules such as random, logging, and pathlib play vital roles behind the scenes—mimicking natural delays, managing error logs, and organizing data files. Together, this combination of libraries builds a resilient and scalable scraping system that can collect, process, and store thousands of data points with precision.
With these tools working in harmony, even large-scale data extraction—from dynamic e-commerce sites to paginated product catalogs—becomes not only possible, but efficient and well-managed.
Step 1: Scraping Product URLs from the Dog Food Section on Amazon
Importing Libraries
import asyncio
import random
import sqlite3
import logging
from pathlib import Path
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from playwright.async_api import async_playwrightEver wondered how a script knows how to browse a website, click buttons, or collect data like a human would? It all starts with the right set of Python libraries—prebuilt tool-kits that save time and make automation easier. In this project, several libraries work together to make the process of collecting product URLs from Amazon’s dog food section smooth and efficient.
The most important tool here is playwright.async_api, a browser automation library that simulates real user behavior. It can open Amazon pages, scroll, wait for content to load, and even handle dynamic elements—all without manual clicks. This makes it ideal for interacting with paginated product listings that require multiple actions to reveal all items.
To keep track of all the URLs being collected, sqlite3 is used as a lightweight database. It stores each product link in an organized format, making it easy to manage and access later. This database lives right on the local system, making it both portable and efficient for quick lookup.
Meanwhile, logging is set up to create detailed records of what the script is doing—whether it’s successfully collecting data or running into issues. Each log entry is timestamped, helping developers understand what happened and when. Additional tools like BeautifulSoup help parse HTML content, urljoin builds complete URLs, and random introduces slight delays between actions, mimicking human browsing and reducing the chance of getting blocked.
Together, these libraries form the backbone of a smart, stable web scraping setup. With each one playing a specific role, they make the data collection process faster, safer, and more reliable—especially when dealing with large volumes of dynamic content.
Getting Started: Where to Save, What to Log, and Where Scraping Begins
# Config paths
USER_AGENTS_PATH = "/home/anusha/Desktop/DATAHUT/Macys_clothing/user_agents.txt"
DB_PATH = "/home/anusha/Desktop/DATAHUT/Amazon/Data/US/dogfood_us1.db"
LOG_PATH = "/home/anusha/Desktop/DATAHUT/Amazon/Data/US/dogfood_scraper1.log"
# Amazon URLs
BASE_URL = "https://www.amazon.com"
START_URL = "https://www.amazon.com/gp/browse.html?node=2975359011&ref_=nav_em__sd_df_0_2_21_4"
"""
Configuration Constants
This section defines important configuration variables used throughout the script.
1. USER_AGENTS_PATH:
   - Path to a text file containing a list of user agent strings.
   - A user agent string simulates a specific browser/device when making requests to the website.
   - Helps reduce the chances of getting blocked by rotating user agents.
   - Example entry in file: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/...
2. DB_PATH:
   - Full path to the SQLite database file where scraped product URLs will be saved.
   - The database ensures persistent and duplicate-free storage of links.
3. LOG_PATH:
   - Path to the file where all logs (info, warnings, and errors) will be written.
   - Useful for debugging and tracking the scraping process.
4. BASE_URL:
   - The root domain of the website being scraped, in this case, Amazon US.
   - Used to build absolute URLs when only relative links are found on the page.
5. START_URL:
   - The first URL to begin scraping from.
   - This is the Amazon category page for dog food.
   - The scraper will start from this URL and then follow pagination to scrape further pages.
"""How does a scraper know where to start, where to save data, or how to act like a real browser? The answer lies in its configuration settings—a set of predefined paths and constants that give structure to the entire process. These settings are like the map and tools needed before setting off on a data collection journey. They guide the script on what page to begin with, where to save its progress, and how to reduce the chances of being blocked.
To begin with, a user agent file is provided at a specific path. This text file contains different browser signatures that the scraper can rotate through. Every time the scraper makes a request, it can pretend to be a different browser or device—like Chrome on Windows or Safari on iPhone—helping it blend in and stay under the radar. This is especially useful for websites like Amazon, which are quick to detect automated behavior.
Next is the database path, pointing to a SQLite file that acts as a storage unit for all the collected product URLs and data. Unlike temporary memory, this file keeps everything safe even if the process is interrupted. It also helps avoid duplicates by tracking what’s already been saved. Then there’s the log file path, which records every step the script takes—successes, warnings, errors, and even timestamps. These logs act like a black box, making it easier to troubleshoot issues or review scraping performance.
Lastly, the base URL and start URL set the foundation for navigation. The base URL defines the main site—https://www.amazon.com—while the start URL leads directly to the dog food category. From there, the scraper knows where to begin and how to explore further using pagination. Together, these configuration paths keep the process well-structured, transparent, and ready for large-scale data extraction.
Setting Up Amazon Cookies to Act Like a Real User
# Predefined cookies to simulate a session 
cookies = [
   {"name": "i18n-prefs", "value": "USD", "domain": ".amazon.com", "path": "/"},
   {"name": "lc-main", "value": "en_US", "domain": ".amazon.com", "path": "/"},
   {"name": "session-id", "value": "131-4818556-2161121", "domain": ".amazon.com", "path": "/"},
   {"name": "session-id-time", "value": "2082787201l", "domain": ".amazon.com", "path": "/"},
   {"name": "ubid-main", "value": "134-2297602-3092101", "domain": ".amazon.com", "path": "/"}
]
"""
Amazon uses cookies to manage sessions, regional preferences, and user-specific settings.
By setting these cookies manually in the browser context, we simulate a session that:
1. Prevents Amazon from redirecting to a different country site.
2. Loads pages with English language and USD currency preferences.
3. Appears more like a real user session to avoid bot detection.
Each cookie is a dictionary with the following keys:
- "name": The name of the cookie (e.g., "i18n-prefs").
- "value": The value associated with that cookie.
- "domain": The domain to which the cookie applies ("amazon.com").
- "path": The path within the domain where the cookie is valid (usually "/").
"""When accessing websites like Amazon, simply sending automated requests often isn’t enough. That’s because Amazon, like many large e-commerce platforms, closely monitors browsing behavior to detect bots. It uses cookies—tiny pieces of data stored by your browser—to track things like region, currency, and session details. If these aren’t set properly, the website may behave differently or even block access. That’s why configuring predefined cookies is an important step in building a scraper that mimics real user behavior.
By manually setting cookies, the scraper can simulate a realistic browsing session. For example, specifying cookies like "i18n-prefs" and "lc-main" ensures that pages load in English and display prices in USD—which is important when targeting the U.S. Amazon site. Other cookies, such as "session-id" and "ubid-main", give the scraper a consistent session identity, helping it appear more like a regular shopper instead of a script. These details reduce the risk of being redirected to a different country site or triggering anti-bot defenses.
Each cookie is defined as a small dictionary that includes the name, value, domain, and path. Together, these pieces act like an invisible user passport—helping the scraper blend in, stay on the correct site version, and avoid unnecessary blocks. For anyone building reliable web scraping workflows, especially for data analysis and product tracking on Amazon, managing cookies is a quiet but powerful way to keep the process smooth and uninterrupted.
Logging Configuration
# Logging Configuration
logging.basicConfig(filename=LOG_PATH, level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
"""
Logging Configuration
This setup initializes a basic logging system that records events during the scraping process.
It helps track the scraper’s progress, debug issues, and maintain a record of what happened during execution.
Configuration parameters:
- filename: The full path to the log file where messages will be saved (LOG_PATH).
- level: The minimum severity level of messages to log. 'INFO' logs info, warnings, and errors.
- format: The layout of each log message. It includes:
   - %(asctime)s: Timestamp of the log entry.
   - %(levelname)s: The severity of the log message (INFO, ERROR, etc.).
   - %(message)s: The actual log message.
"""When collecting data from hundreds or thousands of web pages, how do you keep track of what your script is doing behind the scenes? This is where logging becomes essential. A well-configured logging system acts like a live journal for your scraper—recording everything from routine progress to unexpected errors. In this setup, Python’s logging.basicConfig() is used to capture important events in a clean and readable format.
The log file stores entries with a timestamp, the type of message (like INFO or ERROR), and a short description of what happened. By setting the log level to INFO, the system records not only problems but also routine actions—making it easier to understand where the scraper is in its workflow. This becomes incredibly useful when reviewing large-scale data extraction tasks, debugging failed requests, or simply ensuring that each product page was processed as expected. With proper logging, you gain visibility and control—two things every e-commerce analyst or data engineer needs for reliable automation.
Database Management
# SQLite Database Initialization
def init_db():
   """
   Initializes the SQLite database by creating the `product_urls` table if it does not exist.
   This table stores unique product URLs scraped from Amazon.
   """
   with sqlite3.connect(DB_PATH) as conn:
       cursor = conn.cursor()
       cursor.execute("""
           CREATE TABLE IF NOT EXISTS product_urls (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               url TEXT UNIQUE
           )
       """)
       conn.commit()When working with large-scale web scraping projects, managing the growing list of collected URLs becomes just as important as extracting the data itself. To keep things organized and avoid duplication, a SQLite database is often used. It’s lightweight, easy to set up, and doesn’t require any server—making it perfect for e-commerce data collection tasks.
In this setup, a function is used to initialize the database by creating a table named product_urls if it doesn’t already exist. Each URL is stored uniquely, thanks to a constraint that prevents the same link from being saved more than once. This clean structure ensures that no product page is visited twice, which saves both time and resources. For data analysts, developers, or product teams, it’s a practical way to maintain control over a constantly growing dataset.
Setting the ZIP Code for Accurate Amazon Data
# Change ZIP Code Function
async def change_zip_code(page):
   """
   Change the delivery ZIP code on Amazon to ensure region-specific product availability.
   Purpose:
   Amazon displays different products based on the delivery ZIP code (location). This function:
   - Simulates a user manually changing the delivery address to a specific ZIP code (10001 - New York).
   - Ensures that product listings are consistent and not filtered out due to regional shipping restrictions.
   - Helps retrieve more complete product listings during scraping.
   How it works:
   1. Click the location link in the top navigation bar.
   2. Waits for the ZIP code input modal to appear.
   3. Inputs the ZIP code `10001` and submits the change.
   4. Waits for confirmation and checks if the ZIP code was successfully applied.
   It logs each step, and handles errors gracefully in case any interaction fails.
   - A wait timeout is added for each step to ensure elements have time to load.
   - If the ZIP code change is successful, a success log is generated. If not, a warning is logged.
   - If any error occurs during the process, it is caught and logged as an error.
   """
   logging.info("Changing delivery zip code to 10001 (New York)")
   try:
       await page.wait_for_selector("#nav-global-location-popover-link", timeout=15000)
       await page.click("#nav-global-location-popover-link")
       logging.info("Clicked delivery location button")
       await page.wait_for_selector("#GLUXZipUpdateInput", timeout=15000)
       await page.fill("#GLUXZipUpdateInput", "10001")
       logging.info("Entered zip code 10001")
       await page.wait_for_selector("#GLUXZipUpdate span input[type='submit']", timeout=10000)
       await page.click("#GLUXZipUpdate span input[type='submit']")
       logging.info("Clicked Apply button")
       await page.wait_for_selector("button[name='glowDoneButton']", timeout=10000)
       await page.click("button[name='glowDoneButton']")
       logging.info("Clicked Done button")
       await page.wait_for_selector("#glow-ingress-line2", timeout=10000)
       delivery_text = await page.inner_text("#glow-ingress-line2")
       if "10001" in delivery_text:
           logging.info(f"Delivery location successfully set to: {delivery_text.strip()}")
       else:
           logging.warning(f"Delivery location not updated as expected. Current: {delivery_text.strip()}")
       await page.wait_for_timeout(2000)
   except Exception as e:
       logging.error(f"Failed to change zip code: {e}")Have you ever noticed how Amazon shows different products depending on where you’re located? That’s because availability, pricing, and even product listings often depend on your delivery ZIP code. So, if a scraper runs without setting a U.S. ZIP code, it might miss out on items that are only shown to American customers. To avoid this, a ZIP code update function can be used to simulate a user manually setting their location—specifically to New York’s 10001 ZIP code.
This is more than just a minor tweak. Automating the ZIP code change ensures the scraper collects region-specific results, reflecting what a real user in New York would see. The function interacts with Amazon’s page by clicking the location button, entering the new ZIP, applying it, and confirming the update. Each step includes a small wait time to match human-like behavior, reducing the risk of getting blocked or receiving incomplete data.
By logging every action—from opening the ZIP input to confirming the location—this function ensures the scraping session starts on solid ground. Whether you’re a data analyst trying to understand market availability or a product manager tracking competitors, this simple yet powerful detail helps ensure consistency and accuracy in your data collection workflow.
Smart Amazon Scraper: Human-Like Browsing for Accurate Product Data
# Scrape Amazon Function
async def scrape_amazon():
   """
   Main Scraping Function for Extracting Amazon Product URLs (Dog Food Category)
   Overview:
   ---------
   This is the main coroutine responsible for the entire scraping process. It uses Playwright (headless browser automation)
   and BeautifulSoup (HTML parsing) to extract product URLs from the Amazon dog food category.
   Key Steps:
   ----------
   1. Database Initialization:
      - Creates a local SQLite database (if not already created) and a table to store product URLs uniquely.
   2. User-Agent Rotation
      - Loads a list of user-agent strings from a local file and selects one at random.
      - This helps simulate different browsers and reduces the chances of getting blocked.
   3. Launch Browser Using Playwright
      - Starts a Firefox browser (in visible mode for debugging, `headless=False`).
      - Opens a new context and page for interaction.
   4. Set Cookies and Headers
      - Injects predefined session cookies to simulate a real user session.
      - Sets a random user-agent header for browser requests.
   5. Navigate to Start URL
      - Opens the Amazon dog food category page.
      - Calls `change_zip_code()` to set the delivery location to ZIP code `10001` (New York).
   6. Scraping Loop (Pagination)
      - Continues visiting each product listing page until no "Next" button is found.
      - On each page:
        a. Waits a random delay (to mimic human behavior).
        b. Loads the HTML content and parses it with BeautifulSoup.
        c. Searches for product anchor tags using multiple CSS selectors.
        d. Cleans and normalizes product URLs (ensuring absolute URLs).
        e. De-duplicates and inserts product URLs into the SQLite database using `INSERT OR IGNORE`.
   7. Pagination Logic
      - Checks for the "Next" page using common selectors.
      - If the selector fails, use JavaScript evaluation as a fallback.
      - Continues scraping until no more next pages are found.
   8. Cleanup
      - After scraping is complete, close the browser session gracefully.
   """
   init_db()
   # Load user agents
   with open(USER_AGENTS_PATH, "r") as f:
       user_agents = [line.strip() for line in f.readlines() if line.strip()]
   async with async_playwright() as p:
       browser = await p.firefox.launch(headless=False)
       context = await browser.new_context()
       page = await context.new_page()
       # Set cookies for session continuity
       await context.add_cookies(cookies)
       # Apply random user agent to reduce blocking risk
       ua = random.choice(user_agents)
       await page.set_extra_http_headers({"User-Agent": ua})
       # Connect to DB
       with sqlite3.connect(DB_PATH) as conn:
           cursor = conn.cursor()
           # Start scraping
           next_page = START_URL
           await page.goto(next_page, timeout=60000)
           await change_zip_code(page)
           while next_page:
               logging.info(f"Scraping page: {next_page}")
               await page.goto(next_page, timeout=120000)
               await page.wait_for_timeout(random.randint(3000, 6000))
               html = await page.content()
               soup = BeautifulSoup(html, "html.parser")
               product_links = []
               # Extract product URLs (main + sponsored)
               selectors = [
                   "div.a-section.a-spacing-none.a-spacing-top-small.s-title-instructions-style a.a-link-normal",
                   "a.a-link-normal.s-line-clamp-3.s-link-style.a-text-normal"
               ]
               for selector in selectors:
                   for a_tag in soup.select(selector):
                       href = a_tag.get("href")
                       if href:
                           # Clean URL: remove multiple BASE_URL prefixes
                           if href.startswith("https://"):
                               cleaned_url = href
                               if cleaned_url.startswith(BASE_URL + "/https://"):
                                   cleaned_url = cleaned_url.replace(BASE_URL + "/", "")
                           else:
                               cleaned_url = urljoin(BASE_URL, href)
                           product_links.append(cleaned_url)
               # De-duplicate and insert
               logging.info(f"Found {len(product_links)} product URLs on page")
               for url in set(product_links):
                   try:
                       cursor.execute("INSERT OR IGNORE INTO product_urls (url) VALUES (?)", (url,))
                       conn.commit()
                   except Exception as e:
                       logging.error(f"Failed to insert URL: {url} | Error: {e}")
               
                # Attempt to find the 'Next' button for pagination
               try:
                   next_page_tag = await page.query_selector("a.s-pagination-next")
                   if not next_page_tag:
                       next_page_tag = await page.query_selector("a.s-pagination-item.s-pagination-next.s-pagination-button.s-pagination-button-accessibility.s-pagination-separator")
                  
                   if next_page_tag:
                       href = await next_page_tag.get_attribute("href")
                       if href:
                           next_page = urljoin(BASE_URL, href)
                           logging.info(f"Next page found: {next_page}")
                       else:
                           next_page = None
                   else:
                       # Fallback using JS evaluation
                       next_href = await page.evaluate("""() => {
                           const next = document.querySelector('a.s-pagination-next');
                           return next ? next.href : null;
                       }""")
                       if next_href:
                           next_page = next_href
                           logging.info(f"Next page found via JS evaluation: {next_page}")
                       else:
                           next_page = None
                           logging.info("No more pages found. Scraping completed.")
               except Exception as e:
                   logging.error(f"Pagination extraction failed: {e}")
                   next_page = None
               # Add wait to mimic human-like behavior before next page
               await page.wait_for_timeout(random.randint(4000, 8000))
       await browser.close()
When collecting product data from massive e-commerce platforms, precision and planning are key. A simple request to load a page and extract links often isn't enough—Amazon constantly changes how data is presented, uses region-based filtering, and includes anti-bot mechanisms. That’s where an advanced scraper, driven by tools like Playwright and BeautifulSoup, comes into play. This setup isn’t just about grabbing data; it’s about thinking like a browser, acting like a human, and adapting like a smart assistant.
At the heart of the process is a function that carefully simulates a user’s journey through the Amazon dog food category. It begins by preparing a local SQLite database to store product URLs without duplication. To behave more naturally online, the scraper randomly chooses a user-agent string—this makes it look like the request is coming from a real browser on someone’s laptop or phone. It also sets session cookies to avoid redirection and triggers a custom ZIP code update to 10001 (New York), ensuring the page shows the intended regional results.
Once on the category page, the scraper patiently browses through each section. Using BeautifulSoup, it parses the HTML, finds product links using multiple selectors, and then standardizes them into clean, usable URLs. These URLs are checked to avoid duplicates and stored into the database with care. The scraping loop continues as long as a "Next" button is found—either directly via page elements or through a smart fallback using JavaScript evaluation.
What makes this scraper efficient isn't just how much it collects, but how gracefully it handles the flow. Logging is used throughout to track activity, errors are caught with minimal disruption, and random delays are introduced between actions to mimic human-like behavior. Once there are no more pages left to visit, the browser session is closed neatly.
For anyone managing e-commerce analytics or conducting competitor research, this kind of setup offers a reliable and structured way to access valuable product data—accurately, safely, and at scale.
Script Entry Point
# Entry Point of the Scraper Script
if name == "__main__":
   asyncio.run(scrape_amazon())
"""
Entry Point of the Scraper Script
This block ensures that the asynchronous `scrape_amazon()` function is executed
only when the script is run directly, not when it is imported as a module.
Key Concepts:
-------------
- `__name__ == "__main__"`:
   This condition checks if the current script is being run as the main program.
   If true, it executes the scraping function.
   If the script is imported into another Python script, this block will be skipped.
- `asyncio.run(scrape_amazon())`:
   This starts the asynchronous scraping coroutine.
   It handles setting up the event loop, running the task, and closing the loop when done.
This will begin scraping product URLs from the Amazon dog food category and store them in a SQLite database.
"""When building a web scraper using Python, one small but important line often determines whether the script actually runs or just sits idle. That line is: if name == "__main__":. At first glance, it might look a bit mysterious. But in simple terms, this line acts like the front door of the program—it checks if someone is running the script directly or just peeking inside it from another file. Only when the script is run directly will the scraping process begin.
Under that condition, the function asyncio.run(scrape_amazon()) is called. This is what launches the main engine of the scraper. Since the scraping function is asynchronous (meaning it performs multiple actions efficiently without waiting around), it needs a special loop to run properly. That’s exactly what asyncio.run() provides. It opens the loop, runs the scrape_amazon() task, and then closes the loop neatly when everything is finished.
This setup is especially useful when organizing code into multiple files or modules. If the script is imported elsewhere, maybe for testing or for use in a larger pipeline, the scraping won’t automatically run. It will only activate when you execute the file directly. This simple structure makes the scraper more reusable, cleaner to manage, and easier to extend later.
Step 2: Extracting Complete Product Details from Each URL
Importing Libraries
import asyncio
import json
import random
import sqlite3
import logging
from pathlib import Path
from bs4 import BeautifulSoup
from playwright.async_api import async_playwrightThe script begins, as expected, by importing essential built-in and third-party Python modules—such as sqlite3, asyncio, playwright, and BeautifulSoup—which provide the core functionality for database handling, asynchronous operations, browser automation, and HTML parsing.
Essential Paths
# Database and file paths
DB_PATH = "/home/anusha/Desktop/DATAHUT/Amazon/Data/US/dogfood_us1.db"
USER_AGENTS_PATH = "/home/anusha/Desktop/DATAHUT/Macys_clothing/user_agents.txt"
OUTPUT_JSON = "/home/anusha/Desktop/DATAHUT/Amazon/Data/US/product_data.json"
LOG_PATH = "/home/anusha/Desktop/DATAHUT/Amazon/Log/US/amazon_data_scraper.log"
"""
These constants define paths and filenames used throughout the scraper:
- DB_PATH: Location of SQLite database storing URLs and scraped product data.
- USER_AGENTS_PATH: Text file with a list of user agent strings.
- OUTPUT_JSON: Final output file to store structured product data.
- LOG_PATH: File path for storing logs.
"""Every well-structured scraping project begins with a solid configuration. Think of it as setting the foundation before constructing a building. In this case, a few constants keep everything in order: the DB_PATH points to where all scraped product URLs and details are stored safely in a SQLite database. The USER_AGENTS_PATH holds a list of browser identities to rotate during scraping, helping avoid detection. Once data is gathered, it’s saved in a clean and structured format using OUTPUT_JSON, making it easy to analyze later. And to track the entire scraping journey—every step, error, or success—LOG_PATH ensures everything is recorded in a neat log file. These paths may seem like small details, but they’re the backbone of a scraper that runs smoothly and reliably.
Applying Amazon Cookies to Mimic Human Browsing
#  Amazon Session Cookies
cookies = [
   {"name": "i18n-prefs", "value": "USD", "domain": ".amazon.com", "path": "/"},
   {"name": "lc-main", "value": "en_US", "domain": ".amazon.com", "path": "/"},
   {"name": "session-id", "value": "131-4818556-2161121", "domain": ".amazon.com", "path": "/"},
   {"name": "session-id-time", "value": "2082787201l", "domain": ".amazon.com", "path": "/"},
   {"name": "ubid-main", "value": "134-2297602-3092101", "domain": ".amazon.com", "path": "/"}
]
"""
Amazon uses cookies to store session and localization information.
These cookies simulate a persistent session for scraping with consistent regional settings.
"""To make sure the scraper views consistent results every time, predefined session cookies are used. These cookies help simulate a real user's experience—maintaining USD currency, English language, and stable session behavior across pages. This approach improves accuracy and reduces the chances of getting blocked during web scraping.
Mimicking Real Browsers with Custom HTTP Headers
# HTTP Headers 
headers_template = {
   "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
   "Accept-Language": "en-US,en;q=0.9",
   "Connection": "keep-alive",
   "Upgrade-Insecure-Requests": "1",
}
"""
HTTP Request Headers Template
This dictionary defines custom HTTP headers that mimic those sent by real web browsers (like Chrome or Firefox).
These headers are passed to the Playwright browser context to make automated requests appear more like
genuine human browsing, reducing the risk of bot detection by Amazon.
Headers Explained:
------------------
1. "Accept":
  - Tells the server what content types the client (browser) can process.
  - The values here indicate support for standard HTML, XHTML, XML, and modern image formats like WebP and AVIF.
  - Helps ensure the server responds with a fully formatted product page.
2. "Accept-Language":
  - Indicates the preferred languages for content.
  - "en-US,en;q=0.9" means US English is preferred, and any English dialect is acceptable as a fallback.
3. "Connection":
  - "keep-alive" allows the connection to stay open for multiple requests, improving speed and mimicking real browser behavior.
4. "Upgrade-Insecure-Requests":
  - Set to "1" to signal that the browser supports secure HTTPS connections and prefers them over HTTP.
Why These Headers Are Important:
--------------------------------
- Amazon and other large websites often use bot detection systems that analyze request headers.
- Sending requests without proper headers (or with default ones) can quickly result in CAPTCHA challenges or IP blocking.
- These headers help bypass basic bot detection by mimicking real users' network behavior.
"""In web scraping, blending in like a real user is half the battle—and that starts with how requests are made. When browsers visit a website like Amazon, they send HTTP headers that carry important context about the request. These headers tell the server what kind of content is expected, which language is preferred, how the connection should behave, and more. By manually setting headers like "Accept", "Accept-Language", and "Connection", the scraper behaves more like an actual browser. This makes the requests look natural, reducing the chance of being flagged or blocked. These small but powerful details help ensure smoother scraping sessions, especially on platforms known for strong bot protection. Want to avoid detection while gathering data? Setting the right headers is one of the smartest first steps.
Logging Configuration
# Logging Configuration
logging.basicConfig(filename=LOG_PATH, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
"""
Logs messages to a file for debugging and auditing.
Logs include timestamps, severity levels, and messages.
"""Keeping track of what happens during a scraping session is essential, especially when working with complex websites like Amazon. That’s where logging comes in. By configuring a logging system with timestamps and message levels (like INFO or ERROR), it's easier to monitor progress, catch issues early, and understand exactly when something went wrong. All logs are saved to a file, creating a clear and searchable record of the scraping process for later review.
Rotating User Agents to Avoid Detection
#  Helper Function to Get Random User Agent 
def get_random_user_agent():
   """
   Returns a random user agent string from the user_agents.txt file.
   Used to simulate requests from different browsers/devices.
   """
   with open(USER_AGENTS_PATH) as f:
       user_agents = f.read().splitlines()
   return random.choice(user_agents)Websites often identify visitors by their browser’s identity, known as a user agent. This tiny piece of information tells the site whether you're using Chrome on Windows, Safari on iPhone, or something else entirely. But for web scraping, sending the same user agent again and again is like knocking on a door wearing the same outfit every time—eventually, someone notices. To solve this, a helper function is used to randomly choose a user agent from a list stored in a text file. Every time the script runs, it picks a different one—like rotating disguises—to avoid detection. This not only helps the scraper look more like a human visitor but also makes it more resilient against basic anti-bot filters. It’s a small move with a big impact, especially on websites that closely watch for suspicious patterns. Need your scraper to fly under the radar? Rotating user agents is a must-have tactic.
Smart Text Extraction with Error-Free Handling
# Helper Function to Extract Text or Return None 
def extract_text_or_none(soup, selector):
   """
   Extracts text from the first HTML element matching the given CSS selector.
   Returns None if the element is not found.
   """
   tag = soup.select_one(selector)
   return tag.get_text(strip=True) if tag else NoneAnother useful trick in the scraper’s toolkit is handling missing or unpredictable data. Web pages often have inconsistent layouts—sometimes an element is there, and sometimes it's not. To deal with this gracefully, a simple helper function is used to extract text from a web element only if it exists. Instead of the script breaking when a product title or price isn’t found, this function quietly returns None, allowing the scraper to move on without interruption. It works by searching for the first HTML tag that matches a given CSS selector, and if found, it neatly pulls out the text. If not, it returns nothing—avoiding crashes and saving time on debugging. This little safeguard helps ensure the scraping process remains stable, even when the web page doesn't always behave as expected. For any team working with dynamic content extraction, this type of function is essential for building resilient, production-ready data pipelines.
Extracting List Price Without Noise
# Strict List Price Extraction 
def extract_list_price_strict(soup):
   """
   Extracts the list price only from the target div containing List Price text.
   Returns None if not found.
   """
   div = soup.select_one("div.a-section.a-spacing-small.aok-align-center")
   if div:
       span = div.select_one("span.a-price.a-text-price span.a-offscreen")
       if span:
           return span.get_text(strip=True)
   return NoneTo go a step further in extracting product pricing, another helper function focuses specifically on capturing the price—that’s the original price shown before any discount is applied. On many e-commerce sites like Amazon, this value is nested deep inside styled HTML blocks, which can be tricky to navigate. This function carefully targets the exact HTML structure where Amazon places the list price, ensuring we don’t accidentally pick up discounted or promotional prices instead. It first looks for a div container that usually holds price information, then drills down into the nested span tags where the actual dollar value is displayed. If the element isn't present—like in cases where no original price is listed—it simply returns None without causing any disruption. This targeted extraction adds another layer of precision to the scraping process, helping analysts capture price comparisons accurately and making data analysis far more reliable.
Extracting Complete Product Details from an Amazon Page
# Product Scraper Function 
async def scrape_product(page, url):
   """
   Scrapes detailed product information from a single Amazon product page.
   Overview:
   ---------
   This function automates visiting a given Amazon product URL using a headless browser
   (via Playwright), interacts with the page to set a delivery ZIP code (ensuring availability
   data is accurate), and extracts structured information using BeautifulSoup.
   Purpose:
   --------
   - Mimic human behavior to avoid Amazon's anti-bot detection.
   - Ensure delivery location is set to ZIP code `10001` (New York) for consistent product visibility.
   - Extract product-related data from the rendered HTML, even if dynamic content is involved.
   - Return clean and structured data in dictionary form for further storage or analysis.
   Function Workflow:
   ------------------
   1. Visit the Product Page:
       - Navigates to the provided URL with a timeout to avoid infinite waits.
   2. Set Delivery Location to ZIP 10001:
       - Simulates the user clicking on the delivery address area.
       - Fills in ZIP code `10001`, applies the change, and closes the popup.
       - Ensures that location-dependent product data is displayed correctly.
   3. Wait for the Page to Load Fully:
       - Uses a hard wait (5 seconds) to give the dynamic content time to render.
   4. Parse the Page Content with BeautifulSoup:
       - Grabs the pages HTML and parses it using BeautifulSoup for easier selector-based extraction.
   5. Extract Product Fields:
       - Uses helper functions (`extract_text_or_none`, `extract_list_price_strict`) to grab each field safely.
       - Handles missing elements gracefully (returns None).
       - Combines feature bullet points into a single string.
   6. Log and Return Extracted Data:
       - Logs the complete dictionary of extracted fields.
       - Returns the dictionary for database insertion or JSON saving.
   """
   logging.info(f"Scraping URL: {url}")
   await page.goto(url, timeout=60000)
   await page.wait_for_timeout(5000)
   # Change delivery location to 10001 (New York)
   try:
       await page.click('#nav-global-location-data-modal-action', timeout=10000)
       await page.wait_for_timeout(2000)
       await page.fill('#GLUXZipUpdateInput', '10001')
       await page.click('#GLUXZipUpdate')
       await page.wait_for_timeout(4000)
       await page.click('button[name="glowDoneButton"]')
       logging.info(f"Delivery location changed to 10001 for {url}")
   except Exception as e:
       logging.warning(f"Failed to change delivery location for {url}: {e}")
   await page.wait_for_timeout(5000)
   content = await page.content()
   soup = BeautifulSoup(content, "html.parser")
   # Extract required fields with None if not present
   data = {
       "product_url": url,
       "name": extract_text_or_none(soup, "#productTitle"),
       "image_url": soup.select_one("#landingImage")['src'] if soup.select_one("#landingImage") else None,
       "brand": extract_text_or_none(soup, "tr.po-brand span.po-break-word"),
       "size": extract_text_or_none(soup, "#inline-twister-expanded-dimension-text-size_name"),
       "flavor": extract_text_or_none(soup, "tr.po-flavor span.po-break-word"),
       "age_range": extract_text_or_none(soup, "tr.po-age_range_description span.po-break-word"),
       "description": " | ".join([li.get_text(strip=True) for li in soup.select("#feature-bullets ul li")]) if soup.select("#feature-bullets ul li") else None,
       "item_category": extract_text_or_none(soup, "tr.po-item_form span.po-break-word"),
       "total_purchased_count": extract_text_or_none(soup, "div.social-proofing-faceout-title span.a-text-bold"),
       "selling_price": extract_text_or_none(soup, "span.a-price span.a-offscreen"),
       "discount": extract_text_or_none(soup, "span.savingPriceOverride"),
       "original_price": extract_list_price_strict(soup),  # STRICT list price extraction
       "rating": extract_text_or_none(soup, "span.reviewCountTextLinkedHistogram span.a-size-base"),
       "rating_count": extract_text_or_none(soup, "#acrCustomerReviewText")
   }
   logging.info(f"Scraped data for {url}: {data}")
   return dataIn the world of e-commerce, where product availability and pricing shift constantly, having up-to-date and structured product data is essential. Whether you're a data analyst tracking market trends, a product manager monitoring competitor listings, or part of an e-commerce intelligence team, automated web scraping can unlock valuable insights—especially when done with care and precision. This blog walks through a Playwright-based product scraper for Amazon, designed specifically for the dog food category, though the logic can be extended to other segments. It combines dynamic browser automation with HTML parsing to collect consistent, structured product data while navigating real-world challenges like dynamic content, location-specific availability, and anti-bot mechanisms.
A critical part of this process is a function that extracts detailed product information from an Amazon product page. Once the browser reaches the given product URL, it begins by setting the delivery location to ZIP code 10001 (New York). This step ensures consistency—Amazon often tailors availability and pricing based on location, and skipping this can lead to missing or incorrect data. After confirming the ZIP code update, the script waits for the page to fully load to avoid capturing incomplete content. BeautifulSoup is then used to parse the HTML, making it easier to navigate the DOM and extract necessary fields.
The function pulls a wide range of data: product title, brand, image URL, size, flavor, age range, key features, product form (like dry or wet food), purchase count, price details, discounts, and customer ratings. To ensure reliability, it uses helper functions to fetch each field, handling missing elements without breaking the flow. For example, it checks if an image or rating exists before trying to read its value. Feature bullets are combined into a clean string, and even pricing is handled carefully to differentiate between original and discounted rates.
In the end, this function returns a well-structured dictionary of all relevant product information. It’s clean, readable, and ready for storage in databases or transformation into a JSON file for further analysis. By logging each step—from page visit to ZIP code application to final data capture—it becomes easier to monitor the scraping process, debug issues, and scale up confidently.
Main Function to Run the Scraper
# Main Function to Run the Scraper 
async def main():
   """
   Main function that coordinates scraping all unprocessed product URLs.
   Workflow:
   - Connects to SQLite database.
   - Creates tables and adds missing columns if needed.
   - Loads URLs that haven't been scraped (scraped=0).
   - Opens browser using Playwright with random user agent and cookies.
   - Loops over each product URL:
       - Extracts product data
       - Saves it to database and JSON
       - Marks the URL as scraped
   - Saves final results to a JSON file.
   """
   conn = sqlite3.connect(DB_PATH)
   cursor = conn.cursor()
   # Create scraped column if not exists
   try:
       cursor.execute("ALTER TABLE product_urls ADD COLUMN scraped INTEGER DEFAULT 0")
       conn.commit()
   except:
       pass
   # Create product_data table if not exists
   cursor.execute("""
       CREATE TABLE IF NOT EXISTS product_data (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           product_url TEXT UNIQUE,
           name TEXT,
           image_url TEXT,
           brand TEXT,
           size TEXT,
           flavor TEXT,
           age_range TEXT,
           description TEXT,
           item_category TEXT,
           total_purchased_count TEXT,
           selling_price TEXT,
           discount TEXT,
           original_price TEXT,
           rating TEXT,
           rating_count TEXT
       )
   """)
   conn.commit()
  # Load URLs that haven't been scraped yet
   cursor.execute("SELECT id, url FROM product_urls WHERE scraped=0")
   urls = cursor.fetchall()
  
   # Pick a random user agent
   user_agent = get_random_user_agent()
 
   # Launch browser and start scraping
   async with async_playwright() as p:
       browser = await p.chromium.launch(headless=False)
       context = await browser.new_context(
           user_agent=user_agent,
           extra_http_headers=headers_template
       )
       await context.add_cookies(cookies)
       page = await context.new_page()
       all_results = []
       for row in urls:
           id_, url = row
           try:
               data = await scrape_product(page, url)
               all_results.append(data)
               # Insert into product_data table
               cursor.execute("""
                   INSERT OR REPLACE INTO product_data (
                       product_url, name, image_url, brand, size, flavor, age_range,
                       description, item_category, total_purchased_count,
                       selling_price, discount, original_price, rating, rating_count
                   ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
               """, (
                   data["product_url"], data["name"], data["image_url"], data["brand"], data["size"],
                   data["flavor"], data["age_range"], data["description"], data["item_category"],
                   data["total_purchased_count"], data["selling_price"], data["discount"],
                   data["original_price"], data["rating"], data["rating_count"]
               ))
               conn.commit()
               # Mark as scraped
               cursor.execute("UPDATE product_urls SET scraped=1 WHERE id=?", (id_,))
               conn.commit()
               logging.info(f"Data saved for {url}")
           except Exception as e:
               logging.error(f"Error scraping {url}: {e}")
       # Save all results to output JSON file
       with open(OUTPUT_JSON, "w") as f:
           json.dump(all_results, f, indent=4)
       await browser.close()
   conn.close()Managing the end-to-end workflow of a web scraping project requires more than just visiting pages and pulling data—it also involves organizing that process systematically. The main() function acts as the central controller that brings together all the moving parts of the scraping system. Think of it like the manager of a busy kitchen, coordinating ingredients (URLs), tools (browser and headers), and helpers (functions) to produce a well-structured data dish.
It starts by connecting to an SQLite database and making sure everything is in place: from creating required tables to ensuring a scraped column exists to track which URLs have already been processed. This prevents duplication and lets the scraper resume smoothly even if it was stopped midway. Next, it selects all unprocessed product URLs from the database—those with a scraped status of 0.
Once the data is ready, the function launches a browser session using Playwright. To avoid detection by anti-bot systems, it sets a random user agent and predefined headers, then loads cookies to simulate a real user session. For every product URL, it visits the page, extracts structured product details using the scrape_product() function, and saves the results both in a product_data database table and into a local JSON file.
After each product is successfully scraped, its status is updated in the database so that it's not processed again. If an error occurs, it is logged without stopping the entire process. Finally, all gathered data is saved into an output JSON file for easy access and analysis.
This well-organized loop not only makes the scraping process scalable and efficient but also ensures that results are stored reliably for future use. Whether you're a data analyst preparing for insights or a product manager monitoring listings, this kind of structured workflow ensures you always have clean, up-to-date data in hand.
Execution Flow
#  Entry Point 
if name == "__main__":
   asyncio.run(main())
"""
This block ensures the script runs only when executed directly (not imported as a module).
It uses asyncio to run the `main()` function asynchronously, which starts the scraping process.
"""Every Python script needs a clear starting point—just like a train waits for the signal before it moves. In this case, the line if name == "__main__": acts as that signal. It tells Python to begin the execution of the script only if it’s run directly, not when it’s imported elsewhere. This setup ensures that the scraper launches at the right time. Inside this block, asyncio.run(main()) kicks off the entire asynchronous scraping process. It’s like pressing the “Go” button, allowing all the carefully written logic to unfold—from loading pages to saving data. This simple structure keeps everything organized and ensures that automation begins exactly where and when it should.
Conclusion
In today’s data-driven world, staying ahead in e-commerce often comes down to having the right information at the right time. This blog demonstrated how web scraping—when done strategically using tools like Playwright, SQLite, and BeautifulSoup—can transform thousands of Amazon product listings into clean, structured, and insightful data. By automating the collection of real-time product details from the dog food category, we’ve shown how even a complex site like Amazon can be decoded with the right approach. Whether you're a data analyst, researcher, or entrepreneur, this scraping workflow lays the foundation for smarter decision-making, competitive analysis, and scalable data solutions—without the guesswork.
Libraries and Versions
Name: playwright
Version: 1.48.0
Name: beautifulsoup4
Version: 4.13.3
AUTHOR
I’m Anusha P O, Data Science Intern at Datahut. I specialize in building smart scraping systems that automate large-scale data collection from complex e-commerce sites like Amazon. In this blog, I walk you through how we extracted and structured thousands of product listings from Amazon’s dog food section using Playwright, SQLite, and asynchronous Python workflows—turning vast amounts of raw HTML into clean, analysis-ready datasets.
At Datahut, we help businesses unlock the full potential of web data by designing robust, scalable scraping solutions tailored for competitive intelligence, pricing analysis, and product visibility tracking. If you’re exploring data-driven strategies for e-commerce or product research, reach out via the chat widget on the right. Let’s work together to transform your data needs into actionable insights.
FAQs
1. Is it legal to scrape Amazon dog-food product data using Python libraries?
Scraping Amazon must be done carefully, as Amazon’s Terms of Service restrict automated scraping. To stay compliant, focus on publicly available data, respect robots.txt, and avoid overloading servers. For commercial use, consider Amazon’s official APIs.
2. Which Python libraries are best for scraping Amazon dog-food product listings?
Popular libraries include Requests for sending HTTP requests, BeautifulSoup for parsing HTML, Scrapy for large-scale crawling, and Selenium or Playwright for handling dynamic content like JavaScript-rendered pages.
3. Can I scrape product details like price, reviews, and ratings for Amazon dog-food items?
Yes, you can extract details such as product title, price, ratings, number of reviews, brand, and ingredients. However, prices and stock change frequently, so it’s important to schedule your scraper to run at regular intervals for updated data.
4. How do I avoid getting blocked while scraping Amazon?
Use techniques like rotating user agents, proxy servers, and adding random delays between requests. Also, keep your scraping rate slow to mimic human browsing and reduce the risk of CAPTCHAs or IP bans.
5. What are the real-world applications of scraping Amazon dog-food data?
Scraped data can help with price comparison, market trend analysis, competitor monitoring, customer sentiment analysis through reviews, and inventory optimization for pet product retailers and e-commerce businesses.


