
How to Scrape Ceiling Fan Data from Amazon?

  • Writer: Shahana farvin
  • Sep 26
  • 13 min read

Updated: Oct 27


Web scraping can seem overwhelming at first, but it's really just about teaching your computer to visit websites and collect information automatically. Today, we'll walk through a project that scrapes ceiling fan data from Amazon India. We'll break this down into simple, manageable steps that anyone can follow.


This project works in two phases. First, we collect all the product page links from Amazon's search results. Then, we visit each of those links to gather detailed information about each ceiling fan. Let's start with phase one.


URL Collection


Setting Up Our Tools


Before we can start collecting data, we need to import the right tools for the job. Think of this like gathering all your materials before starting a craft project.

import sqlite3
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

We're using four main tools here. SQLite helps us store our data in an organised database. Asyncio lets our program handle multiple tasks efficiently. Playwright acts like a remote control for web browsers, letting us navigate websites automatically. Beautiful Soup helps us read and understand the structure of web pages.


We also set up a simple variable to name our database file. This keeps everything organised and makes it easy to find our data later.

# SQLite database setup
DB_NAME = "amazon_products.db"

Creating Our Data Storage


Every good project needs a safe place to store the information we collect. We create a database that works like a digital filing cabinet.

def setup_database():
    """
    Create a SQLite database and table if it does not exist.

    This function initializes the database by creating a table named `product_urls`
    with two columns:
    - `id`: An auto-incrementing primary key.
    - `url`: A unique text field to store product URLs.

    If the table already exists, this function does nothing.
    """
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS product_urls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT NOT NULL UNIQUE
        )
    """)
    conn.commit()
    conn.close()

This function creates a simple table with two columns. The first column gives each entry a unique number automatically. The second column stores the actual web addresses of the products we find. The database only accepts each URL once, which prevents us from accidentally storing duplicates.
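If you want to confirm the table was actually created, a quick standalone check like this works (purely optional):

import sqlite3

# Optional sanity check: confirm the product_urls table exists
conn = sqlite3.connect("amazon_products.db")
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='product_urls'")
print(cursor.fetchone())  # prints ('product_urls',) if setup_database() ran successfully
conn.close()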


Saving URLs to Our Database


Once we find product links, we need a way to save them safely. This function handles that task for us.

def save_urls_to_db(urls):
    """
    Save a list of product URLs to the SQLite database.
    Args:
        urls (list): A list of product URLs to be saved.

    This function inserts URLs into the `product_urls` table. Duplicate URLs are ignored
    using the `INSERT OR IGNORE` statement to ensure uniqueness.
    """
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        for url in urls:
            cursor.execute("INSERT OR IGNORE INTO product_urls (url) VALUES (?)", (url,))
        conn.commit()
    except sqlite3.Error as e:
        print(f"Database error: {e}")
    finally:
        conn.close()

This function takes a list of URLs and adds each one to our database. The "INSERT OR IGNORE" command means that if we try to add a URL that already exists, the database will simply skip it instead of causing an error. We wrap everything in a try-except block to handle any problems gracefully.
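For example, calling the function twice with the same list is harmless; the second call simply skips every row that already exists. The URLs below are placeholders, not real product links:

sample_urls = [
    "https://www.amazon.in/dp/EXAMPLE1",
    "https://www.amazon.in/dp/EXAMPLE2",
]
save_urls_to_db(sample_urls)
save_urls_to_db(sample_urls)  # duplicates are ignored, so nothing changes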


Collecting Product URLs from Amazon


Now comes the main event. This function visits Amazon's search results and collects links to individual product pages.

async def scrape_product_urls():
    """
    Scrape product URLs from Amazon India and handle pagination.

    This function uses Playwright to navigate through Amazon India's search results pages
    for ceiling fans. It extracts product URLs using Beautiful Soup and saves them to
    the SQLite database.

    Pagination is handled by iterating through pages 1 to 400. The URLs are
    dynamically constructed using the `base_url` template.

    Steps:
    1. Navigate to each page.
    2. Wait for the page content to load.
    3. Parse the page content using Beautiful Soup.
    4. Extract product URLs from the page.
    5. Save the URLs to the database.

    Error handling ensures graceful recovery in case of scraping or database issues.
    """
    base_url = "https://www.amazon.in/s?k=ceiling+fans&i=kitchen&page={page}&crid=28ITASDE5GSK7&qid=1752732819&sprefix=ceiling+fans+%2Ckitchen%2C229&xpid=j43vzzw1MmBiO&ref=sr_pg_{page}"
    product_urls = []

We start by creating a template URL that we can modify for different pages. Amazon spreads search results across many pages, so we need to visit each page systematically. The curly braces in the URL act like blanks that we fill in with page numbers; page 2, for example, gets page=2 and ref=sr_pg_2.

    async with async_playwright() as playwright:
        try:
            browser = await playwright.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()

Here we launch a browser that our program can control. We set "headless=False", which means you can actually watch the browser work. This is helpful when you're learning because you can see exactly what's happening. Once the script runs reliably, you can switch to headless=True so the browser works invisibly in the background.


The real work happens in a loop that visits each page of search results.

            for page_number in range(1, 401):  # Iterate through pages 1 to 400
                current_url = base_url.format(page=page_number)
                print(f"Scraping: {current_url}")
                await page.goto(current_url, timeout=60000)

                # Wait for page content to load
                await page.wait_for_selector("div[role='listitem']", timeout=60000)
                content = await page.content()

For each page number from 1 to 400, we create the full URL by inserting the page number into our template. Then we navigate to that page and wait for it to load completely. The wait_for_selector line ensures that the product listings have appeared before we try to extract information from them.


Once the page loads, we use Beautiful Soup to find the product links.

                # Parse the page content with Beautiful Soup
                soup = BeautifulSoup(content, "html.parser")
                product_elements = soup.select("span.rush-component > a.a-link-normal.s-no-outline")

                # Extract product URLs
                page_urls = []
                for element in product_elements:
                    href = element.get("href")
                    if href and href.startswith("/"):
                        page_urls.append(f"https://www.amazon.in{href}")

Beautiful Soup reads the page like a structured document and finds all the links that match Amazon's pattern for product pages. We look for specific CSS selectors that Amazon uses for product links. Each link we find gets added to our collection, but we need to add Amazon's domain to the beginning since the links are relative.


After collecting URLs from each page, we print a progress message and add the new links to our main list. Once the loop finishes, everything is saved with the save_urls_to_db() function described earlier.

                # Print the number of URLs scraped from the current page
                print(f"Number of URLs scraped from page {page_number}: {len(page_urls)}")
                
                # Add the URLs from the current page to the main list
                product_urls.extend(page_urls)

            await browser.close()

        except Exception as e:
            print(f"Error during scraping: {e}")

    # Save URLs to the database
    save_urls_to_db(product_urls)
    print(f"Scraped {len(product_urls)} product URLs.")

Bringing It All Together


The final part of our script coordinates everything we've built.

if __name__ == "__main__":
    """
    Entry point of the script.

    This script performs the following tasks:
    1. Sets up the SQLite database.
    2. Initiates the scraping process to extract product URLs.
    """
    setup_database()
    asyncio.run(scrape_product_urls())

When we run the script, it first sets up our database table. Then it starts the URL collection process. The asyncio.run call starts the event loop that runs our asynchronous scraping function from start to finish.


This completes phase one of our project. We now have a database filled with links to individual ceiling fan product pages on Amazon India. In phase two, we'll visit each of these URLs to collect detailed information about each product, such as prices, ratings, and specifications.


The beauty of this approach is that we've separated the two tasks. We can run this script once to collect all the URLs, then run a different script to gather the detailed information. This makes our code more organised and easier to debug if something goes wrong.
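Before moving on, you can check how many URLs phase one actually collected with a few lines of standalone Python:

import sqlite3

conn = sqlite3.connect("amazon_products.db")
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM product_urls")
print(f"URLs collected so far: {cursor.fetchone()[0]}")
conn.close()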


Data Collection


Now that we have all our product URLs safely stored, it's time to visit each page and gather the specific details we want. This phase is like having a list of addresses and then visiting each house to collect information about what's inside.


Setting Up for Detailed Data Collection


Phase two starts with some additional tools that we'll need for handling the more complex data we're about to collect.

import sqlite3
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import json    # stores structured fields such as specifications as text
import random  # adds polite, randomised delays between requests

# SQLite database setup
DB_NAME = "amazon_products.db"

We add JSON support because product specifications and details come in complex formats that need special handling. The random module helps us add small delays between requests, which is important for being respectful to Amazon's servers.


Creating Storage for Product Details


Just like we needed a place to store URLs in phase one, we need a more sophisticated storage system for all the product details we're about to collect.

def setup_product_table():
    """
    Create a SQLite table named `product_data` if it does not exist.

    This table stores detailed product information scraped from Amazon.
    Columns:
    - `id`: Auto-incrementing primary key.
    - `title`: Product title.
    - `rating`: Product rating.
    - `reviews`: Number of reviews.
    - `price`: Product price.
    - `discount`: Discount percentage.
    - `original_price`: Original price.
    - `color`: Product color.
    - `specifications`: Product specifications, stored as JSON text.
    - `extra_details`: Additional product details, stored as JSON text.
    - `about`: "About this item" bullet points, stored as JSON text.
    - `url`: Product URL (unique).

    Also ensures the `scraped` column exists in the `product_urls` table.
    """
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS product_data (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            rating TEXT,
            reviews TEXT,
            price TEXT,
            discount TEXT,
            original_price TEXT,
            color TEXT,
            specifications TEXT,
            extra_details TEXT,
            about TEXT,
            url TEXT NOT NULL UNIQUE
        )
    """)

This new table has columns for all the different pieces of information we want to collect from each product page. Some columns like specifications, extra_details, and about will store complex information as JSON text, which lets us keep lists and detailed information organised.


We also need to modify our original URL table to track which pages we've already processed.

    # Check if the `scraped` column exists in the `product_urls` table
    cursor.execute("PRAGMA table_info(product_urls)")
    columns = [row[1] for row in cursor.fetchall()]
    if "scraped" not in columns:
        cursor.execute("""
            ALTER TABLE product_urls ADD COLUMN scraped INTEGER DEFAULT 0
        """)

    conn.commit()
    conn.close()

This code checks if our URL table already has a "scraped" column. If not, it adds one. This column acts like a checkbox system, marking each URL as either processed (1) or not yet processed (0).


Managing Our Scraping Progress


We need helper functions to keep track of which URLs we still need to process and to mark them as complete when we're done.

def fetch_unscraped_urls():
    """
    Fetch all URLs from the `product_urls` table where `scraped` is 0.

    Returns:
        list: A list of unscraped product URLs.
    """
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    cursor.execute("SELECT url FROM product_urls WHERE scraped = 0")
    urls = [row[0] for row in cursor.fetchall()]
    conn.close()
    return urls

This function looks at our URL table and returns only the URLs we haven't processed yet. It's like having a to-do list that automatically shows us what work is left.

def mark_url_as_scraped(url):
    """
    Mark a URL as scraped in the `product_urls` table.

    Args:
        url (str): The URL to mark as scraped.
    """
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    cursor.execute("UPDATE product_urls SET scraped = 1 WHERE url = ?", (url,))
    conn.commit()
    conn.close()

After we successfully collect data from a product page, this function marks that URL as complete. If our scraping gets interrupted, we can restart and pick up exactly where we left off.
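A small query like the one below tells you how far along the scrape is; it is just a progress check, not part of the scraper itself:

import sqlite3

# How many URLs are done versus still pending?
conn = sqlite3.connect("amazon_products.db")
cursor = conn.cursor()
cursor.execute("SELECT scraped, COUNT(*) FROM product_urls GROUP BY scraped")
for scraped, count in cursor.fetchall():
    print("scraped" if scraped else "pending", count)
conn.close()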


Saving Complex Product Data


The product information we collect is much more detailed than simple URLs, so we need a more sophisticated saving function.

def save_product_data(data):
    """
    Save scraped product data to the `product_data` table.

    Args:
        data (dict): A dictionary containing product details.
    """
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()

        # Convert the list fields (specifications, extra_details, about) to JSON strings
        data["specifications"] = json.dumps(data["specifications"])
        data["extra_details"] = json.dumps(data["extra_details"])
        data["about"] = json.dumps(data["about"])

        cursor.execute("""
            INSERT OR IGNORE INTO product_data (title, rating, reviews, price, discount, original_price, color, specifications, extra_details, about, url)
            VALUES (:title, :rating, :reviews, :price, :discount, :original_price, :color, :specifications, :extra_details, :about, :url)
        """, data)
        conn.commit()
    except sqlite3.Error as e:
        print(f"Database error: {e}")
    finally:
        conn.close()

Before saving, we convert complex data like specifications into JSON format. This lets us store lists and detailed information in a way that we can easily read back later. The function uses named parameters, which makes the code clearer and safer.
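When you read the data back later, json.loads turns those text columns into Python lists again. A minimal sketch:

import json
import sqlite3

conn = sqlite3.connect("amazon_products.db")
cursor = conn.cursor()
cursor.execute("SELECT title, specifications FROM product_data LIMIT 1")
row = cursor.fetchone()
if row:
    title, specs_json = row
    specs = json.loads(specs_json)  # back to a list of {"key": ..., "value": ...} dicts
    print(title, specs[:3])
conn.close()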


Extracting Information from Product Pages


Now comes the detailed work of finding and extracting specific pieces of information from each product page. Amazon's pages have consistent patterns, so we can write functions that know exactly where to look for each piece of data.


async def parse_title(soup):
    """
    Parse the product title from the HTML content.

    Args:
        soup (BeautifulSoup): Parsed HTML content.

    Returns:
        str: Product title or None if not found.
    """
    try:
        return soup.select_one("h1#title > span#productTitle").get_text(strip=True)
    except AttributeError:
        return None

Each parsing function follows the same pattern. We use CSS selectors to find the exact location of the information we want, then extract the text content. If something goes wrong or the information isn't there, we return None instead of crashing.


The CSS selectors might look complex, but they're just precise addresses that tell Beautiful Soup exactly where to find each piece of information on the page. Each function handles one specific piece of data, like title, price, or rating.
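Only the title parser is shown in full here; the remaining parsers follow the same shape. As an illustration, a price parser might look like the sketch below. The span.a-price-whole selector is an assumption about Amazon's current price markup, so verify it against real product pages before relying on it.

async def parse_price(soup):
    """
    Parse the displayed price from the HTML content.

    Args:
        soup (BeautifulSoup): Parsed HTML content.

    Returns:
        str: Price text (for example "1,499") or None if not found.
    """
    try:
        # NOTE: selector is an assumption about Amazon's price layout; adjust if needed
        return soup.select_one("span.a-price span.a-price-whole").get_text(strip=True)
    except AttributeError:
        return None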

async def parse_product_specifications(soup):
    """
    Parse the product specifications from the HTML content.

    Args:
        soup (BeautifulSoup): Parsed HTML content.

    Returns:
        list: A list of dictionaries containing key-value pairs for product specifications.
    """
    specifications = []
    try:
        # Find all rows in the specifications table
        rows = soup.select("table.a-normal.a-spacing-micro tr")
        for row in rows:
            # Extract the key (left column) and value (right column)
            key = row.select_one("td.a-span3 span.a-size-base.a-text-bold")
            value = row.select_one("td.a-span9 span.a-size-base.po-break-word")
            if key and value:
                specifications.append({
                    "key": key.get_text(strip=True),
                    "value": value.get_text(strip=True)
                })
    except AttributeError:
        pass  # Handle cases where the structure is not found

    return specifications

Some functions, like this one for specifications, are more complex because they extract multiple pieces of related information. This function finds Amazon's specifications table and converts each row into a key-value pair, creating a list of all the technical details about the product.
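For a typical fan listing, the returned list looks something like this (the values here are illustrative, not scraped data):

[
    {"key": "Brand", "value": "ExampleBrand"},
    {"key": "Colour", "value": "White"},
    {"key": "Power Source", "value": "Corded Electric"},
]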


Coordinating the Data Extraction


We need a main function that uses all our individual parsing functions to extract complete product information from each page.

async def parse_product_page(content, url):
    """
    Parse the product page content and extract details.

    Args:
        content (str): HTML content of the product page.
        url (str): URL of the product page.

    Returns:
        dict: A dictionary containing product details.
    """
    soup = BeautifulSoup(content, "html.parser")
    title = await parse_title(soup)
    price = await parse_price(soup)
    rating = await parse_rating(soup)
    reviews = await parse_number_of_reviews(soup)
    discount = await parse_discount_percentage(soup)
    original_price = await parse_original_price(soup)
    color = await parse_color(soup)
    specifications = await parse_product_specifications(soup)
    extra_details = await parse_extra_details(soup)
    about = await parse_about(soup)

    return {
        "title": title,
        "price": price,
        "rating": rating,
        "reviews": reviews,
        "discount": discount,
        "original_price": original_price,
        "color": color,
        "url": url,
        "specifications": specifications,
        "extra_details": extra_details,
        "about": about
    }

This function takes the raw HTML content of a product page and runs it through all our parsing functions. The result is a clean dictionary containing all the information we could extract from that page.


Running the Complete Data Collection


The main scraping function brings everything together, processing each URL in our database systematically.

async def scrape_product_data():
    """
    Scrape product data by iterating through URLs in the `product_urls` table.

    This function uses Playwright to navigate to each product URL, extracts data,
    and saves it to the `product_data` table. It also marks URLs as scraped.
    """
    urls = fetch_unscraped_urls()
    if not urls:
        print("No unscraped URLs found.")
        return

    async with async_playwright() as playwright:
        try:
            browser = await playwright.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()

            for url in urls:
                try:
                    print(f"Scraping: {url}")
                    await page.goto(url, timeout=60000)

                    # Wait for the page content to load
                    await page.wait_for_selector("#productTitle", timeout=60000)
                    content = await page.content()

                    # Parse the product page
                    product_data = await parse_product_page(content, url)

                    # Save the product data to the database
                    save_product_data(product_data)

                    # Mark the URL as scraped
                    mark_url_as_scraped(url)

                    # Add a random delay between 2 and 3 seconds
                    delay = random.uniform(2, 3)
                    print(f"Delaying for {delay:.2f} seconds...")
                    await asyncio.sleep(delay)

                except Exception as e:
                    print(f"Error scraping {url}: {e}")

            await browser.close()

        except Exception as e:
            print(f"Error initializing Playwright: {e}")

The function starts by getting all unprocessed URLs. For each URL, it navigates to the page, waits for the main content to load, extracts all the product information, saves it to our database, and marks the URL as complete.


The random delay between requests is important. It makes our scraping more natural and respectful to Amazon's servers. Each delay is between 2 and 3 seconds, which gives the server time to breathe between our requests.


Completing the Project


The final coordination happens when we run the script, just like in phase one.

if __name__ == "__main__":
    """
    Entry point of the script.

    This script performs the following tasks:
    1. Sets up the SQLite database and tables.
    2. Scrapes product data from URLs stored in the `product_urls` table.
    """
    setup_product_table()
    asyncio.run(scrape_product_data())

When we run phase two, it sets up the new database table and then processes all the URLs we collected in phase one. The beauty of this two-phase approach is that each part has a clear job. Phase one focuses on finding all the products, while phase two focuses on collecting detailed information about each one.


By the end of this process, we have a complete database of ceiling fan information from Amazon India. We can analyze prices, compare ratings, study specifications, and gain insights into the ceiling fan market. The structured data we've collected opens up possibilities for analysis, comparison shopping tools, or market research that would be impossible to do manually.
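As a starting point for that analysis, here is a minimal sketch that loads the table into pandas (assuming it is installed) and converts the text columns into numbers. The exact cleaning rules depend on how the scraped strings look, so treat the regular expressions as a first pass:

import sqlite3
import pandas as pd

conn = sqlite3.connect("amazon_products.db")
df = pd.read_sql_query("SELECT title, price, rating, reviews FROM product_data", conn)
conn.close()

# Prices and ratings are stored as text, so convert them before analysing
df["price_num"] = pd.to_numeric(
    df["price"].str.extract(r"([\d,.]+)")[0].str.replace(",", "", regex=False),
    errors="coerce",
)
df["rating_num"] = pd.to_numeric(df["rating"].str.extract(r"([\d.]+)")[0], errors="coerce")

print(df[["price_num", "rating_num"]].describe())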


Wrapping Up


You've successfully built a complete web scraping system that collects product data from Amazon India. By separating URL collection from data extraction, you created a robust and maintainable solution that can handle thousands of products efficiently.


The skills you've learned here go beyond just this project. You now understand how to navigate websites programmatically, extract structured data from HTML, manage databases, and implement respectful scraping practices. These techniques can be adapted to collect data from other websites and build different types of analysis tools.


Your database is now filled with valuable product information that would have taken weeks to collect manually. Whether you use this data for market research, price comparison, or trend analysis, you have the foundation to turn web data into actionable insights.


FAQ Section


1) What data should I collect when scraping ceiling fans on Amazon?

Collect product title, ASIN, price, list price (MRP), discounts, product images, bullet points/specs, average rating, number of reviews, seller name, shipping info, availability, dimensions/weight (if shown), category breadcrumbs, and product URL. Also capture scrape timestamp and source page HTML or snapshot for auditing.


2) Which tools & approach work best for scraping Amazon listings?

Start with requests + BeautifulSoup for static pages. Use Playwright or Selenium (headless) for JS-rendered content or infinite scroll. For scale or robust anti-bot handling, use Playwright with rotating proxies, randomized user-agents, and request throttling. Prefer using Amazon Product Advertising API if you have access — it’s official and safer.


3) How do I find the right HTML elements (selectors) for ceiling fan info?

Open a product page in a browser, right-click → Inspect. Look for: title (#productTitle), price (#priceblock_ourprice or #priceblock_dealprice), images (#imgTagWrapperId img or data-a-dynamic-image), rating (.a-icon-alt), reviews count (#acrCustomerReviewText), and specs in the “Product details” or “Technical Details” table. Use CSS selectors or XPath to extract those nodes. Always test selectors across multiple product pages—Amazon templates vary.


4) What anti-blocking and legal/ethical practices should I follow?

Rate-limit your requests (random delays), use backoff on failures, rotate User-Agent strings, and rotate IPs/proxies if scraping at scale. Respect robots.txt and Amazon’s Terms of Service; prefer their official API when possible. Never scrape or store personal data (buyer info). Monitor for CAPTCHA / 503 responses and stop if the site detects you. Log your activity and add caching to reduce load on Amazon.


5) How should I store, clean, and deduplicate scraped fan data?

Store raw HTML or JSON output plus parsed fields and a timestamp. Use ASIN as the unique key to deduplicate. Normalize prices (store numeric + currency), strip whitespace from titles/specs, parse numeric values (ratings, review counts), and standardize units (e.g., dimensions). Keep a version or history table if you want price/availability time series.
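As a small illustration of the price-normalisation step, something like this works on typical scraped strings (a sketch; adjust it for the formats you actually encounter):

import re

def normalise_price(price_text):
    """Turn a scraped price string like '₹1,499.00' into a float, or None if it can't be parsed."""
    if not price_text:
        return None
    match = re.search(r"[\d,]+(?:\.\d+)?", price_text)
    return float(match.group().replace(",", "")) if match else None

print(normalise_price("₹1,499.00"))  # 1499.0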


AUTHOR


I’m Shahana, a Data Engineer at Datahut, where I specialize in building smart, scalable data pipelines that transform messy web data into structured, usable formats—especially in domains like retail, e-commerce, and competitive intelligence.


At Datahut, we help businesses across industries gather valuable insights by automating data collection from websites, even those that rely on JavaScript and complex navigation. In this blog, I’ve walked you through a real-world project where we created a robust web scraping workflow to collect product information efficiently using Playwright, BeautifulSoup, and SQLite. Our goal was to design a system that handles dynamic pages, pagination, and data storage—while staying lightweight, reliable, and beginner-friendly.


If your team is exploring ways to extract structured product or pricing data at scale—or if you're just curious how web scraping can support smarter decisions—feel free to connect with us using the chat widget on the right. We’re always excited to share ideas and build custom solutions around your data needs.


Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
