
How to Scrape Tablet Data from Amazon Using Playwright (Step-by-Step Tutorial)

  • Writer: Anusha P O
  • 15 min read

Amazon India’s tablets section becomes especially interesting during the Great Indian Festival, when prices fluctuate rapidly, rankings change by the hour, and new offers appear across thousands of product listings. This dataset matters because it captures how real tablet products are presented, priced, and promoted during one of the busiest sale periods of the year, offering a clear window into pricing trends, brand competition, and product visibility on a large e-commerce platform.


In this blog, Playwright is used as the core tool because it behaves like a real browser, allowing dynamic pages to load fully before data is collected, even when content changes frequently during a sale. Over a carefully managed three-day scraping process, product URLs and detailed tablet information such as names, brands, and sale prices are extracted in a stable and controlled manner.


Scrape Tablet Data from Amazon Using Playwright


Let us walk through how to scrape tablet data from Amazon using Playwright, step by step:


Stage 1: URL Scraping from the Amazon Tablets Section


Before detailed product information can be collected, the first and most time-consuming step is gathering the correct product page URLs, and this stage focuses on how tablet links were carefully scraped from Amazon’s tablets section during the high-traffic Great Indian Festival period. The tablets listing page acts as the main entry point, where hundreds of products appear across multiple pages and continuously change due to offers, rankings, and availability, much like shelves being rearranged in a busy store every few hours. The scraping process was spread across three days, allowing the system to handle rate limits, page refreshes, and temporary blocks more safely while ensuring no valid tablet listings were missed. The script opens the category page, waits for products to load fully, and then scans each visible product card to extract only the product URLs, carefully skipping accessories or unrelated items. As pagination moves the listing forward, newly discovered links are stored in a database with progress tracking, so already-collected URLs are not repeated in later runs on the same day; URLs collected on each subsequent day are stored in a separate table.


Stage 2: Scraping Data by Visiting Each Amazon Tablet Product Page


After the tablet product URLs were safely collected, the next step focused on visiting each link individually and carefully extracting the details displayed on those pages, turning simple URLs into meaningful product data. This stage can be compared to opening every product box after listing them, taking time to read the label, brand, and price instead of judging from the shelf alone. During the Great Indian Festival, when prices and availability changed frequently, this process was intentionally spread across three days to reduce pressure on the website and ensure stable, accurate data collection. Each stored URL was opened one by one, the page was given enough time to load fully, and the visible content was converted into a readable structure so important details such as the tablet name, brand, sale price, and scrape time could be captured without confusion. Once the information was extracted, it was saved into a structured database and marked as completed, which prevented the same page from being revisited in later runs and allowed the scraping process to pause and resume safely if needed.


Stage 3: Cleaning and Preparing the Dataset


After scraping raw product data from the Amazon tablet product pages, the next step was cleaning and preparing the dataset so it could be analyzed reliably. Using OpenRefine, the scraped file was loaded into a spreadsheet-style view, making inconsistencies easy to detect. Prices were listed in Indian Rupees (₹), so currency symbols, commas, and other formatting issues were removed to convert them into clean numeric values. During this stage, non-relevant products such as tablet medicines, which appeared alongside tablets on Amazon, were identified and removed to keep the dataset category-specific. Duplicate product URLs generated during scraping were also eliminated. By the end of this process, the dataset was clean, consistent, and fully ready for accurate analysis of Amazon tablet products.
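OpenRefine was used interactively here, but the same three fixes can be sketched in plain Python. In the minimal sketch below, the sample rows, the clean_price helper, and the NON_TABLET_WORDS list are illustrative stand-ins for the real dataset, not part of the project's code:

```python
# Illustrative sample of scraped rows: one duplicate URL and one non-tablet item
rows = [
    {"url": "https://www.amazon.in/dp/A1", "name": "Tab X 11-inch", "sale_price": "₹24,999"},
    {"url": "https://www.amazon.in/dp/A1", "name": "Tab X 11-inch", "sale_price": "₹24,999"},
    {"url": "https://www.amazon.in/dp/B2", "name": "Herbal Kashayam Tablet", "sale_price": "₹350"},
]

def clean_price(text):
    """Strip the rupee symbol, commas, and whitespace, then convert to int."""
    return int(text.replace("₹", "").replace(",", "").strip())

# Words that flag medicines or other non-tablet products (illustrative list)
NON_TABLET_WORDS = ("kashaya", "kashayam", "medicine")

seen, cleaned = set(), []
for row in rows:
    if row["url"] in seen:
        continue  # drop duplicate URLs
    if any(w in row["name"].lower() for w in NON_TABLET_WORDS):
        continue  # drop non-tablet products
    seen.add(row["url"])
    cleaned.append({**row, "sale_price": clean_price(row["sale_price"])})
```

After this pass, only the genuine tablet row survives, with its price stored as a plain integer ready for analysis.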


Essential Python Libraries Powering the Amazon Tablet Scraper


At first glance, a web scraper may appear to be just a short script that pulls data from a webpage, but behind that simplicity lies a group of carefully chosen Python libraries working together to keep the process stable and beginner friendly. In this project, asyncio manages the flow of tasks so the scraper can wait patiently while Amazon tablet pages load without freezing the entire program, which is especially useful when handling many product URLs. Playwright acts as the real browser, opening Amazon’s tablets section, loading JavaScript-driven content, and capturing the full page just as a human visitor would during the Great Indian Festival rush. Once the page content is available, BeautifulSoup steps in to read the raw HTML and gently extract useful details like product links, names, and prices without confusion. To keep everything organized, sqlite3 provides a lightweight local database where URLs and scraped tablet data are stored safely, while json allows the same information to be saved in a portable format for sharing or analysis. Throughout the process, logging quietly records each step, making it easier to trace progress or understand issues later, and standard tools like os and datetime help manage files and timestamps, ensuring the scraper runs in a clean, predictable way. Together, these libraries form a reliable backbone that turns a complex, dynamic Amazon page into structured tablet data in a way that feels approachable and easy to follow for beginners.


Step 1: Scraping Product URLs from the Tablets Section on Amazon India


Importing Libraries

Import necessary libraries

import asyncio
import logging
import os
import sqlite3
import json
from datetime import datetime
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

This set of commonly used Python libraries makes web scraping simple and reliable: Playwright is used to open and load Amazon’s tablets listing page just like a real browser, BeautifulSoup helps read and understand the page’s HTML content, and SQLite stores the collected tablet details neatly so they can be reused later, while logging and JSON help track progress and save data in an organized way.


Foundational Configuration Settings

Configurations

START_URL = "https://www.amazon.in/s?i=computers&rh=n%3A1375458031&s=popularity-rank&fs=true&ref=lp_1375458031_sar"
DB_PATH = "/home/anusha/Desktop/DATAHUT/Amazon_tablets/Data/amazon_tablets.db"
JSON_PATH = "/home/anusha/Desktop/DATAHUT/Amazon_tablets/Data/amazon_tablets.json"

This configuration section sets the foundation for the entire scraping process by clearly defining where the data comes from and where it should be saved, making the workflow easy to understand. The START_URL points to Amazon India’s tablets listing page, which ensures the scraper consistently starts from a stable and relevant source page, while DB_PATH specifies the exact location where scraped tablet details will be stored in a local SQLite database for structured access and easy querying later, and JSON_PATH provides a parallel option to save the same data in JSON format, which is widely used for data sharing, APIs, and analysis; together, these paths act like a roadmap for the scraper, clearly separating data collection from data storage and helping newcomers understand how raw web data moves from a webpage into reusable files for further analysis or reporting.


Logging Setup for Reliable Tablet Data Scraping

Logging

LOG_DIR = "/home/anusha/Desktop/DATAHUT/Amazon_tablets/Log"
os.makedirs(LOG_DIR, exist_ok=True)

logging.basicConfig(
    filename=os.path.join(LOG_DIR, f"url_scraper_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"),
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] - %(message)s"
)

This logging setup acts like a quiet notebook running alongside the scraper, carefully noting what happens at each step so that progress and issues can be reviewed later without interrupting the main task. The code first defines a dedicated folder to store log files and ensures it exists, which helps keep scraping runs organized instead of mixing messages with other project files, and then configures Python’s logging system to automatically create a time-stamped log file that records important events in a clear, readable format; this makes it easier to understand when the scraper started, which pages were processed, and whether any errors occurred, and it also encourages good development habits by showing how real-world data projects rely on logs to debug problems and track long-running tasks, similar to how a delivery receipt helps confirm where a package has been and when it arrived.


SQLite Database Setup for Storing URLs

DB Setup

def init_db():
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS tablets2 (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT,
            scraped_date TEXT
        )
    """)
    conn.commit()
    return conn

This database setup creates a simple and dependable place to store tablet product links collected from Amazon’s tablets listing page, making the scraping process more organized and easier to manage over time. The init_db function connects to a local SQLite database file and checks whether a table named tablets2 already exists, creating it only if needed so the scraper can be run multiple times without errors, while the table structure itself is kept intentionally simple by storing each product URL along with the date it was scraped, which helps to understand how raw website links are gradually turned into structured data that can be queried, updated, or reused later, much like maintaining a neatly labeled notebook instead of loose pages scattered across a desk.


Filtering Non-Tablet Products While Scraping Tablet Listings

Filtering Logic

EXCLUDE_KEYWORDS = [
    "Timer", "HandBag", "Back Cover", "Case Cover", "Flip Case",
    "Volume Button Side Button Out Keys", "Coin Tissue",
    "Power Volume Button Flex Internal Keys", "Sticker Multicolour",
    "Photo Stand", "Pencil Case", "Vinyl Stickers", "Tablet Cutter",
    "NAGARJUNA", "Kashaya", "Kashayam"
]

This filtering logic helps keep the scraped data clean and focused by clearly defining which items should be ignored while collecting tablet product links from Amazon’s tablets section. The EXCLUDE_KEYWORDS list contains common words and phrases that usually belong to accessories, covers, stickers, or unrelated products, and during scraping these terms act like a simple checkpoint—if a product title contains any of them, it is skipped—so only genuine tablet listings move forward in the process, which makes the final dataset more accurate and easier to analyze later.


Filter unwanted products by HTML structure and keywords

Filter unwanted products

def is_valid_product(html):

    """Filter unwanted products by HTML structure and keywords"""

    # Exclude Ayurvedic & non-tablet products with s-line-clamp-3
    if 's-line-clamp-3' in html:
        return False

    # Exclude by keywords
    for word in EXCLUDE_KEYWORDS:
        if word.lower() in html.lower():
            return False

    return True

This product validation function acts as a final quality check to ensure that only genuine tablet listings are saved during the scraping process. The is_valid_product function examines the raw HTML of each product card and immediately filters out unrelated items, such as ayurvedic products or accessories, by checking for specific page patterns and previously defined keywords, and if any unwanted signal is found the product is skipped without stopping the scraper; this simple logic helps to understand how small checks, applied at the right stage, can significantly improve data accuracy and reduce noise, much like quickly scanning a document for obvious mismatches before filing it away.

Scraper

async def scrape_tablets():
    conn = init_db()
    cur = conn.cursor()
    all_results = []
    scraped_date = datetime.now().strftime("%Y-%m-%d")

    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=False)
        page = await browser.new_page()
        await page.goto(START_URL, timeout=60000)

        while True:
            logging.info(f"Scraping page: {page.url}")
            await page.wait_for_selector("div[data-cy='title-recipe']", timeout=30000)
            html = await page.content()
            soup = BeautifulSoup(html, "html.parser")

This scraper function is the heart of the entire process, carefully guiding the program from opening Amazon’s tablets listing page to collecting clean and usable product links across multiple pages. In the first part, the scrape_tablets function prepares the environment by connecting to the SQLite database, creating a cursor for saving data, initializing a list to store results for JSON output, and recording the current date so every scraped link is time-stamped, which helps track when the data was collected. Using Playwright, a real browser session is launched and directed to the Amazon tablets page defined in START_URL, and once the page loads, the HTML content is captured and passed to BeautifulSoup, which translates the complex webpage structure into something readable and searchable, similar to turning a crowded bookshelf into neatly arranged sections.

            # Extract product links (normal + sponsored)
            links = soup.select("div[data-cy='title-recipe'] a.a-link-normal, a.a-link-normal.s-line-clamp-4")
            logging.info(f"Found {len(links)} links on page")

            for link in links:
                href = link.get("href")
                if not href:
                    continue

                full_url = "https://www.amazon.in" + href if href.startswith("/") else href
                product_html = str(link.parent)

                if not is_valid_product(product_html):
                    logging.info(f"Excluded product: {full_url}")
                    continue

The second part focuses on finding the actual product links on the page, including both regular and sponsored listings, by selecting common HTML patterns Amazon uses for tablet titles. Each link is checked carefully to ensure it contains a valid URL, converted into a full Amazon link if needed, and then reviewed using the earlier filtering logic so that accessories or unrelated products do not slip through, reinforcing the idea that good scraping is not just about collecting more data, but about collecting the right data.


                # Save to DB
                cur.execute("INSERT INTO tablets2 (url, scraped_date) VALUES (?, ?)",
                            (full_url, scraped_date))
                conn.commit()

                # Save to list for JSON
                all_results.append({
                    "url": full_url,
                    "scraped_date": scraped_date
                })

                logging.info(f"Scraped: {full_url}")

In the third part, every valid tablet link is saved immediately into the database along with the scrape date, ensuring nothing is lost even if the scraper stops unexpectedly, and at the same time the same information is added to a Python list so it can later be written into a JSON file, showing how the same data can be stored in multiple formats for different use cases, such as analysis or sharing.

            # Check for "Next" button
            next_button = soup.select_one("a.s-pagination-next")
            if next_button and "href" in next_button.attrs:
                next_url = "https://www.amazon.in" + next_button["href"]
                await page.goto(next_url, timeout=60000)
            else:
                logging.info("No more pages. Exiting.")
                break

        await browser.close()

The final part handles pagination, which allows the scraper to move smoothly from one results page to the next by checking for the presence of Amazon’s “Next” button and loading the next page when available, and once no further pages are found, the loop ends gracefully and the browser is closed, completing the journey from the first tablet listing to the last without manual intervention, much like flipping through pages of a catalog until the end is reached.
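One detail the snippets above leave implicit is the final write of all_results to JSON_PATH after the pagination loop ends. A minimal sketch of that last step, using illustrative stand-in values for both the path and the list, could look like this:

```python
import json

# Illustrative stand-ins for JSON_PATH and the list built inside scrape_tablets()
JSON_PATH = "amazon_tablets.json"
all_results = [
    {"url": "https://www.amazon.in/dp/A1", "scraped_date": "2025-10-01"},
    {"url": "https://www.amazon.in/dp/B2", "scraped_date": "2025-10-01"},
]

# Write the collected links as UTF-8 JSON once scraping finishes;
# ensure_ascii=False keeps any non-ASCII characters readable in the file
with open(JSON_PATH, "w", encoding="utf-8") as f:
    json.dump(all_results, f, ensure_ascii=False, indent=2)
```

Because every URL is already committed to SQLite as it is found, this JSON file is a convenience copy for sharing or analysis, not the primary store.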


Running the Scraper Using Main Entry Point

Main

if __name__ == "__main__":
    asyncio.run(scrape_tablets())

This main block acts as the official starting switch for the scraper, telling Python exactly when the scraping process should begin. By checking if __name__ == "__main__":, the code ensures that the scrape_tablets function runs only when this file is executed directly and not when it is imported elsewhere, and asyncio.run(scrape_tablets()) then safely starts the asynchronous scraping task.



Step 2: Collecting Detailed Information from Individual Product Pages


Importing Libraries

Import necessary libraries

import asyncio
import sqlite3
import logging
import os
from datetime import datetime
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

Logging Configuration

Logging Setup

LOG_DIR = "/home/anusha/Desktop/DATAHUT/Amazon_tablets/Log"
os.makedirs(LOG_DIR, exist_ok=True)
logging.basicConfig(
    filename=os.path.join(LOG_DIR, "data_scraper.log"),
    level=logging.DEBUG,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

"""This section configures the logging system"""

This logging setup quietly records everything the scraper does in the background, making it easier to understand how the program behaves while collecting tablet data from Amazon’s listings page.


SQLite Database Setup for Storing Product Details

Database Setup

 """This section sets up the SQLite database"""

DB_PATH = "/home/anusha/Desktop/DATAHUT/Amazon_tablets/Data/amazon_tablets.db"

def init_db():
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()

   
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS product_data (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT,
        name TEXT,
        brand TEXT,
        sale_price TEXT,
        scraped_date TEXT
    )
    """)

In the first part of the database setup, the code creates a structured and reliable space to store detailed tablet information scraped from Amazon’s tablets listing page, starting by defining a clear file path for the SQLite database and then opening a connection to it. A table named product_data is created only if it does not already exist, with carefully chosen columns such as product URL, name, brand, sale price, and scrape date, which helps to understand how raw product listings are gradually transformed into organized records that are easy to query later, similar to arranging product details into clearly labeled columns in a spreadsheet instead of leaving them scattered across notes.

    cursor.execute("""
    CREATE TABLE IF NOT EXISTS tablets2 (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT,
        scraped INTEGER DEFAULT 0
    )
    """)

    try:
        cursor.execute("ALTER TABLE tablets2 ADD COLUMN scraped INTEGER DEFAULT 0")
    except sqlite3.OperationalError:
        pass

    conn.commit()
    return conn

The second part focuses on tracking progress during scraping, which becomes especially important when dealing with many pages. Here, a separate table called tablets2 is created to store only product URLs along with a scraped flag that indicates whether a link has already been processed, allowing the scraper to pause and resume without repeating work. The additional check to add the scraped column safely, without breaking the program if it already exists, introduces a practical, real-world pattern used in long-running data tasks, and the final commit ensures all changes are saved before returning the database connection for use in the rest of the scraping workflow.
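Because tablets2 carries this scraped flag, the progress of a long run can be checked at any point with a single query. The in-memory database and sample rows below are purely illustrative; the real script would run the same query against DB_PATH:

```python
import sqlite3

# In-memory database for illustration; the real script opens the DB_PATH file
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE tablets2 (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT,
        scraped INTEGER DEFAULT 0
    )
""")
cur.executemany(
    "INSERT INTO tablets2 (url, scraped) VALUES (?, ?)",
    [("https://www.amazon.in/dp/A1", 1),   # already processed
     ("https://www.amazon.in/dp/B2", 0),   # still pending
     ("https://www.amazon.in/dp/C3", 0)],
)

# One query reports how far a long run has progressed
cur.execute("SELECT SUM(scraped), COUNT(*) - SUM(scraped) FROM tablets2")
done, pending = cur.fetchone()
```

This is the same pattern that lets the scraper pause and resume: on restart, only the rows where scraped = 0 are fetched.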


Extracting Individual Tablet Details Using Playwright and BeautifulSoup

Scraping Logic

"""Scrape product details from a single Amazon product page"""

async def scrape_product(page, url):
    try:
        await page.goto(url, timeout=60000)
        await page.wait_for_timeout(3000)
        soup = BeautifulSoup(await page.content(), "html.parser")

In the first part of this scraping logic, the function focuses on visiting a single Amazon tablet product page and preparing its content for extraction in a safe and readable way. The browser page is directed to the product URL, given a short pause to fully load dynamic content, and then the complete HTML is passed into BeautifulSoup, which converts the complex webpage into a structured format that can be easily searched, helping beginners see how each product page is handled one at a time rather than all at once.


        def safe_select(selector, attr=None):
            el = soup.select_one(selector)
            if not el:
                return None
            return el.get(attr) if attr else el.get_text(strip=True)

      
        name = safe_select("span#productTitle")
        if not name:
            logging.info(f"Skipped empty product: {url}")
            return None

        brand = safe_select("tr.po-brand span.po-break-word") or safe_select("a#bylineInfo")
        sale_price = safe_select("span.a-price span.a-price-whole")
        scraped_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        return {
            "url": url,
            "name": name,
            "brand": brand,
            "sale_price": sale_price,
            "scraped_date": scraped_date
        }

    except Exception as e:
        logging.error(f"Error scraping {url}: {e}")
        return None

The second part carefully pulls out specific product details while avoiding common errors that can stop a scraper. A small helper function checks whether each piece of information exists before trying to read it, which prevents the program from breaking when a field is missing, and then key details such as the product name, brand, and sale price are collected along with the current date to record when the data was captured. If a page does not contain a valid product title, it is safely skipped, and any unexpected issue is logged for later review, showing how thoughtful checks and error handling help transform raw Amazon pages into clean, usable tablet data suitable for storage and analysis.


Main Runner Logic for Scraping and Storing Product Data

Main Runner

async def main():
    """This function controls the overall scraping process"""
    conn = init_db()
    cursor = conn.cursor()

    cursor.execute("SELECT id, url FROM tablets2 WHERE scraped = 0")
    urls = cursor.fetchall()
    logging.info(f"Total URLs to scrape: {len(urls)}")

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        for idx, (url_id, url) in enumerate(urls, start=1):
            logging.info(f"Scraping {idx}/{len(urls)}: {url}")
            data = await scrape_product(page, url)

In the first part of the main runner, the main function acts as the control center that coordinates the entire scraping flow from start to finish. It begins by setting up the database connection and selecting only those tablet URLs that have not yet been scraped, using a simple status flag to avoid repeating work, and this list of pending URLs is then logged so it is clear how much data remains to be collected. A Playwright browser session is opened and kept alive while each product page is visited one by one, and for every URL the scraper calls the product-level scraping function, which helps to understand how large tasks are broken into smaller, manageable steps that work together smoothly.


            if data:
                try:
                    cursor.execute("""
                        INSERT INTO product_data
                        (url, name, brand, sale_price, scraped_date)
                        VALUES (?, ?, ?, ?, ?)
                    """, (
                        data["url"],
                        data["name"],
                        data["brand"],
                        data["sale_price"],
                        data["scraped_date"]
                    ))

                    cursor.execute(
                        "UPDATE tablets2 SET scraped = 1 WHERE id = ?",
                        (url_id,)
                    )
                    conn.commit()
                    logging.info(f"Saved & marked scraped: {data['name'][:50]}...")
                except Exception as db_err:
                    logging.error(f"DB error for {url}: {db_err}")

        await browser.close()
    conn.close()

The second part focuses on saving the collected product details and updating progress in a reliable way. When valid data is returned, the product information is inserted into the main product table, and the corresponding URL is immediately marked as scraped so it will not be processed again in future runs, which is especially useful for long scraping jobs that may need to be paused and resumed. Any database-related issue is safely logged without stopping the entire process, and once all URLs are processed the browser and database connections are closed cleanly, completing the journey from raw Amazon tablet links to structured, ready-to-use product data.


Entry Point for Starting the Scraping Process

Entry Point

if __name__ == "__main__":
    asyncio.run(main())

"""This block makes sure the script runs only when executed directly"""

This entry point acts as the final trigger that starts the entire scraping workflow at the right moment. By checking if __name__ == "__main__":, the script ensures that the main scraping function runs only when this file is executed directly and not when it is imported into another program, and asyncio.run(main()) then safely launches the asynchronous process that ties together database setup, page scraping, and data storage.


Conclusion


This project demonstrates how a well-planned web scraping workflow can turn a fast-moving Amazon tablets sale page into a clean, reliable dataset, even during a high-traffic event like the Great Indian Festival spread across multiple days. Starting from carefully collecting product URLs, moving through controlled page-by-page data extraction, and ending with thoughtful cleaning and preparation, each stage shows that effective scraping is less about speed and more about patience, structure, and accuracy. By using simple tools such as Playwright, BeautifulSoup, SQLite, and OpenRefine, raw and constantly changing product listings are gradually shaped into organized information that can be analyzed with confidence. More importantly, this approach highlights good data practices—handling errors gracefully, avoiding duplicates, tracking progress, and respecting dynamic website behavior—which are essential lessons for beginners stepping into real-world data engineering. In the end, the project is not just about Amazon tablets, but about understanding how disciplined data collection lays the foundation for meaningful insights and informed decision-making.


AUTHOR


I’m Anusha P O, a Data Science Intern at Datahut, with a strong interest in building automated web data workflows that transform large volumes of online information into clean, analysis-ready datasets.


This blog focuses on extracting data from the Amazon India tablets section, a category that showcases how modern e-commerce platforms organize products across dynamic and frequently updated pages. By working with real tablet listings—covering details such as product names, brands, prices, and rankings—the blog walks through practical techniques for collecting and structuring data in a way that is reliable, beginner friendly, and scalable.


At Datahut, the broader objective is to help businesses unlock value from public web data for use cases like pricing analysis, product research, and competitive insights, and this walk through demonstrates how thoughtful scraping and data organization can turn everyday online listings into meaningful intelligence.


Frequently Asked Questions (FAQs)


1. What is Playwright and why is it used for web scraping Amazon?

Playwright is a modern browser automation framework that allows developers to control browsers like Chromium, Firefox, and WebKit programmatically. It is widely used for web scraping because it can handle dynamic websites, JavaScript-rendered content, and user interactions, making it ideal for extracting product data from complex e-commerce platforms like Amazon.


2. What tablet data can be extracted from Amazon using Playwright?

Using Playwright, you can extract various tablet product details from Amazon such as product name, price, ratings, number of reviews, product specifications, brand name, and product URL. This data can be useful for price monitoring, competitor analysis, and market research.


3. Is it legal to scrape product data from Amazon?

Web scraping legality depends on how the data is collected and used. Publicly available product information can generally be collected for research and analytics, but it’s important to follow Amazon’s terms of service, respect robots.txt guidelines, and avoid aggressive scraping that may harm website performance.


4. Why is Playwright better than traditional scraping libraries?

Unlike traditional scraping libraries that only fetch HTML content, Playwright can fully render JavaScript-driven pages. It simulates real user behavior such as scrolling, clicking, and waiting for elements to load, which makes it more reliable when scraping modern e-commerce websites.


5. What are common challenges when scraping Amazon product data?

Common challenges include dynamic page structures, anti-bot detection mechanisms, CAPTCHA prompts, rate limiting, and frequent layout changes. Using techniques like request delays, rotating user agents, and structured selectors can help improve scraping reliability.
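The mitigations listed in this answer can be expressed as two small helpers that would slot into a scrape loop like the one shown earlier. The user-agent strings and delay range below are illustrative assumptions, and the Playwright call site is only indicated in a comment:

```python
import asyncio
import random

# Illustrative user-agent strings; real rotation would draw from a larger pool
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
]

def pick_user_agent() -> str:
    """Choose a user agent for the next browser context,
    e.g. browser.new_context(user_agent=pick_user_agent())."""
    return random.choice(USER_AGENTS)

async def polite_delay(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Sleep a random interval between page visits to stay under rate limits."""
    delay = random.uniform(min_s, max_s)
    await asyncio.sleep(delay)
    return delay
```

Calling polite_delay() after each page.goto, and creating each context with pick_user_agent(), spreads requests out and varies the browser fingerprint slightly, which is gentler on the target site.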

Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
