
How to Scrape Zepto’s Fruits and Vegetables Data Using Python?

  • Writer: Shahana farvin

Have you ever tried copying product details from a website by hand—one item at a time? If so, you probably know how slow and frustrating it can be. Now, imagine if a robot could do that for you - fast, accurately, and without complaining. That’s exactly what web scraping does. It’s like having a personal assistant that reads through web pages and pulls out the useful bits for you, turning messy website code into neat, usable data.


In this blog, we’re going to explore a real web scraping project. Our goal? To collect product details from the fruits and vegetables section of Zepto, a popular online grocery delivery app in India. If you’ve been looking for a hands-on example of how to scrape a modern website, you’re in the right place.


Why did we pick Zepto?


Zepto is growing fast in the Indian market, and it offers a smooth online shopping experience. But there’s a twist—its website runs on JavaScript, which makes scraping a bit trickier. That’s actually a good thing here because it gives us the chance to learn how to handle such sites using the right tools.


Our Plan to Scrape Zepto: Two Simple Steps


We’ll break down the scraping process into two clear parts:

  1. Link Collection – First, we’ll collect the URLs of all the product pages from the category listings.

  2. Data Collection – Then, we’ll visit each product page and grab the details we need—like the name, price, discount, description, and whether it’s in stock.


By following this step-by-step approach, our code will stay organized, and we’ll have an easier time spotting and fixing errors if anything goes wrong.


The Tools We’ll Use


Here’s what we’ll be working with:

  • Playwright – Think of this as a tool that opens a browser and clicks around the website just like a real person. It’s especially helpful when dealing with websites that load content using JavaScript.

  • BeautifulSoup – This is a Python library that helps us read and pull out specific pieces of data from a web page once it's fully loaded.

  • SQLite – A simple, file-based database where we’ll neatly store all our scraped data. It’s easy to use and doesn’t require any server setup.


In the next sections, I’ll walk you through how we combine these tools to build a working scraping system. Along the way, I’ll share what worked, what didn’t, and what I learned—so you can avoid common mistakes and get better at this, one project at a time.


Links Collection


In this part of the project, we built a simple Python scraper to collect product links from Zepto’s fruits and vegetables section. As mentioned earlier, we used a few helpful tools to make this work: Playwright for loading the website like a real browser, BeautifulSoup for reading the page’s content, and SQLite for storing the links we collect.


Here’s what the scraper actually does behind the scenes: Zepto loads more products as you scroll down the page. So instead of grabbing just what’s visible at first, our scraper scrolls through the page automatically, just like you would if you were browsing. Once the scrolling is finished, it extracts all the product links from the fully loaded page and saves them into a local database. These links will come in handy later when we need to visit each product page to collect detailed information.


The code is structured in a neat and organized way. Each task—like fetching the page content, picking out the product links, saving them, handling errors, and even printing updates—is handled by its own function. This makes the code easier to follow, fix if anything goes wrong, and reuse in future projects.


Now, let’s break the whole process down step by step and take a closer look at how each part works.


Library Imports and Initial Setup

import sqlite3
import logging
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import time
from datetime import datetime

# Configure logging
logging.basicConfig(filename="scraper.log", level=logging.INFO, 
                    format="%(asctime)s - %(levelname)s - %(message)s")

BASE_URL = "https://www.zeptonow.com"

This piece of code sets up everything our scraper needs. We begin by importing a handful of Python libraries: sqlite3 for the database, logging to capture events and errors while the scraper runs, Playwright and BeautifulSoup for loading and parsing pages, time to add pauses between actions so we don't overload the website, and datetime to timestamp our data.


We also set up a logging system that writes to a file called scraper.log. Each entry includes a timestamp, the log level, and the message, which makes it easier to trace what the scraper did and troubleshoot if things go astray.


Finally, we define a constant called BASE_URL, which stores the address of the website. We will use it to build the full product URLs by appending the shorter relative paths that we scrape from the website.


Page Content Fetching

def fetch_page_content(url):
    """
    Fetches the full page content using Playwright with incremental scrolling to load all dynamic content.
    
    This function launches a Chromium browser, navigates to the specified URL, and performs
    scrolling operations to ensure all lazy-loaded content is rendered before returning the page's HTML.
    
    Args:
        url (str): The URL to fetch content from.
        
    Returns:
        str or None: The HTML content of the page if successful, None otherwise.
        
    Raises:
        Various exceptions from Playwright which are caught and logged.
        
    Note:
        The function uses a non-headless browser (visible) which might not be suitable for
        production environments. Change headless=False to headless=True for invisible operation.
    """
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=False)
            page = browser.new_page()
            page.goto(url, timeout=60000)
            
            scroll_position = 0  # Start from the top

            while True:
                # Scroll down in increments
                scroll_position += 800
                page.evaluate(f"window.scrollTo(0, {scroll_position})")
                time.sleep(2)  # Wait for new content to load
                
                # Get new page height after scrolling
                new_height = page.evaluate("document.body.scrollHeight")

                # Stop if we can't scroll further
                if scroll_position >= new_height:
                    break

            content = page.content()
            browser.close()
            logging.info(f"Successfully fetched content from {url}")
            return content
    except Exception as e:
        logging.error(f"Error fetching page content from {url}: {e}")
        return None

This function is designed to handle one of the trickiest parts of scraping modern websites—lazy loading. On sites like Zepto, the full list of products doesn’t appear all at once. Instead, more items load only when you scroll down. So, if we want to grab all the product details, we need a way to scroll automatically, just like a human would.


To do that, we use Playwright, which launches a real browser. We run it in visible mode (headless=False) so you can actually watch how the scraping works in action. The browser is set to wait up to 60 seconds for the page to fully load—this helps on slower internet connections.


The main trick here is the simulated scrolling. The scraper scrolls down the page bit by bit—about 800 pixels at a time—and pauses for 2 seconds between each scroll. That short pause gives the website time to load more content in the background using JavaScript. In most cases, 2 seconds is enough for this to happen smoothly.


As it scrolls, the scraper checks whether it has reached the bottom of the page by comparing how far it's scrolled with the total height of the content. Once all products are loaded and there's nothing more to scroll, the function captures the full HTML content of the page.


After collecting the data, the browser closes, and the HTML content is returned so we can parse it later.


If anything goes wrong—like if the browser crashes or the network fails—the function catches the error, logs what happened, and returns None instead of breaking the whole script. This helps the scraper run smoothly and makes it easier to troubleshoot when needed.
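
The stopping rule described above can be tried out without opening a browser. The sketch below is a minimal simulation, not the scraper's actual code: FakePage is a hypothetical stand-in for a page whose height grows as you scroll, and scroll_to_bottom mirrors the 800-pixel step and the height check from fetch_page_content. The pauses are left out because nothing really needs to load.

```python
class FakePage:
    """Hypothetical stand-in for a lazy-loading page (not a Playwright object)."""

    def __init__(self, initial_height=1000, batch_height=1000, batches=3):
        self.height = initial_height
        self.batch_height = batch_height
        self.batches_left = batches

    def scroll_to(self, y):
        # Load one more batch of products whenever the scroll nears the bottom.
        if self.batches_left > 0 and y >= self.height - 500:
            self.height += self.batch_height
            self.batches_left -= 1


def scroll_to_bottom(page, step=800):
    """Scroll in fixed steps until the position reaches the page height."""
    position = 0
    scrolls = 0
    while True:
        position += step
        page.scroll_to(position)
        scrolls += 1
        # Same check as the scraper: stop once we cannot scroll any further.
        if position >= page.height:
            break
    return scrolls, page.height


scrolls, final_height = scroll_to_bottom(FakePage())
print(f"Stopped after {scrolls} scrolls at height {final_height}")
```

Because the height is re-read after every scroll, the loop keeps going while new batches arrive and only stops when the page has stopped growing.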


Link Extraction

def parse_links(html_content):
    """
    Parses the HTML content and extracts all product links.
    
    This function uses BeautifulSoup to parse the HTML and extract links to product pages,
    specifically targeting elements that match the product card selector.
    
    Args:
        html_content (str): The HTML content to parse.
        
    Returns:
        list: A list of product URLs with the base URL prepended.
        
    Raises:
        Exceptions during parsing are caught and logged.
        
    Note:
        The function specifically targets a (anchor) elements with data-testid="product-card",
        so the selector should be updated if the website structure changes.
    """
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        # Keep only product cards that actually carry an href attribute
        links = [BASE_URL + a['href'] for a in soup.select('div.w-full > div.grow > div > div > a[data-testid="product-card"]') if a.get('href')]
        logging.info(f"Extracted {len(links)} links.")
        return links
    except Exception as e:
        logging.error(f"Error parsing links: {e}")
        return []

Now that we’ve got the full HTML content of the page, the next step is to pull out the useful parts—in this case, the links to individual product pages. This is where BeautifulSoup comes in.


BeautifulSoup takes the raw HTML and turns it into a format that’s much easier to navigate—almost like a family tree of elements. This lets us zoom in on exactly what we need without digging through all the messy code manually.


To find the product links, we use a CSS selector, which is basically a way to tell BeautifulSoup, “Look here!” The selector we use is: 'div.w-full > div.grow > div > div > a[data-testid="product-card"]'.


This points directly to the a tags (which are HTML links) that have the attribute data-testid="product-card". These are the clickable product cards on Zepto’s site. It’s a reliable way to identify them, since this structure stays consistent across the page.


However, the links we collect are relative URLs—they’re not full links yet. So we use a little Python trick called a list comprehension to combine each relative URL with Zepto’s base URL. This gives us a full set of proper product links we can actually use later.


Before the function finishes, it logs how many links it found. That gives us a quick confirmation that the scraping worked. And if something goes wrong during the parsing, it won’t crash the script—it’ll just log the error and return an empty list, so the rest of the code can still keep running smoothly.
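
As a possible refinement of the URL-building step, the standard library's urljoin can combine the base URL with each relative path, and a set can drop repeats before they ever reach the database. This is only a sketch: build_absolute_links is a hypothetical helper, and the sample paths are made up for illustration rather than taken from Zepto.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.zeptonow.com"


def build_absolute_links(hrefs, base_url=BASE_URL):
    """Join relative hrefs to the base URL, skipping blanks and duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:
            continue  # ignore missing or empty href values
        full_url = urljoin(base_url, href)
        if full_url not in seen:
            seen.add(full_url)
            links.append(full_url)
    return links


# Hypothetical relative paths, for illustration only.
sample_hrefs = ["/product/example-apple", "/product/example-apple", None, "/product/example-banana"]
absolute_links = build_absolute_links(sample_hrefs)
print(absolute_links)
```

Unlike plain string concatenation, urljoin also copes with hrefs that are already absolute, so the result stays valid if the site ever changes how it writes its links.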


Database Storage

def save_links_to_db(links, category, db_name="scraped_links.db"):
    """
    Saves the extracted links with category information to an SQLite database.
    
    This function creates a database (if it doesn't exist) and adds the extracted links
    along with their category, the current date, and a default 'scraped' status of 0 (unscraped).
    It handles duplicate links by ignoring them rather than raising an error.
    
    Args:
        links (list): A list of URLs to save.
        category (str): The category label for these links (e.g., "fruits", "vegetables").
        db_name (str, optional): The name of the SQLite database file. Defaults to "scraped_links.db".
        
    Returns:
        None
        
    Raises:
        Various database exceptions which are caught and logged.
        
    Database Schema:
        - id: Primary key, auto-incrementing integer
        - url: The product URL (unique)
        - category: The product category
        - scraped_date: The date when the link was added to the database
        - scraped: Integer flag (0=unscraped, 1=scraped)
    """
    try:
        conn = sqlite3.connect(db_name)
        cursor = conn.cursor()

        # Create table with a 'scraped' column (default 0) if not exists
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS links (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT UNIQUE,
                category TEXT,
                scraped_date TEXT,
                scraped INTEGER DEFAULT 0
            )
        """)
        
        current_date = datetime.now().strftime("%Y-%m-%d")  # Get today's date in YYYY-MM-DD format

        for link in links:
            try:
                cursor.execute("INSERT INTO links (url, category, scraped_date, scraped) VALUES (?, ?, ?, ?)", 
                               (link, category, current_date, 0))
            except sqlite3.IntegrityError:
                logging.warning(f"Duplicate link ignored: {link}")
        
        conn.commit()
        conn.close()
        logging.info("Links saved successfully to the database.")
    except Exception as e:
        logging.error(f"Error saving links to database: {e}")

Once we’ve collected the product links, we need a safe place to store them—somewhere they won’t get lost and can be reused later. That’s where this function comes in. It creates an SQLite database to save all the scraped links in an organized way.


The first time this function runs, it creates a new .db file (a simple database file on your computer), along with a table to hold our product link data.


Here’s what each column in the table does:

  • id – A unique number for each entry. It helps us keep track of the rows.

  • url – The actual product link. This is marked as UNIQUE, so we don’t accidentally save the same link more than once.

  • category – The product type, like "fruits" or "vegetables".

  • scraped_date – The date when the link was scraped, in a standard YYYY-MM-DD format.

  • scraped – A flag that’s either 0 or 1. It tells us if this link has already been scraped for product details or not.


That last column, scraped, is especially useful when we’re splitting our scraping into two steps: first, collecting links, and later, visiting each link to get more information. When links are first added, scraped is set to 0. After we process them in the second step, we update it to 1 so we don’t scrape the same page twice.


Sometimes, we might try to add a link that’s already in the database. When this happens, an IntegrityError is raised because of the UNIQUE setting on the URL column. That’s expected, so we handle it gently using a try-except block. Instead of stopping the whole program, we simply log a message and move on. We log it with logging.warning so duplicates are easy to spot in the log file, even though they are a normal part of the scraping process and nothing to worry about.


Finally, once all the links are added, the function saves the changes and closes the connection to the database properly. This ensures our data is stored safely and ready for the next step.
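
A different way to get the same duplicate handling is to let SQLite skip the conflicting rows itself with INSERT OR IGNORE, instead of catching IntegrityError in Python. The sketch below is not the article's function: save_links is a hypothetical variant that takes an open connection, and it runs against an in-memory database with example.com URLs so it can be tried safely.

```python
import sqlite3


def save_links(conn, links, category, scraped_date):
    """Insert links, letting SQLite skip rows that break the UNIQUE constraint."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS links (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE,
            category TEXT,
            scraped_date TEXT,
            scraped INTEGER DEFAULT 0
        )
    """)
    before = conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO links (url, category, scraped_date) VALUES (?, ?, ?)",
        [(link, category, scraped_date) for link in links],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]
    return after - before  # how many links were actually new


conn = sqlite3.connect(":memory:")  # throwaway database for the demo
demo_links = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
first_run = save_links(conn, demo_links, "fruits", "2025-01-01")
second_run = save_links(conn, demo_links, "fruits", "2025-01-01")
print(first_run, second_run)
```

The trade-off is that duplicates are skipped silently, so you lose the per-link log message; returning the count of new rows is one way to keep some visibility.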


Main Execution Logic

def main():
    """
    Main function that orchestrates the scraping process.
    
    This function defines the target URLs to scrape, along with their corresponding categories.
    For each URL, it:
    1. Fetches the page content
    2. Parses the content to extract product links
    3. Saves the extracted links to the database
    
    Returns:
        None
        
    Note:
        The URLs are hardcoded in this function. For more flexibility, consider
        loading them from a configuration file or command-line arguments.
    """
    urls = {
        "https://www.zeptonow.com/cn/fruits-vegetables/fresh-vegetables/cid/64374cfe-d06f-4a01-898e-c07c46462c36/scid/b4827798-fcb6-4520-ba5b-0f2bd9bd7208": "vegetables",
        "https://www.zeptonow.com/cn/fruits-vegetables/fresh-fruits/cid/64374cfe-d06f-4a01-898e-c07c46462c36/scid/09e63c15-e5f7-4712-9ff8-513250b79942": "fruits"
    }
    
    for url, category in urls.items():
        html_content = fetch_page_content(url)
        if html_content:
            links = parse_links(html_content)
            if links:
                save_links_to_db(links, category)
    
    print("Links saved successfully.")

if __name__ == "__main__":
    main()

The main() function is like the brain of our scraper: it controls how everything runs from start to finish. It starts by creating a simple dictionary that maps each category page URL to its category label (like fruits or vegetables). This setup makes it easy to add new categories or rename existing ones later, without having to change the rest of the code.


Once the setup is ready, the function goes through each category one by one and follows these steps:

  1. It first uses the scrolling function to load the full page content.

  2. If the content loads correctly, it moves on to extract product links from the page using our parsing function.

  3. If product links are found, they’re saved into the database, along with the category they belong to.


After each step, we use if conditions to check whether things worked as expected before moving forward. This helps avoid a chain reaction of errors—if one part fails, the scraper simply skips the next step instead of crashing.


There’s also a small but important detail: the main() function only runs when this script is executed directly, not when it's imported into another file. This is a best practice in Python that helps keep your code clean, modular, and reusable.


Finally, when the loop finishes, the function prints a simple confirmation message. Keep in mind that this message is printed unconditionally, even if a category failed to load, so it’s worth checking scraper.log to confirm that every category was actually fetched and saved.
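
One way to make the closing message reflect what really happened is to have the loop return a small summary. The sketch below assumes the three steps are passed in as functions; run_link_collection and the stub functions are hypothetical and only stand in for fetch_page_content, parse_links, and save_links_to_db.

```python
def run_link_collection(targets, fetch, parse, save):
    """Run fetch -> parse -> save for each target and summarize the outcome."""
    summary = {"links_found": 0, "failed_urls": []}
    for url, category in targets.items():
        html = fetch(url)
        links = parse(html) if html else []
        if not links:
            # Either the page did not load or no product links were found.
            summary["failed_urls"].append(url)
            continue
        save(links, category)
        summary["links_found"] += len(links)
    return summary


# Stubs for the demo: only the fruits page "loads" successfully.
pages = {"https://example.com/fruits": "<html>fruits</html>"}
saved = []
summary = run_link_collection(
    targets={
        "https://example.com/fruits": "fruits",
        "https://example.com/vegetables": "vegetables",
    },
    fetch=pages.get,
    parse=lambda html: ["https://example.com/p/1", "https://example.com/p/2"],
    save=lambda links, category: saved.append((category, len(links))),
)
print(summary)
```

With a summary like this, main() could print how many links were found and which category pages need another attempt.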


Data Collection


After collecting all the product links from Zepto’s fruits and vegetables section, the next step is to visit each of those links and pull out detailed product information. This part of the scraper is designed to do just that—it opens each product page, grabs the important details, and saves them neatly into our database.


The scraper works by going through one product at a time, taking short pauses between each request. These small delays help us scrape responsibly, without putting too much pressure on the website’s servers.
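
A pause between requests could look something like the sketch below. The helper name polite_pause and the 2 to 5 second range are assumptions made for illustration, not values taken from the scraper.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random time between min_seconds and max_seconds; return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Tiny range so the demo finishes almost instantly.
waited = polite_pause(0.01, 0.02)
print(f"Paused for {waited:.3f} seconds")
```

Randomizing the delay spreads requests out less predictably than a fixed pause, and passing the sleep function in as a parameter makes the helper easy to test without actually waiting.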


The product scraper is built using a modular design, which means each task is handled by its own function. Some functions connect to the database, others fetch the web page, some parse the product information, and a few manage the overall workflow. This makes the code easier to read, test, and update later if needed.


In the sections that follow, we’ll break down each part of this scraper and walk through how it works—from grabbing the page to saving the final results.


Setup and Configuration

import sqlite3
import logging
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import time
import json
import random
from datetime import datetime

# Configure logging
logging.basicConfig(filename="product_scraper.log", level=logging.INFO, 
                    format="%(asctime)s - %(levelname)s - %(message)s")

DB_NAME = "scraped_links.db"

Before we can start building the product scraper, we need to bring in the same set of Python libraries we used in the first part, plus two more: json, which we use to store structured details as JSON strings, and random.


We also set up logging, which writes messages and errors into a file called "product_scraper.log". Since this part of the scraper runs separately from the link collector, logging gives us a way to monitor what’s going on and spot issues if they come up.


Lastly, we define a constant called DB_NAME. This tells our scraper which database file to use, so all parts of the code stay in sync and work with the same data throughout the process.


Database Operations for Link Retrieval

def get_unscraped_links():
    """
    Fetch all links from the database where scraped = 0.
    
    Queries the 'links' table to retrieve records that haven't been processed yet.
    
    Returns:
        list: A list of tuples containing (id, url, category) for each unscraped link.
        
    Raises:
        Various database exceptions which are caught and logged.
        
    Note:
        This function relies on the database structure created by the link scraper.
        The 'links' table should have columns for id, url, category, and scraped status.
    """
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        cursor.execute("SELECT id, url, category FROM links WHERE scraped = 0")
        links = cursor.fetchall()  # List of tuples: (id, url, category)
        conn.close()
        logging.info(f"Fetched {len(links)} unscraped links from the database.")
        return links
    except Exception as e:
        logging.error(f"Error fetching unscraped links: {e}")
        return []

This part of the code is all about getting the list of product links that still need to be scraped. To do that, we connect to our SQLite database and fetch every product entry where the scraped value is set to 0—which means the data hasn’t been collected from those pages yet.


For each of these entries, we pull out three things:

  • The ID, which helps us later mark the product as "done" once it’s scraped

  • The URL, which points to the product’s page

  • The category, which tells us whether it’s a fruit, vegetable, or something else


This setup acts like a simple to-do list inside our database. Rather than storing links in a file or a long list in memory, the database keeps track of what’s left to do. So if the scraper stops midway—say due to a network issue—we can run it again later and it’ll simply continue from where it left off.


It’s a practical way to make sure no work is repeated, and every product gets scraped exactly once.
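
The to-do list idea only works if each link is flipped to scraped = 1 once its product has been saved. A minimal sketch of that round trip is shown below, using an in-memory database and example.com URLs. The helper names get_unscraped and mark_link_scraped are hypothetical; the table layout follows the links table created by the link collector.

```python
import sqlite3


def get_unscraped(conn):
    """Return (id, url, category) for every link that is still pending."""
    query = "SELECT id, url, category FROM links WHERE scraped = 0 ORDER BY id"
    return conn.execute(query).fetchall()


def mark_link_scraped(conn, link_id):
    """Flip the scraped flag so the link is not picked up again."""
    conn.execute("UPDATE links SET scraped = 1 WHERE id = ?", (link_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")  # throwaway database for the demo
conn.execute("""
    CREATE TABLE links (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT UNIQUE,
        category TEXT,
        scraped_date TEXT,
        scraped INTEGER DEFAULT 0
    )
""")
conn.executemany(
    "INSERT INTO links (url, category, scraped_date) VALUES (?, ?, ?)",
    [("https://example.com/a", "fruits", "2025-01-01"),
     ("https://example.com/b", "vegetables", "2025-01-01")],
)
conn.commit()

todo = get_unscraped(conn)
mark_link_scraped(conn, todo[0][0])  # pretend the first product was saved
remaining = get_unscraped(conn)
print(len(todo), len(remaining))
```

If the script is interrupted between two products, the next run simply starts with whatever get_unscraped still returns.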


Content Fetching

def fetch_page_content(url):
    """
    Fetches the full product page content using Playwright.
    
    Launches a headless browser to render JavaScript and fetch the complete HTML content
    of a product page, ensuring all dynamic elements are loaded.
    
    Args:
        url (str): The URL of the product page to fetch.
        
    Returns:
        str or None: The HTML content of the page if successful, None otherwise.
        
    Raises:
        Various exceptions from Playwright which are caught and logged.
        
    Note:
        Includes a 5-second delay to allow JavaScript content to fully load.
    """
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, timeout=60000)
            time.sleep(5)  # Give some time for JavaScript content to load
            content = page.content()
            browser.close()
            logging.info(f"Successfully fetched page content for {url}")
            return content
    except Exception as e:
        logging.error(f"Error fetching page content from {url}: {e}")
        return None

This function is responsible for loading individual product pages in the background so we can scrape the details we need. It uses a headless browser, which means it runs without opening a visible browser window. This helps things run faster and more efficiently, especially since we don’t need to scroll or interact with the page—we just need to wait for it to fully load.


Once the page opens, the scraper waits for 5 seconds to give the website enough time to load everything, including any dynamic content like the product’s price or description. We also set a 60-second timeout, just in case the page is slow or your internet connection takes a little longer.


If something goes wrong, such as the page failing to load or timing out, the function doesn’t crash the whole program. Instead, it logs the error and returns None, so the calling code can skip that product and move on to the next one. This way, a few failed pages won’t stop the entire scraping process. It’s a smart way to keep things running smoothly, even when there are occasional hiccups.


Parsing Functions

def parse_product_name(soup):
    """
    Extracts the product name from the page.
    
    Args:
        soup (BeautifulSoup): The parsed HTML of the product page.
        
    Returns:
        str: The product name if found, 'N/A' otherwise.
        
    Note:
        Uses a specific CSS selector that may need updating if the website structure changes.
    """
    try:
        return soup.select_one("#product-features-wrapper > div:nth-child(1) > div > div.mt-2.flex.items-center.justify-between.gap-6 > h1").text.strip()
    except AttributeError:
        logging.warning("Product name not found.")
        return "N/A"

In this part of the scraper, we’ve created a group of small helper functions, where each one focuses on pulling out just one piece of product information—like the name, price, or stock status.


This approach keeps the code organized and easy to manage. If something breaks—say the website layout changes slightly—we only need to fix the specific function related to that data point, without touching the rest of the scraper.


Each function follows the same basic steps:

  1. It looks for a specific element on the page using a CSS selector.

  2. It grabs the text inside that element.

  3. It removes any extra spaces around the text.

  4. It returns the final value.


If the function can’t find what it’s looking for—maybe the element is missing or the layout changed—it won’t crash the program. Instead, it logs a warning and returns "N/A". This way, the scraper keeps going and collects everything else that’s still available.


The CSS selectors used in these functions were carefully picked by inspecting Zepto’s product page layout using browser tools like “Inspect Element.” For example, one function grabs the product name, while others pull out the net quantity, discounted price, original price, and availability status—all in the same reliable way.


By handling each piece of data separately, we make the scraper much easier to read, test, and update when needed.
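
Since every helper follows the same four steps, the shared logic could be pulled into one small function. The sketch below is an assumption about how that might look, not code from the scraper: extract_text is a hypothetical helper, and FakeSoup and FakeElement are tiny stand-ins so the example runs without a real page. A real BeautifulSoup object offers the same select_one and get_text methods.

```python
import logging


def extract_text(soup, selector, label, default="N/A"):
    """Return the stripped text of the first match, or default if missing."""
    element = soup.select_one(selector)
    if element is None:
        logging.warning("%s not found.", label)
        return default
    return element.get_text(strip=True)


class FakeElement:
    """Stand-in for a parsed tag; only supports get_text()."""

    def __init__(self, text):
        self.text = text

    def get_text(self, strip=False):
        return self.text.strip() if strip else self.text


class FakeSoup:
    """Stand-in for a parsed page; maps selectors to elements."""

    def __init__(self, elements):
        self.elements = elements

    def select_one(self, selector):
        return self.elements.get(selector)


page = FakeSoup({"h1": FakeElement("  Example Apple  ")})
name = extract_text(page, "h1", "Product name")
price = extract_text(page, "span.price", "Sale price")
print(name, price)
```

Checking for None explicitly, rather than catching AttributeError, keeps the fallback limited to the one case it is meant for: the element is not on the page.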


Complex Data Parsing

def parse_product_highlights(soup):
    """
    Extracts the product highlights section with key-value pairs.
    
    This function identifies the product highlights section and extracts all key-value
    pairs found within it, returning them as a JSON string.
    
    Args:
        soup (BeautifulSoup): The parsed HTML of the product page.
        
    Returns:
        str: A JSON string containing key-value pairs of product highlights.
             Returns '{}' (empty JSON object) if no highlights are found or in case of errors.
             
    Raises:
        Exceptions during parsing are caught, logged, and an empty JSON object is returned.
        
    Note:
        The function expects a specific HTML structure with h3 elements for keys and
        p elements for values within each highlight div.
    """
    try:
        highlights_section = soup.select("#productHighlights > div > div > div.flex.flex-col.gap-8 > div.flex.items-start.gap-3")
        
        if not highlights_section:
            logging.warning("No product highlights found in the given HTML.")
            return json.dumps({})
        
        highlights = {}
        
        for div in highlights_section:
            try:
                key_element = div.find("h3")
                value_element = div.find("p")
                
                if key_element and value_element:
                    key = key_element.get_text(strip=True)
                    value = value_element.get_text(strip=True)
                    highlights[key] = value
                else:
                    logging.warning("Missing key-value pair in a product highlight div.")
            except Exception as e:
                logging.error(f"Error processing a highlight div: {e}")
                continue
        
        return json.dumps(highlights, indent=4)
    
    except Exception as e:
        logging.error(f"Error parsing product highlights: {e}")
        return json.dumps({})

For more detailed information—like product highlights—our scraper needs to go beyond just grabbing plain text. These details often come in pairs, like a heading and a description, so we use a special method to collect them as key-value pairs, and then turn them into a JSON string. This makes the data easier to store, read, and use later on.


Here’s how it works:


The function starts by finding the highlights section on the page using a CSS selector. Once found, it loops through each item in that section—grabbing the heading (which becomes the key) and the description (which becomes the value). Each pair is added to a dictionary. After collecting everything, the dictionary is neatly converted into a formatted JSON string.


This structure is really helpful when the product has multiple features or specifications. Rather than trying to force everything into one long string, we get clean, organized data that’s easy to work with.


The function also uses nested try/except blocks to handle errors. The outer block covers the whole function: if no highlights section is found, or something unexpected goes wrong, it logs the problem and returns an empty JSON object. The inner block handles each item on its own, so if one item is missing its heading or description, or raises an error, only that item is skipped and the loop carries on with the rest.


We use the same method for another section on the page too—the product info section. This keeps everything consistent and helps us capture more structured data where needed.
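
Because the highlights are stored as a JSON string, reading them back later is a matter of decoding that string into a dictionary again. The sketch below shows one way to do that defensively. The helper name load_highlights and the sample keys are made up for illustration.

```python
import json


def load_highlights(raw):
    """Decode a stored highlights string; fall back to {} on bad input."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers malformed JSON.
        return {}
    return data if isinstance(data, dict) else {}


# Hypothetical highlights, encoded the same way the scraper does it.
stored = json.dumps({"Shelf Life": "3 days", "Unit": "1 kg"}, indent=4)
highlights = load_highlights(stored)
print(highlights.get("Shelf Life", "N/A"))
```

The empty-dictionary fallback matches what the scraper stores when no highlights are found, so downstream code can treat both cases the same way.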


Combined Product Details Extraction

def parse_product_details(html_content):
    """
    Parses all product details from the HTML content.
    
    This function serves as the main parser that coordinates the extraction of all product
    details by calling individual parsing functions for each data point.
    
    Args:
        html_content (str): The HTML content of the product page.
        
    Returns:
        dict or None: A dictionary containing all extracted product details if successful,
                     None otherwise.
        
    Raises:
        Exceptions during parsing are caught, logged, and None is returned.
        
    Note:
        The returned dictionary contains keys for name, net_quantity, sale_price,
        product_price, in_stock, highlights, and information.
    """
    try:
        soup = BeautifulSoup(html_content, 'html.parser')

        product_data = {
            "name": parse_product_name(soup),
            "net_quantity": parse_net_quantity(soup),
            "sale_price": parse_sale_price(soup),
            "product_price": parse_product_price(soup),
            "in_stock":parse_stock(soup),
            "highlights":parse_product_highlights(soup),
            "information":parse_product_info(soup)
        }

        logging.info(f"Extracted product details: {product_data}")
        return product_data
    except Exception as e:
        logging.error(f"Error parsing product details: {e}")
        return None

The main parsing function is the part that brings everything together. Its job is to take the raw HTML from a product page and coordinate all the smaller functions to collect complete product information.


It starts by turning the raw HTML into a BeautifulSoup object, which makes it much easier to search and extract specific elements from the page.


Once the page is ready, the function calls each of our smaller helper functions—like the ones that get the product name, prices, highlights, and availability. Each of these functions pulls out one piece of information, and their results are all combined into a single, well-structured dictionary. This dictionary holds everything we’ve scraped for that product in one place.


To make sure things run smoothly, the function includes error handling. If something unexpected happens while parsing the page, the error is caught and logged, so the scraper doesn’t break.


Before returning the final result, the function also logs all the details it collected. This is really helpful when you want to double-check your results or troubleshoot if something doesn’t look right. It gives you a clear view of what was successfully scraped for each product.


Database Operations for Product Storage

def save_product_data(product_data, category, scraped_date, url):
    """
    Saves product details into the database.
    
    Creates a 'products' table if it doesn't exist and inserts the extracted product
    data along with metadata such as category and scrape date.
    
    Args:
        product_data (dict): Dictionary containing product details.
        category (str): The product category.
        scraped_date (str): The date when the product was scraped (YYYY-MM-DD format).
        url (str): The URL of the product page.
        
    Returns:
        bool: True if data was successfully saved, False otherwise.
        
    Raises:
        Various database exceptions which are caught and logged.
        
    Note:
        The return value is used to determine whether to mark the link as scraped
        in the 'links' table. If saving fails, the link remains unscraped so it can
        be retried later.
    """
    conn = None
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()

        # Create table if it doesn't exist
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                url TEXT,
                name TEXT,
                net_quantity TEXT,
                sale_price TEXT,
                product_price TEXT,
                in_stock TEXT,
                category TEXT,
                highlights TEXT,
                information TEXT,
                scraped_date TEXT
            )
        """)

        # Insert one row; the ? placeholders keep the values parameterized
        cursor.execute("""
            INSERT INTO products (url, name, net_quantity, sale_price, product_price, in_stock, category, highlights, information, scraped_date) 
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (url, product_data["name"], product_data["net_quantity"], product_data["sale_price"], product_data["product_price"], product_data["in_stock"], category, product_data["highlights"], product_data["information"], scraped_date))

        conn.commit()
        logging.info(f"Product data saved successfully for {url}")
        return True  # Indicate success
    except sqlite3.IntegrityError:
        # Only reachable if the table defines a UNIQUE or PRIMARY KEY constraint;
        # the schema above has none, so duplicate URLs are currently inserted as new rows
        logging.warning(f"Duplicate product ignored: {url}")
        return False  # Do not mark as scraped
    except Exception as e:
        logging.error(f"Error saving product data to database: {e}")
        return False  # Do not mark as scraped
    finally:
        # Close the connection on both the success and the failure paths
        if conn is not None:
            conn.close()

This function handles the job of saving the scraped product data into our SQLite database. It starts by checking if the products table already exists. If it doesn’t, the function creates the table using the same structure as the fields we’ve collected—like product name, price, stock status, and so on.


For more detailed fields, like highlights and product info, which hold multiple pieces of data, we store them as JSON strings. This keeps their structure intact inside the database, making it easier to work with later.
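The JSON round trip can be sketched as follows, assuming the highlights are held as a Python dictionary before saving. The in-memory database, the `demo` table, and the sample values are illustrative only. If the parsing functions earlier in this post already return JSON strings, the `json.dumps` step is not needed.

```python
import json
import sqlite3

# Hypothetical sample data; real highlights come from the parsing functions
highlights = {"Shelf Life": "3 days", "Unit": "1 kg"}

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
conn.execute("CREATE TABLE demo (url TEXT, highlights TEXT)")

# Serialize the dictionary to a JSON string before inserting it
conn.execute(
    "INSERT INTO demo (url, highlights) VALUES (?, ?)",
    ("https://example.com/p/1", json.dumps(highlights, ensure_ascii=False)),
)
conn.commit()

# Read the row back and rebuild the original dictionary
stored = conn.execute("SELECT highlights FROM demo").fetchone()[0]
restored = json.loads(stored)
conn.close()
```

Storing the value as text keeps the schema simple, and `json.loads` brings the nested structure back when the data is analysed later.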


Once the table is ready, the function tries to save the product data. If everything is stored correctly, it returns True. If something goes wrong, it returns False.


This return value is important. It lets the scraper know whether it’s safe to mark the product link as “processed.” We only mark links as done when we’re sure their data has been saved properly. That way, we avoid losing any products due to errors or interruptions, and we keep the scraper accurate and reliable.
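One detail worth knowing: SQLite raises `sqlite3.IntegrityError` only when a constraint is violated, so the duplicate check depends on the table declaring one. The sketch below is a variation, not the schema used in this post. It assumes a hypothetical `products_demo` table with a `UNIQUE` constraint on `url` to show how a repeated insert would be rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
# Assumption: url is declared UNIQUE, unlike the products table in this post
conn.execute("CREATE TABLE products_demo (url TEXT UNIQUE, name TEXT)")

def insert_once(conn, url, name):
    """Return True if the row was inserted, False if the URL already exists."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO products_demo (url, name) VALUES (?, ?)",
                (url, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = insert_once(conn, "https://example.com/p/1", "Banana")
second = insert_once(conn, "https://example.com/p/1", "Banana")
row_count = conn.execute("SELECT COUNT(*) FROM products_demo").fetchone()[0]
```

With the constraint in place the second insert is refused and the table keeps a single row per URL.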


Status Update Function

def mark_link_as_scraped(link_id):
    """
    Updates the `scraped` column in the `links` table to mark the link as processed.
    
    After a product page has been successfully processed and its data saved, this function
    marks the corresponding link as scraped (scraped=1) to prevent it from being processed again.
    
    Args:
        link_id (int): The ID of the link in the 'links' table.
        
    Returns:
        None
        
    Raises:
        Various database exceptions which are caught and logged.
        
    Note:
        This function relies on the database structure created by the link scraper.
    """
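    conn = None
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        # Parameterized query: only the row with this ID is updated
        cursor.execute("UPDATE links SET scraped = 1 WHERE id = ?", (link_id,))
        conn.commit()
        logging.info(f"Marked link ID {link_id} as scraped.")
    except Exception as e:
        logging.error(f"Error updating scraped status for link ID {link_id}: {e}")
    finally:
        # Close the connection whether or not the update succeeded
        if conn is not None:
            conn.close()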

This function is used to mark a product link as processed once its data has been successfully scraped and saved. It does this by updating the scraped field in the links table, changing its value from 0 to 1.


By keeping track of which links are already processed, the scraper can skip them in future runs. This way, we avoid repeating work and keep everything running efficiently—even if the scraper is stopped and restarted.


To make the update safe and precise, the function uses a parameterized SQL query, which updates only the row that matches the given link’s unique ID. This ensures that only the correct entry is changed, without affecting any others. It’s a clean and reliable way to manage our progress and make the scraper more sustainable over time.
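The resume behaviour can be sketched with an in-memory database. The `links` schema below (id, url, category, scraped) is an assumption based on how the table is used in this post; the real table is created by the link scraper. The helper names `pending_links` and `mark_done` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
conn.execute("""
    CREATE TABLE links (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT,
        category TEXT,
        scraped INTEGER DEFAULT 0
    )
""")
conn.executemany(
    "INSERT INTO links (url, category) VALUES (?, ?)",
    [
        ("https://example.com/p/1", "Fruits"),
        ("https://example.com/p/2", "Vegetables"),
    ],
)
conn.commit()

def pending_links(conn):
    """Return (id, url, category) for every link still marked scraped = 0."""
    return conn.execute(
        "SELECT id, url, category FROM links WHERE scraped = 0 ORDER BY id"
    ).fetchall()

def mark_done(conn, link_id):
    """Flip the scraped flag for exactly one row, selected by its ID."""
    conn.execute("UPDATE links SET scraped = 1 WHERE id = ?", (link_id,))
    conn.commit()

before = pending_links(conn)   # both links are pending
mark_done(conn, before[0][0])  # pretend the first product was saved
after = pending_links(conn)    # only the second link remains
```

If the script stopped at this point and started again, the next call to `pending_links` would return only the second URL.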


Main Scraping Workflow

def scrape_products():
    """
    Main function to scrape product details from unscraped links.
    
    This function orchestrates the entire scraping process:
    1. Fetches unscraped links from the database
    2. For each link, fetches the product page content
    3. Parses product details from the page content
    4. Saves the data to the database
    5. Marks the link as scraped if the save was successful
    6. Implements random delays between requests to avoid detection
    
    Returns:
        None
        
    Note:
        This function implements error handling and logging at each step.
        It uses random delays between 7-10 seconds to avoid overloading the server
        and to reduce the risk of being detected as a bot.
    """
    unscraped_links = get_unscraped_links()

    if not unscraped_links:
        logging.info("No unscraped links found. Exiting scraper.")
        return

    for link_id, url, category in unscraped_links:
        logging.info(f"Processing {url} (Category: {category})")
        html_content = fetch_page_content(url)

        if html_content:
            product_data = parse_product_details(html_content)
            if product_data:
                scraped_date = datetime.now().strftime("%Y-%m-%d")
                
                # Only mark as scraped if saving was successful
                if save_product_data(product_data, category, scraped_date, url):
                    mark_link_as_scraped(link_id)
                    logging.info(f"Successfully saved and marked {url} as scraped.")
                else:
                    logging.warning(f"Skipping marking {url} as scraped due to save failure.")
        time.sleep(random.uniform(7,10))  # Random delay between processing requests
    logging.info("Scraping completed for all available links.")


if __name__ == "__main__":
    scrape_products()

This function is the main controller for the entire product scraping process. It runs through each step in a clear, orderly way and keeps everything running smoothly from start to finish.


First, it grabs the list of product links from the database that have not been successfully scraped yet (those still marked with scraped = 0). Then, it processes each link one at a time, following this simple and reliable workflow:

  1. Load the product page using our page loader.

  2. Parse the product information using the small helper functions we created earlier.

  3. Save the collected data into the database.

  4. Update the product link to show that it’s been successfully scraped.


This process is designed to be resilient and careful. If there are no links left to scrape, the function stops early—saving time and resources. And if something goes wrong while processing a link (like the page doesn’t load or parsing fails), the link isn’t marked as done. That way, we can try it again later without losing any data due to temporary issues.

To stay polite and avoid being flagged as a bot, the function also waits a random 7 to 10 seconds after each link. This helps reduce the load on the website’s server and makes the scraper less likely to be blocked or rate-limited.
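The delay logic can be pulled into a small helper, sketched below. The name `polite_sleep` and the injectable `sleep` parameter are assumptions added for illustration; the parameter only exists so the helper can be tried out without actually waiting.

```python
import random
import time

def polite_sleep(min_seconds=7.0, max_seconds=10.0, sleep=time.sleep):
    """Pause for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)  # time.sleep by default; replaceable for testing
    return delay
```

Inside the loop, `polite_sleep()` would take the place of the direct `time.sleep(random.uniform(7,10))` call, and the returned value can be logged to see how long each pause lasted.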


Finally, the if __name__ == "__main__" guard means scrape_products() runs only when the script is executed directly. The same file can therefore be used as a standalone tool or imported into a larger scraping system without starting a scrape on import.


Conclusion


Web scraping helps turn messy website content into clean, usable data. In this project, we scraped product details from Zepto’s Fruits and Vegetables section using Playwright, BeautifulSoup, and SQLite.


We learned how to handle JavaScript-loaded content with Playwright, save data neatly using SQLite, and scrape responsibly by adding random delays and error handling. By splitting the process into two steps—collecting links first, then scraping product details—we made the scraper more reliable and easier to manage.


This approach works well not just for product data, but also for price tracking, market research, and more. Just remember: as websites evolve, scrapers should too—and always be respectful and ethical in how you use them.


Author


I’m Shahana, a Data Engineer at Datahut, where I specialize in building reliable, scalable data pipelines that convert complex web content into clean, structured datasets—particularly for industries like e-commerce, grocery delivery, and retail analytics.


At Datahut, we help clients automate data extraction from modern websites, including those that use JavaScript and infinite scrolling. In this blog, I walked through a real-world scraping project focused on Zepto’s Fruits and Vegetables section. We used Playwright, BeautifulSoup, and SQLite to build a solution that handles dynamic content, stores data efficiently, and follows responsible scraping practices—all while keeping the code clean and beginner-friendly.


If your team is looking to automate product data collection in the grocery space or beyond, reach out to us through the chat widget on the right. We’d love to help you build a solution that fits your goals.



Frequently Asked Questions


FAQ 1: Is it legal to scrape data from Zepto?

  • Zepto's publicly visible product data (names, prices, availability) is generally accessible, but always review the website's Terms of Service before scraping

  • Avoid scraping personal user data, login-protected pages, or any content explicitly restricted in Zepto's robots.txt file

  • Use scraping responsibly — add delays between requests, avoid overloading the server, and use the data only for lawful purposes like research or price monitoring

  • For a deeper understanding, refer to Datahut's guide on whether scraping e-commerce websites is legal


FAQ 2: Why do we need Playwright instead of a simple requests library for Zepto?

  • Zepto is a JavaScript-heavy website — its product listings don't exist in the raw HTML source; they load dynamically in the browser

  • A basic requests call only fetches the static HTML, which means you'd get an empty or incomplete page with no products

  • Playwright launches a real Chromium browser, waits for JavaScript to execute, and captures the fully rendered page — just like a human visiting the site

  • It can also be scripted to scroll the page, which is essential since Zepto loads more products as you scroll down. Playwright does not do this on its own; the scraper has to issue the scroll steps
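A scroll loop for infinite-scrolling pages can be sketched as below. This is not the scrolling code from the link scraper in this post; the function name `scroll_to_bottom` and the default values for `max_rounds` and `pause_ms` are assumptions. It relies only on two Playwright page methods, `page.evaluate` and `page.wait_for_timeout`.

```python
def scroll_to_bottom(page, max_rounds=20, pause_ms=1500):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scroll rounds that were performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the page to trigger lazy loading
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Give newly requested products time to render
        page.wait_for_timeout(pause_ms)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we reached the end
        last_height = new_height
    return rounds
```

With a real Playwright page, the function would be called after the category page has loaded and before reading the rendered HTML. The `max_rounds` cap keeps the loop from running forever on a page that never stops growing.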


FAQ 3: What happens if the scraper stops midway through collecting data?

  • Because we use SQLite with a scraped flag (0 = pending, 1 = done), the scraper knows exactly where it left off

  • On restart, it simply queries for all links where scraped = 0 and continues from there — no data is lost or duplicated

  • A link is only marked as scraped after its product data has been successfully saved, so a failed save is retried automatically on the next run

  • This two-step design (links first, then product details) is specifically built for resilience against interruptions


FAQ 4: Can this scraper be adapted for other categories or grocery apps?

  • Yes — for other Zepto categories, simply add the new category URL and a label to the urls dictionary in the main() function

  • For other grocery apps like Blinkit or BigBasket, the core structure (Playwright + BeautifulSoup + SQLite) stays the same; only the CSS selectors and URLs need to be updated to match the new site's layout

  • The modular design — where each function handles one task — makes it straightforward to swap out or update individual parts without rewriting the whole script

  • Datahut has published a similar project for Blinkit's fruits and vegetables section that follows the same approach


FAQ 5: How do we handle changes in Zepto's website structure?

  • Zepto may update its HTML layout, CSS class names, or data attributes over time — when this happens, CSS selectors used in the parsing functions will stop matching and return "N/A"

  • The fix is to open the updated product page in Chrome, right-click the element you need, select "Inspect," and copy the new CSS selector path

  • Each piece of data (name, price, stock, highlights) is handled by its own dedicated function, so updating one selector never breaks the others

  • Setting up logging (which this scraper already does) helps you spot when fields start returning "N/A" consistently — that's usually the first sign a selector needs updating

  • Treating CSS selectors as the main recurring maintenance cost of a scraper is a realistic expectation, since the rest of the pipeline usually stays stable
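Spotting stale selectors can be partly automated. The sketch below assumes each scraped product is a dictionary whose missing fields hold the string "N/A", as described above. The function names and the 0.8 threshold are hypothetical choices for illustration.

```python
def na_rate_by_field(products, placeholder="N/A"):
    """Return, per field, the share of records whose value is the placeholder."""
    counts = {}
    for product in products:
        for field, value in product.items():
            total, missing = counts.get(field, (0, 0))
            counts[field] = (total + 1, missing + (value == placeholder))
    return {field: missing / total for field, (total, missing) in counts.items()}

def fields_needing_review(products, threshold=0.8):
    """List fields whose placeholder rate is at or above the threshold."""
    rates = na_rate_by_field(products)
    return sorted(field for field, rate in rates.items() if rate >= threshold)

# Hypothetical batch in which the price selector has stopped matching
sample = [
    {"name": "Banana", "sale_price": "N/A"},
    {"name": "Apple", "sale_price": "N/A"},
    {"name": "N/A", "sale_price": "N/A"},
]
flagged = fields_needing_review(sample)
```

Running a check like this after each scrape points directly at the parsing function whose selector needs attention.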

