
How to Scrape Product Data from Family Food Centre? (Step-by-Step Python Guide)

  • Writer: Shahana farvin

Have you ever wondered if there's a way to automatically collect information from online shopping websites - without manually copying and pasting everything? That’s exactly what web scraping helps us do. It’s like teaching a computer to visit web pages and pick out the information you need, just like how you would—but much faster and without the effort.


In this blog, I’ll walk you through a simple web scraping project I worked on. The goal? To collect data from the fruits and vegetables section of the Family Food Centre website. If you’re not familiar, Family Food Centre is a well-known online supermarket in Qatar, offering a wide variety of fresh produce.


The scraping process was divided into two clear steps:


First, we needed to visit the fruits and vegetables category page and collect all the product links listed there.


Then, using those links, we visited each individual product page to gather detailed information—like the product name, price, and packaging details.


We've used the same two-phase approach on other grocery platforms too - check out how we did it for Blinkit.


Scrape Product Data from Family Food Centre: Tools Behind the Magic 


You might be wondering—how is web scraping even possible? How can a computer visit a website and pick out just the parts we care about?


Well, that’s where some smart tools come in. Throughout this project, I used a few simple but powerful tools that made the entire scraping process much easier. Think of them as a team working together—each with its own special job. Let’s meet them.


Playwright – The One Who Browses for You


Imagine you’re sitting in front of a website, clicking buttons, scrolling down, and waiting for new items to load. Now imagine handing that job over to a helper who can do all that for you—quickly and accurately. That’s what Playwright does.



Playwright is a tool that acts like a real browser. It can open web pages, click on things, and scroll through content, just like a human would. This is especially useful for websites that load more products only when you scroll down (you’ve probably seen this on shopping sites). Playwright takes care of that smoothly, without you lifting a finger.


BeautifulSoup – The One Who Finds What You Need


Once the page is fully loaded, it’s time to pick out the useful details—like product names, prices, or packaging info. That’s where BeautifulSoup comes in.


Think of a webpage as a messy room full of information. BeautifulSoup helps us search through that room and find exactly what we’re looking for. It takes the page’s HTML code (which is how websites are built behind the scenes) and lets us extract just the parts we care about.


SQLite – The One Who Keeps Everything Safe


After collecting all this information, we need a place to store it. Not in a notebook—but in a digital one called SQLite.


SQLite is a simple, lightweight database that runs on your own computer. You don’t need to install anything fancy or set up a server. It helps you save your data in a neat and organized way so you can come back to it later, run analysis, or even share it.


Getting Started: The Foundation Code


Now that everything is set up, it’s time to collect the actual product links. In this part, we go through the fruits and vegetables section of the website and gather links to each product listed there. These links will later help us visit each product page and pull out the detailed information we need. We’ll also handle multiple pages, just like a user clicking through “Next” to see more items. The steps below will walk you through how the script grabs those links and gets them ready for the next stage.


Setting Up Our Tools: The Import Section


Before jumping into the code, there’s one important step we need to take—setting up our tools. Each of these libraries plays a specific role in the process, and together, they help everything run smoothly.

import asyncio
import sqlite3
import datetime
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

Let’s start by understanding the tools we’re bringing into our project—and why they’re important.


First, we have asyncio. Think of it as the manager that keeps things moving behind the scenes. When our script is waiting for a page to load or some data to process, asyncio makes sure that time isn’t wasted. Instead of just sitting and waiting, it allows the program to keep working on other tasks. This helps our scraper run faster and more efficiently, especially when dealing with lots of web pages.
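To make that idea a little more concrete, here's a tiny, self-contained sketch (not part of the scraper itself) that shows how asyncio lets two waits overlap instead of running one after the other:

import asyncio

async def pretend_download(name, seconds):
    # Simulate waiting on a slow network response.
    await asyncio.sleep(seconds)
    return f"{name} finished after {seconds}s"

async def demo():
    # Both "downloads" wait at the same time, so this takes about 2 seconds, not 3.
    results = await asyncio.gather(
        pretend_download("page A", 2),
        pretend_download("page B", 1),
    )
    print(results)

asyncio.run(demo())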


Next, as mentioned earlier, we have sqlite3, our structured data storage system. Manually keeping track of hundreds or thousands of product links just isn't feasible, so SQLite steps in to store the scraped data efficiently. Alongside it, datetime stamps each record with the date it was collected, so we always know when the data was gathered. Together, these pieces form the backbone of our data storage solution.


Finally, the stars of our scraping process—Playwright and BeautifulSoup—collaborate to scan web pages and extract useful content, making the entire scraping task smooth and easy.


Global Settings: Our Project's Command Center

# Global configuration
DB_NAME = "product_links.db"
URL = "https://family.qa/default/produce/fruits-and-vegetables.html"

Every project needs a clear starting point—and some basic rules to follow. In our web scraping script, that structure comes from global variables. Think of them as fixed reference points that guide how and where the script works.


In this case, we define two key variables:

  • DB_NAME tells the script where to store the scraped data (our database file), and

  • URL tells it where to begin scraping (the starting web page).


Even though these values are simple and don’t change during the run of the script, they play a big role. They’re like signposts that point the program in the right direction.


Setting these values at the top of our script keeps things neat and easy to manage. For example, if we ever want to scrape a different section of the website, we just update the URL. Or if we want to save the data to a different file, we simply change the DB_NAME. There’s no need to dig through the whole code to make those changes.


By organizing our script this way, we make it easier to maintain, flexible to adapt, and ready to grow if we want to scale things up later.


Database Creation: Building Our Digital Warehouse

def create_database():
    """
    Creates and initializes an SQLite database for storing product links.
    
    Creates a 'products' table with the following schema:
        - id: INTEGER PRIMARY KEY AUTOINCREMENT
        - link: TEXT (stores the product URL)
        - date: TEXT (stores the scraping date)
        - scraped: INTEGER (flag to track processed links, default 0)
    
    Raises:
        sqlite3.Error: If there's an error creating the database or table
    """
    conn = None
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                link TEXT,
                date TEXT,
                scraped INTEGER DEFAULT 0
            )
        """)
        conn.commit()
    except sqlite3.Error as e:
        print(f"Database error: {e}")
    finally:
        if conn:
            conn.close()

Before we start collecting data, we need a proper place to store it—something organized and reliable. That’s where our create_database() function comes in. Its job is to set up a small, local database using SQLite, where all our product links will be stored safely.


The process begins by connecting to the database using sqlite3.connect(DB_NAME). If the database file doesn't exist yet, no worries—SQLite will automatically create one for us. This makes it very beginner-friendly and easy to work with.


Once connected, we create something called a cursor. You can think of the cursor as a messenger—it helps us send instructions (SQL commands) to the database.


Next, we ask the database to create a table called products—but only if it doesn’t already exist. This is done using the CREATE TABLE IF NOT EXISTS command. The table will have four columns:

  • id: a unique number for each entry (automatically increases for every new product),

  • link: where the product URL is stored,

  • date: which saves the date when the data was collected,

  • scraped: a flag to tell us whether the product has already been scraped. It starts with a value of 0, meaning "not yet scraped."


After setting everything up, we save the changes and close the connection to keep things clean and free up resources. If anything goes wrong during this process, the function will catch the error and print it out. That way, we can quickly figure out what happened and fix it.
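By the way, if you ever want to double-check that the table really was created, a quick standalone snippet like this one (separate from the scraper, and assuming product_links.db already exists in the current folder) lists the tables inside the database file:

import sqlite3

conn = sqlite3.connect("product_links.db")
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())  # e.g. [('products',)] once create_database() has run
conn.close()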


The Link Saving System: Our Data Archival Process

def save_links_to_db(links):
    """
    Saves product links to the SQLite database with the current date.
    
    Args:
        links (list): List of product URLs to be saved
    
    Notes:
        - Each link is saved with the current date and scraped=0 flag
        - Duplicate links are allowed to track historical data
        - Current date is stored in ISO format (YYYY-MM-DD)
    
    Raises:
        sqlite3.Error: If there's an error during database operations
    """
    if not links:
        print("No links to save.")
        return

    conn = None
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        current_date = datetime.date.today().isoformat()

        inserted_count = 0
        for link in links:
            cursor.execute("INSERT INTO products (link, date, scraped) VALUES (?, ?, 0)", (link, current_date))
            inserted_count += 1  # Count inserted links

        conn.commit()
        print(f"Inserted {inserted_count} links into the database.")
    except sqlite3.Error as e:
        print(f"Database insert error: {e}")
    finally:
        if conn:
            conn.close()

When we collect the product links from the website, the next step is to store them neatly in our database. That’s exactly what the save_links_to_db(links) function does—it takes a list of product URLs and saves each one into our SQLite database, tagging them with today’s date for reference.


The function starts by checking whether the list of links is empty. If there are no links to save, it simply prints a message and stops right there—no need to move forward if there’s nothing to do.


But if we do have links, the function connects to the database and sets up a cursor to send instructions. Then it uses datetime.date.today().isoformat() to get today’s date in a clean, standard format (like "2025-06-12"), which helps us keep track of when each link was added.


Each link is then added to the products table with that date. We also include a scraped value, which is set to 0 for now—this tells us that we haven’t yet scraped the full product details from this link. Think of it as a little note saying, “Hey, this one’s still waiting to be processed.”


The function keeps count of how many links it successfully saves and prints that out once it’s done, giving us helpful feedback. If something goes wrong—like a connection issue or an unexpected error—it catches the problem and prints it clearly so we can fix it. And just like good housekeeping, it makes sure to close the database connection at the end, no matter what.
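A quick design note: inserting the links one at a time is perfectly fine at this scale, but if you ever need to write a very large batch, sqlite3's executemany() can do the same job in a single call. Here's a small sketch of that variation, assuming the same products table we created above:

import datetime
import sqlite3

def save_links_bulk(links, db_name="product_links.db"):
    """Bulk-insert variant of save_links_to_db() using executemany()."""
    if not links:
        print("No links to save.")
        return
    current_date = datetime.date.today().isoformat()
    rows = [(link, current_date) for link in links]
    conn = sqlite3.connect(db_name)
    try:
        conn.executemany("INSERT INTO products (link, date, scraped) VALUES (?, ?, 0)", rows)
        conn.commit()
        print(f"Inserted {len(rows)} links into the database.")
    finally:
        conn.close()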


Link Extraction: Finding Treasures in HTML

def extract_product_links(html):
    """
    Extracts product links from the HTML content using BeautifulSoup.
    
    Args:
        html (str): Raw HTML content of the page
    
    Returns:
        list: List of extracted product URLs
    
    Notes:
        - Targets links with class 'product-item-link'
        - Filters out None/empty links
        - Uses BeautifulSoup's CSS selector for efficient parsing
    """
    soup = BeautifulSoup(html, "html.parser")
    product_links = []

    for a_tag in soup.select("a.product-item-link"):
        link = a_tag.get("href")
        if link:
            product_links.append(link)

    return product_links

To work with data from a website, we can’t just rely on the raw HTML—it’s messy, unorganized, and full of information we don’t need. That’s where structured parsing comes in. It helps us focus on just the pieces we care about. In this part of the process, we use a function called extract_product_links(html) to pull out only the product links from the HTML code we’ve collected.


Here’s how it works:

First, we take the big block of HTML content and hand it over to BeautifulSoup, a library that helps us understand and navigate the structure of a web page. It’s a bit like turning a scrambled document into a searchable map where we can zoom in on the parts we want.


Next, the function searches through this map to find all the <a> tags (these are the building blocks of links on a webpage). But we don’t want just any links—we’re looking specifically for ones that have a class named "product-item-link", which, in this website’s layout, are the ones that point to individual product pages.


Once it finds those tags, it uses .get("href") to grab the actual URLs hidden inside them. These links are added to a list, but only if they’re not empty (sometimes a tag might be there but missing the actual link, so we skip those).


In the end, the function returns a clean list of product links—nothing more, nothing less. This step is crucial because it filters out all the extra clutter and leaves us with just the data we need to move forward: the direct links to the product pages we want to explore next.
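To see what this looks like in practice, here's a tiny standalone demo using some made-up HTML (the real page is obviously much bigger, but the idea is the same):

from bs4 import BeautifulSoup

sample_html = """
<ul>
  <li><a class="product-item-link" href="https://family.qa/example-apple.html">Apple</a></li>
  <li><a class="product-item-link" href="https://family.qa/example-banana.html">Banana</a></li>
  <li><a class="footer-link" href="https://family.qa/about">About us</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
links = [a.get("href") for a in soup.select("a.product-item-link")]
print(links)
# ['https://family.qa/example-apple.html', 'https://family.qa/example-banana.html']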


Page Navigation: Our Automated Browser Control

async def fetch_all_pages():
    """
    Asynchronously fetches all paginated pages and extracts product links.
    
    Returns:
        list: Consolidated list of product URLs from all pages
    
    Notes:
        - Uses Playwright for browser automation
        - Implements pagination handling
        - Waits for network idle state to ensure page loads
        - Uses a 60-second timeout for initial page load
        - Handles browser cleanup in case of errors
    
    Raises:
        Exception: Any error during page navigation or scraping
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        
        all_links = []
        page_number = 1

        try:
            await page.goto(URL, timeout=60000)  # 60-second timeout for slow initial page loads
            await page.wait_for_load_state("networkidle")

            while True:
                content = await page.content()
                links = extract_product_links(content)
                all_links.extend(links)

                print(f"Page {page_number}: Scraped {len(links)} links.")  # Debugging output

                next_button = await page.query_selector("#layered-ajax-list-products > div:nth-child(3) > div.pages > ul > li.item.pages-item-next > a")
                if next_button:
                    await next_button.click()
                    await page.wait_for_load_state("networkidle")
                    page_number += 1
                else:
                    break

        except Exception as e:
            print(f"Scraping error: {e}")
        finally:
            await browser.close()

    print(f"Total Scraped: {len(all_links)} links.")  # Debugging output
    return all_links

To collect product links from multiple pages of a website, we use the fetch_all_pages() function. This function works like a tireless assistant that mimics how a human browses—clicking through each page and saving useful links along the way. It uses Playwright, which opens a browser in the background (called "headless" mode, since it doesn't actually show the browser window). Once the browser opens, it visits the starting URL and waits for the entire page to finish loading. This is important because many websites load content dynamically as you scroll or wait.


After the page loads, the function grabs the HTML content and sends it to another helper function, extract_product_links(), which finds all the product URLs on that page. These links are saved into a growing list. Then, the function checks if there’s a “Next” button on the page. If it finds one, it clicks on it—just like a real user would—and waits for the next page to load fully. This cycle repeats: scrape, click “Next,” wait, and repeat. If there is no “Next” button, the function understands it has reached the last page and stops.


Throughout the process, it prints helpful updates, like how many links were found on each page. Finally, once all pages are scraped, the browser is closed to save memory, and the full list of product links is returned. This way, we make sure no product is left behind—even if it's hidden on page 10!
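One small caveat before we move on: the long CSS selector used to locate the "Next" button is tied very closely to the page's current layout, so even a minor redesign could break it. If that ever happens, a shorter selector that targets the pagination class names directly is usually easier to maintain. Here's a hedged alternative, assuming the site keeps the "pages-item-next" class on its pagination item (we haven't verified this against every page):

async def find_next_button(page):
    # Shorter, layout-independent way to locate the pagination "Next" link.
    # Assumption: the <li> wrapping the link keeps the "pages-item-next" class.
    return await page.query_selector("li.pages-item-next > a")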


The Main Orchestra Conductor

async def main():
    """
    Main execution function that orchestrates the scraping process.
    
    Flow:
        1. Creates/initializes the database
        2. Fetches all product links
        3. Saves the links to database
        4. Reports success/failure
    
    Notes:
        - Handles the case when no links are scraped
        - Provides feedback about the operation
    """
    create_database()
    links = await fetch_all_pages()
    
    if links:
        save_links_to_db(links)
        print(f"Saved {len(links)} product links to the database with the current date.")
    else:
        print("No links were scraped.")

if __name__ == "__main__":
    asyncio.run(main())

The main() function is like the project manager of our entire web scraping workflow—it controls the order of execution and ensures everything runs smoothly. It starts by making sure our data storage is ready using create_database(). This step sets up the SQLite database where all product links will be saved, ensuring we have an organized place to store the data.


Next, it kicks off the core scraping process by calling the fetch_all_pages() function. This function runs asynchronously, which means it can handle multiple tasks efficiently without waiting for each step to finish completely before moving on. After scraping, the function checks whether any product links were actually found. If links are available, it saves them into the database using save_links_to_db(links) and prints a confirmation message, letting us know how many links were stored. If no links are found, it prints a message to let us know that the process returned no data—useful for debugging or making improvements later.


At the bottom of the script, there's a standard Python pattern: if __name__ == "__main__":. This ensures the main() function only runs if the script is executed directly (not when imported into another module). Inside this block, asyncio.run(main()) is called to handle all asynchronous operations efficiently. This ensures that our scraping happens smoothly, without blocking or freezing up the rest of the script.


From Links to Details: Diving Deep into Product Data


Now that we have successfully gathered all the product links, we can move on to the crucial next phase—extracting detailed product information. This step involves visiting each collected link and carefully retrieving the relevant data directly from the individual product pages. Details such as the product name, price, and packaging are systematically captured in this stage. In the following sections, we’ll explore how this process is structured in code, ensuring efficient navigation and reliable data extraction across the entire product catalog.


Setting Up Our Data Collection Tools


Just like before, we begin by importing the necessary libraries that facilitate seamless data extraction:

import asyncio
import sqlite3
from datetime import datetime
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

# Configuration constant for database
DB_NAME = "product_links.db"

We’ll continue using the same tools as before, but now we’re shifting our focus. Previously, we used them to find product links. This time, we’ll use them to open each link and collect detailed information from the product pages.


One thing that stays the same is how we store the data. We’re still using the same database we set up earlier. This helps keep everything organized in one place. By sticking to a consistent format for storing our data, we make it easier to keep track of what we’ve collected and avoid duplicates or confusion.


Think of the product links as doors. Each door leads to a page full of information, and now we’re ready to step through each one, gather what we need, and neatly file it away in our database. This structured approach ensures that everything we collect is easy to find and manage later on.


Building Our Product Information Warehouse

def create_data_table():
    """
    Creates a table in SQLite database to store detailed product information.
    
    Schema:
        - id: INTEGER PRIMARY KEY AUTOINCREMENT
        - link: TEXT (product URL)
        - item_name: TEXT (name of the product)
        - packing: TEXT (packaging information)
        - sku: TEXT (product SKU)
        - price: TEXT (product price)
        - category: TEXT (product category)
        - date: TEXT (date of scraping)
    
    Raises:
        sqlite3.Error: If database operations fail
    
    Notes:
        - Uses error handling to manage database connection
        - Creates table only if it doesn't exist
    """
    conn = None
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS product_data (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                link TEXT,
                item_name TEXT,
                packing TEXT,
                sku TEXT, 
                price TEXT,
                category TEXT,
                date TEXT
            )
        """)
        conn.commit()
    except sqlite3.Error as e:
        print(f"[ERROR] Database table creation failed: {e}")
    finally:
        if conn:
            conn.close()

Now that we're ready to store detailed product information, we need a proper place to keep it all. That’s where this function comes in—it sets up a dedicated table inside our SQLite database to hold the data we’re about to collect.


First, the function opens a connection to our database. Think of this like opening the door to a storage room. Once inside, it gets ready to create a table by using something called a cursor. This cursor is like a tool that lets us send instructions to the database.


The main instruction we give is to create a table named product_data. We use a special command that checks whether the table already exists—if it does, we don’t create it again. If not, the table gets created with all the columns we need to hold our product details.


Each row in this table will store one product’s information. The columns include things like the product link, its name, how it's packaged, its SKU (which is just a unique code used to track items), price, category, and the date we scraped the information. There’s also a column for an ID number that automatically counts up for each new row—this helps us keep everything in order.


To make sure everything goes smoothly, the function also watches for any errors. If something goes wrong—like a problem with the database connection—it prints an error message to help us figure out what happened. And no matter what, it always closes the connection at the end, just like locking the door behind you when you leave the storage room. This keeps everything clean and avoids unnecessary memory use.


Finding Products We Haven't Explored Yet

def get_unscraped_links():
    """
    Retrieves all unprocessed product links from the database.
    
    Returns:
        list: Tuples containing (link, date) for unscraped products
    
    Notes:
        - Queries links where scraped=0 flag
        - Returns empty list if query fails
        - Each tuple contains the product URL and its original scrape date
    """
    conn = None
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        cursor.execute("SELECT link, date FROM products WHERE scraped = 0")
        links = cursor.fetchall()
        return links
    except sqlite3.Error as e:
        print(f"[ERROR] Failed to fetch unscraped links: {e}")
        return []
    finally:
        if conn:
            conn.close()

This function plays an important role in helping us keep track of our progress while scraping. It’s like checking a list to see which tasks are still left to do. Specifically, it looks through our database and pulls out only the product links we haven’t worked on yet.


Let’s walk through what it does step-by-step.


First, the function connects to the SQLite database—think of this like opening a file where we’ve saved all the product links. Then, it creates a cursor, which acts like a pen that can write or read inside this file. Using that cursor, the function runs a query that says: “Give me all the product links from the products table where the scraped value is 0.” In our setup, a scraped value of 0 means that the product hasn’t been processed yet. This simple check helps us avoid re-scraping the same product pages and keeps everything running efficiently.


The results from the query come back as a list of tuples—each tuple includes the link to the product and the date it was added. This makes it easy for us to keep both pieces of information together and refer back to them later if needed.


To make the function more reliable, we’ve added some error handling. If something goes wrong—like a glitch in the database or a typo in the query—the function will catch the error, print a helpful message, and safely return an empty list. This way, the rest of the scraping process won’t crash, and we’ll know something needs our attention.


Finally, once everything is done, the database connection is closed to tidy up and free any resources. Just like shutting a drawer after you’ve finished looking through your papers, this step helps keep the system clean and ready for the next task.


Keeping Track of Our Progress

def update_scraped_status(link):
    """
    Marks a product link as scraped in the database.
    
    Args:
        link (str): URL of the product that has been processed
    
    Notes:
        - Updates scraped flag to 1 for the given link
        - Uses parameterized query for SQL injection prevention
        - Includes error handling for database operations
    """
    conn = None
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        cursor.execute("UPDATE products SET scraped = 1 WHERE link = ?", (link,))
        conn.commit()
    except sqlite3.Error as e:
        print(f"[ERROR] Failed to update scraped status for {link}: {e}")
    finally:
        if conn:
            conn.close()

This function helps us keep our scraping workflow neat and organized by marking each product link as “done” once it’s been processed. Think of it like checking off items on a to-do list so we don’t accidentally work on the same thing twice.


First, it connects to the SQLite database—this is like opening the notebook where we’ve been storing all our product links. Then, it creates a cursor, which acts like a tool for writing into that notebook.


Next comes the update part. The function runs a special command that says: “Find this specific product link and change its status so that it’s marked as scraped.” Technically, it sets the scraped field to 1, which means “this one’s done.” To do this safely, it uses something called a parameterized query—which just means that instead of sticking the link directly into the SQL command (which can be risky), it uses a placeholder and fills it in separately. This method helps prevent errors or even security problems like SQL injection.
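To make the difference concrete, here's a small standalone example (illustrative only, with a made-up link) using an in-memory database. Building the SQL string by hand breaks as soon as the value contains an apostrophe, while the placeholder version handles it safely:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (link TEXT, scraped INTEGER DEFAULT 0)")

link = "https://family.qa/farmer's-market-apples.html"  # made-up link with an apostrophe
conn.execute("INSERT INTO products (link, scraped) VALUES (?, 0)", (link,))

# Risky: gluing the value into the SQL string fails here (the apostrophe breaks the
# query) and, with malicious input, could even rewrite the query itself.
# conn.execute(f"UPDATE products SET scraped = 1 WHERE link = '{link}'")

# Safe: the ? placeholder keeps the value separate from the SQL text.
conn.execute("UPDATE products SET scraped = 1 WHERE link = ?", (link,))
conn.commit()
print(conn.execute("SELECT link, scraped FROM products").fetchall())
conn.close()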


Once the update is made, the function saves the change permanently by committing the transaction. That’s like hitting “Save” after editing a document, so the changes don’t get lost.


To make sure everything runs smoothly, the function includes error handling. If anything goes wrong—maybe the link wasn’t found or the database connection failed—it prints an error message that tells us which link had trouble. That way, we know exactly where to look if we need to fix something.


Finally, the function closes the database connection. This is a good habit to keep things tidy and avoid leaving connections open that might slow things down.


Saving Our Product Treasures

def save_product_data(link, item_name, packing, sku, price, category, date):
    """
    Stores extracted product information in the database.
    
    Args:
        link (str): Product URL
        item_name (str): Name of the product
        packing (str): Packaging information
        sku (str): Product SKU
        price (str): Product price
        category (str): Product category
        date (str): Scraping date
    
    Notes:
        - Uses parameterized queries for safe data insertion
        - Logs successful saves and errors
        - Ensures database connection is properly closed
    """
    conn = None
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        cursor.execute("""
            INSERT INTO product_data (link, item_name, packing, sku, price, category, date)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (link, item_name, packing, sku, price, category, date))
        conn.commit()
        print(f"[INFO] Saved product: {item_name}")
    except sqlite3.Error as e:
        print(f"[ERROR] Failed to save product data for {link}: {e}")
    finally:
        if conn:
            conn.close()

This function plays a key role in saving the detailed product information we’ve collected into our database. Once we’ve scraped data like the product’s name, price, and category, we need a way to store it all in one place so we can refer to it later—and that’s exactly what this function does.


Let’s walk through it step by step.


First, the function takes in a bunch of details about a product: the link to the product page, its name, how it’s packaged, its SKU (that’s a unique product code), the price, which category it belongs to, and the date we collected the data. All of this information gets passed to the function as inputs.


Next, it connects to our SQLite database—think of this as opening up a storage box where we keep all our product records. After connecting, it creates something called a cursor, which is what we use to “write” into the database.


Now comes the important part: saving the data. The function uses an INSERT INTO command, which is like saying, “Hey database, here’s a new row of product info—please add it to the table.” To keep things secure and clean, it uses parameterized queries. This simply means the actual values (like the name and price) are added separately, which helps avoid mistakes or security issues like SQL injection.


If everything goes smoothly, the function prints a message to let us know the data was added successfully. But if something goes wrong—like if the database is locked or there’s a typo in the data—it logs an error message so we know what happened and can fix it.


Finally, just like closing a file when you’re done reading or writing, the function closes the database connection. This is a good habit because it frees up memory and keeps the system running efficiently.


The Text Extraction Helper

def extract_text(soup, selector):
    """
    Safely extracts text content from HTML using CSS selectors.
    
    Args:
        soup (BeautifulSoup): Parsed HTML content
        selector (str): CSS selector for target element
    
    Returns:
        str: Extracted text or 'N/A' if element not found
    
    Notes:
        - Returns 'N/A' instead of None for missing elements
        - Strips whitespace from extracted text
    """
    element = soup.select_one(selector)
    return element.text.strip() if element else "N/A"

This small but handy function helps us pull out specific bits of text from a webpage’s HTML. When we’re scraping data from a site, the page is often filled with lots of HTML code, and we just want a specific piece—like a product name or a price. That’s where this function comes in.


Imagine you already have the HTML of a webpage loaded using BeautifulSoup—a popular Python tool that makes it easier to work with HTML. You also know the CSS selector for the exact item you want to grab. For example, you might say, “I want the text inside this particular <div> or <span>.”


The function takes these two things: the soup object (which is just the HTML you’ve already parsed), and the selector (which tells it what to look for). It uses soup.select_one(selector) to find the first match. If it finds the element, it grabs the text inside, removes any unnecessary spaces at the beginning or end, and returns it.


But what if the element doesn’t exist on the page? Instead of giving back a confusing None, the function simply returns "N/A". This is helpful because it keeps your data clean and consistent. Later on, if you’re working with a spreadsheet or a database, it’s much easier to handle missing values when they’re marked clearly as "N/A" rather than as blanks or errors.
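Here's a quick standalone demonstration of that behaviour with a snippet of made-up HTML: the first selector matches an element, the second doesn't, so we get real text for one and "N/A" for the other:

from bs4 import BeautifulSoup

def extract_text(soup, selector):
    element = soup.select_one(selector)
    return element.text.strip() if element else "N/A"

sample_html = '<div class="price-box"><span class="price">  QAR 5.00  </span></div>'
soup = BeautifulSoup(sample_html, "html.parser")

print(extract_text(soup, "span.price"))       # QAR 5.00
print(extract_text(soup, "div.product.sku"))  # N/A (no such element in this snippet)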


The Product Data Detective

def parse_product_data(html):
    """
    Extracts product details from HTML content.
    
    Args:
        html (str): Raw HTML content of product page
    
    Returns:
        tuple: Contains (item_name, packing, sku, price, category)
    
    Notes:
        - Uses BeautifulSoup for HTML parsing
        - Employs CSS selectors for precise element targeting
        - Returns 'N/A' for missing data fields
    """
    soup = BeautifulSoup(html, "html.parser")
    # Extract each product detail using CSS selectors
    item_name = extract_text(soup, "div.page-title-wrapper.product > h2 > span")
    packing = extract_text(soup, "div.product-info-price > div > div.stock.available > span")
    sku = extract_text(soup, "div.product-info-price > div > div.product.attribute.sku > div")
    price = extract_text(soup, "span.price")
    category = extract_text(soup, "div.product-page-brand-common-view > ul > li > a:nth-child(2)")
    
    return item_name, packing, sku, price, category

This method plays a central role in transforming messy raw HTML into clean, structured product data. Think of a web page as a complex puzzle full of information—we only want a few specific pieces, like the product name, packaging, SKU, price, and category. This method helps us pull out just those important parts in a neat, organized way.


Here's how it works: First, it uses BeautifulSoup, a Python library that makes it easier to work with HTML, almost like giving us X-ray vision to see through the web page’s code. Once we have the HTML "decoded," the method calls the extract_text() function multiple times. Each call is aimed at a different piece of product info—for example, one call might pull out the product’s name, while another grabs the price or SKU.


Each of these values is picked out using CSS selectors, which work like directions telling the code exactly where to look in the HTML. It’s similar to saying, “Look inside this box, under that label, and grab whatever text you find.”


Once all the fields are collected, the method bundles them into a tuple, which is like a little package of information that keeps all the product details together and in the correct order. If any field happens to be missing on the page, it automatically fills in "N/A" to keep things consistent—so we never end up with holes or mismatched records.


This structured format is really helpful when you later want to save the data into a database or display it neatly elsewhere. Everything is in place, and you know exactly what to expect—making your data clean, predictable, and easy to work with.


The Individual Page Scraper

async def scrape_product_page(page, link, date):
    """
    Scrapes individual product page and saves extracted data.
    
    Args:
        page (Page): Playwright page object
        link (str): URL of the product to scrape
        date (str): Date when the link was originally found
    
    Notes:
        - Uses 60-second timeout for page loading
        - Validates extracted data before saving
        - Includes comprehensive error handling
        - Updates scraped status after successful extraction
    """
    try:
        await page.goto(link, timeout=60000)  # 60 seconds timeout
        await page.wait_for_load_state("networkidle")

        # Extract and parse product data
        content = await page.content()
        item_name, packing, sku, price, category = parse_product_data(content)

        # Validate essential data before saving
        if item_name == "N/A" and price == "N/A":
            print(f"[WARNING] No valid data found for {link}. Skipping...")
            return
        
        # Save data and update status
        save_product_data(link, item_name, packing, sku, price, category, date)
        update_scraped_status(link)
        print(f"[INFO] Scraped: {item_name} | Price: {price}")
    
    except Exception as e:
        print(f"[ERROR] Failed to scrape {link}: {e}")

This asynchronous method is like a dedicated worker that focuses on collecting detailed information from one specific product page at a time. It’s designed to handle modern, dynamic websites where content might take a moment to fully load.


First, it receives three important pieces of information: a Playwright page object (which helps it interact with the web page like a real browser), the product's URL, and the date when this scraping attempt is happening.


The method begins by visiting the product page using Playwright’s goto() function. It gives the page up to 60 seconds to load—this is especially useful when dealing with pages that take longer due to animations, pop-ups, or slow servers. It also waits until the network is quiet, meaning everything on the page (like images and scripts) has had time to finish loading. This ensures that all the product details are present before we start collecting them.


Once the page is fully loaded, it grabs the raw HTML content and sends it over to another helper method called parse_product_data(). This helper organizes the messy HTML into neat product details like name, price, packaging, and so on.


Before saving anything, the method checks to make sure the most important details—like the product name and price—are actually there. If either is missing, it skips saving that product and logs a warning. This prevents the database from being cluttered with incomplete or useless records.


If all the required information is present, it proceeds to store the data in the database and updates the product's status so we know it’s already been processed. Along the way, if anything goes wrong—like the page doesn’t load or the data can’t be saved—it handles the issue gracefully and logs clear error messages. That way, you can figure out what went wrong without the whole scraping process breaking down.


The Grand Orchestra of Data Collection


Now we bring all the pieces together into a smooth, coordinated scraping workflow. Each function plays its part—fetching unprocessed links, visiting product pages, extracting clean data, and saving it to the database. Like a well-oiled machine, the system handles errors gracefully and keeps the process moving. This is where all our careful setup pays off, creating a reliable, automated pipeline for product data collection.


The Product Collection Conductor

async def scrape_all_products():
    """
    Orchestrates the scraping of all unprocessed product pages.
    
    Notes:
        - Retrieves unscraped links from database
        - Launches headless browser instance
        - Processes each product page sequentially
        - Ensures proper cleanup of browser resources
        - Includes error handling and logging
    """
    links = get_unscraped_links()
    if not links:
        print("[INFO] No new products to scrape.")
        return
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        for link, date in links:
            print(f"[INFO] Scraping {link}...")
            await scrape_product_page(page, link, date)

        await browser.close()

This method takes charge of the entire scraping process, acting like the conductor of our data-gathering operation. Its main job is to go through every product link that hasn’t been scraped yet and process them one by one.


It all starts by calling a function named get_unscraped_links(). This function checks the database and returns a list of product URLs that haven’t been handled yet. If it turns out there are no more fresh links to scrape, the method politely prints a message and exits early. This prevents unnecessary work and saves time and resources.


When there are links to work with, the real scraping begins. The method launches a browser—specifically, a lightweight, invisible (or “headless”) version of Chromium using Playwright. It then loops through each unscraped link, visiting each page and calling another function, scrape_product_page(), to extract the product’s information.


This part of the code is asynchronous, which means it can handle tasks more efficiently by not waiting around while pages load. Instead, it keeps things moving, making the whole process faster and more responsive.


Once all the links have been processed and every bit of information has been gathered and saved, the browser is closed. This final step is important because it releases system resources and ensures everything is tidy when the job is done.
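If you ever need more throughput, the same building blocks can be rearranged so that several product pages are scraped at the same time. Below is a hedged sketch of that idea, reusing the scrape_product_page() function from earlier: it opens a few tabs and uses a semaphore to cap how many pages are in flight at once. This isn't part of the scraper described in this post, and before running anything like it at scale you'd want to add polite delays and respect the site's terms of use:

import asyncio
from playwright.async_api import async_playwright

async def scrape_concurrently(links, max_tabs=3):
    """Sketch: process (link, date) pairs a few at a time instead of strictly one by one."""
    semaphore = asyncio.Semaphore(max_tabs)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def worker(link, date):
            async with semaphore:  # limit how many tabs are open at once
                page = await browser.new_page()
                try:
                    await scrape_product_page(page, link, date)
                finally:
                    await page.close()

        await asyncio.gather(*(worker(link, date) for link, date in links))
        await browser.close()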


The Grand Finale: Our Main Function

async def main():
    """
    Main execution function that coordinates the scraping process.
    
    Flow:
        1. Initializes database table
        2. Scrapes all unprocessed products
        3. Logs completion status
    
    Notes:
        - Asynchronous execution using asyncio
        - Provides feedback about operation progress
    """
    create_data_table()
    await scrape_all_products()
    print("[INFO] Scraping completed.")

if __name__ == "__main__":
    asyncio.run(main())

The main() function serves as the central starting point of the entire scraping operation—it’s where the process truly begins.


It starts with a foundational step: calling create_data_table(). This ensures that the database is set up correctly, with the appropriate table(s) ready to receive incoming product data. Without this, the scraping process would lack a place to store its results, making this an essential initial task.


After preparing the database, main() moves on to its core responsibility—kicking off the actual data collection. It does this by invoking scrape_all_products(), the orchestrated method that handles scraping each individual product link. Since scraping involves asynchronous operations (to make the process faster and more responsive), Python’s asyncio.run(main()) is used when calling the main function. This allows asynchronous tasks to execute smoothly in a controlled event loop.


Finally, once all product pages have been visited and the data is safely stored, the function prints a success message. This serves as a confirmation that everything went according to plan—the data pipeline ran, the information was captured, and the system operated as expected.


Conclusion


By following this carefully organized process, we've built a dependable system that can automatically collect product data from Family Food Centre. Each step—from setting up a clean and reliable database to using tools like Playwright and BeautifulSoup—works together to keep everything running smoothly and accurately. Along the way, we’ve also added checks and messages to help us monitor progress and avoid collecting the same data twice.


One of the key strengths of this setup is that it uses asynchronous programming. The script never sits idle while a page is loading, and the same structure can be extended to process several pages concurrently, which keeps the whole system fast and scalable. Whether we’re working with just a few products or thousands, this structure holds up well and doesn’t need constant attention.


In the end, what we’ve created is more than just a scraper—it’s a solid foundation for collecting useful data in a smart, repeatable way. This opens the door to doing more with the information we gather, like tracking prices, studying market trends, or even making informed business decisions. It’s a practical and powerful tool for anyone looking to turn web data into real-world insight.


AUTHOR


I’m Shahana, a Data Engineer at Datahut, where I specialize in building smart, scalable data pipelines that transform messy web data into structured, usable formats—especially in domains like retail, e-commerce, and competitive intelligence.


At Datahut, we help businesses across industries gather valuable insights by automating data collection from websites, even those that rely on JavaScript and complex navigation. In this blog, I’ve walked you through a real-world project where we created a robust web scraping workflow to collect product information efficiently using Playwright, BeautifulSoup, and SQLite. Our goal was to design a system that handles dynamic pages, pagination, and data storage—while staying lightweight, reliable, and beginner-friendly.


If your team is exploring ways to extract structured product or pricing data at scale—or if you're just curious how web scraping can support smarter decisions—feel free to connect with us using the chat widget on the right. We’re always excited to share ideas and build custom solutions around your data needs.


FAQ SECTION


1. What tools do you need to scrape product data from Family Food Centre?

  • Playwright — to automate browser actions like scrolling and clicking "Next" on paginated pages

  • BeautifulSoup — to parse the loaded HTML and extract product details like name, price, and packaging

  • SQLite — to store scraped links and product data locally in an organized database

  • asyncio — to run the scraping process asynchronously, making it faster and more efficient

  • Python's datetime module — to timestamp each record for tracking when data was collected


2. Why is the scraping process split into two phases?

  • Phase 1 focuses only on collecting product URLs from the category listing pages

  • This avoids overloading the scraper by separating link discovery from data extraction

  • Phase 2 visits each saved URL individually to pull detailed information like price, SKU, and packing

  • Splitting phases makes it easier to resume if the scraper stops midway — unscraped links remain in the database with a scraped = 0 flag

  • It also makes debugging simpler since each phase can be tested and fixed independently


3. How does the scraper handle multiple pages on the Family Food Centre website?

  • Playwright loads the first page and waits for it to reach a networkidle state before extracting links

  • The scraper then looks for a "Next" button using a CSS selector specific to the site's pagination structure

  • If the button is found, it is clicked automatically and the next page is allowed to fully load

  • This loop continues until no "Next" button is found, signaling the last page has been reached

  • A page counter prints progress updates so you can monitor how many pages and links have been scraped


4. How is scraped data stored and tracked to avoid duplicates?

  • All product links are saved in a products table in an SQLite database with a scraped column defaulting to 0

  • Detailed product info (name, price, SKU, packing, category) is stored separately in a product_data table

  • Once a product page is successfully scraped, its scraped flag is updated to 1 using the update_scraped_status() function

  • On every new run, get_unscraped_links() queries only rows where scraped = 0, so already-processed links are never revisited

  • Each record is also timestamped with the scrape date for historical tracking and analysis


5. What happens if a product page fails to load or has missing data?

  • Every scraping function is wrapped in a try/except block to catch errors gracefully without crashing the entire script

  • If a page times out (beyond the 60-second limit), the error is logged and the scraper moves on to the next link

  • Before saving, the script checks whether both item_name and price return "N/A" — if so, that product is skipped with a warning

  • Missing individual fields (like SKU or packing) are stored as "N/A" to keep the database consistent and query-friendly

  • All errors are printed to the console with the specific link that caused the issue, making it easy to identify and retry problem URLs


Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
