
How to Scrape Data from Noon’s Fragrance Store?

  • Writer: Shahana farvin

Have you ever wondered how to collect product information from online stores without copying everything by hand? In this blog, I’ll walk you through a simple project where we gather data from Noon, a well-known shopping website. We’ll be focusing on fragrance products—and by the end, you’ll see how we can collect, clean, and make sense of that data using a bit of Python code.


Web scraping is just a way of telling the computer, “Hey, go to this website and bring me back the information I need.” Instead of manually going through hundreds of pages and copying prices or product names, we can write a small program that does it for us—faster and more accurately. This technique is useful for anyone who wants to track prices, compare brands, or study how the market changes over time.


We chose Noon because it’s one of the top online stores in the Middle East, and it has a wide variety of fragrances to explore. That makes it a great example to practice with. Fragrance products also show interesting trends when we look at them closely—things like brand popularity, price ranges, and customer ratings.


To keep things simple, we’ll split the scraping process into two main steps:

  1. Collecting the product links – First, we’ll go through the fragrance category and grab the individual product page links. These links are important because they’ll guide us to where the real details are.

  2. Extracting the product details – Next, we’ll visit each of those product pages and pull out the information we care about, like the brand, price, rating, and description.


Breaking it into two steps makes our code cleaner and easier to manage if something goes wrong along the way.


By the time you finish this guide, you’ll not only have working code but also a clear idea of how to scrape product data from Noon—or even any other website with a similar structure.


So, let’s dive in and see how to scrape data from Noon’s fragrance store, putting all the pieces together into a simple but powerful data scraping workflow!


Scrape Data from Noon’s Fragrance Store: Link Harvesting


Now let’s turn to the first part of our web scraping code. This code scrapes product URLs from the fragrance category of Noon’s website using Python, combining asynchronous programming with browser automation.


Essentially, the code moves through all the various fragrance categories, collecting all the product links, regardless of how many pages of product links there may be. In addition, the code includes error logging for tracking any issues, and it saves all the links in a SQLite database for future retrieval in the next part of the data collection process.


Next, we’ll move on to the part where we visit these product pages and collect the data we’re really after.


Setting Up the Foundation: Imports and Database Configuration

import asyncio
from playwright.async_api import async_playwright
import sqlite3
from bs4 import BeautifulSoup
import logging

# Setup logging
logging.basicConfig(filename="scraping_errors.log", level=logging.ERROR)

Before we dive into writing the scraping logic, the script starts by importing a few important Python libraries. Think of these like tools in a toolbox—each one has a specific job to help us get things done.


First, we have asyncio. This lets us run several tasks at the same time, which means we can scrape many pages faster instead of waiting for each one to finish before starting the next.


Next is playwright. This tool allows us to control a real web browser with code. It’s especially helpful for websites like Noon, where some content loads only when you interact with the page—just like a human would.


Then comes sqlite3, which helps us store all the product links we collect in a small, local database. This makes it easy to keep our data organized and ready to use later.


We also use BeautifulSoup, a simple but powerful library for reading and picking out specific parts of a web page. It’s like using a magnifying glass to zoom in on the exact bits of text or data we need.


Finally, logging is there to keep track of anything that goes wrong. If something fails while the script is running, logging will make a note of it so we can understand and fix the issue later.


Together, these libraries give us everything we need to build a scraper that’s both efficient and dependable.

# SQLite database setup
def setup_db():
    """
    Set up the SQLite database connection and create the product_urls table if it doesn't exist.
    
    Returns:
        tuple: A tuple containing (connection, cursor) objects for database operations.
    
    The database schema includes:
        - id: Auto-incrementing primary key
        - category: The fragrance category (women, men, unisex)
        - product_url: The complete URL to the product page
    """
    conn = sqlite3.connect("product_urls.db")
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS product_urls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            category TEXT,
            product_url TEXT
        )
    """)
    conn.commit()
    return conn, cursor

Now that we’ve got our tools ready, the next step is to set up a place to store the product URLs we’ll be collecting. For that, we’re using SQLite, a small local database. This helps us keep our data safe and organized, even after the script finishes running.


We create this setup using a simple function that builds a table named product_urls. Think of this table like a spreadsheet with three columns:

  1. ID – A unique number that goes up automatically each time a new product is added. This helps us keep track of how many links we’ve collected.

  2. Category – Whether the product is from the men’s, women’s, or unisex fragrance section.

  3. Product URL – The actual link to the product page.


By storing everything in this format, it becomes much easier to sort, search, or filter the data later—especially if we want to look only at certain categories.


Also, keeping the scraping and analysis parts separate like this makes our workflow cleaner. We collect the data first, store it safely, and then later we can use it for analysis without having to scrape again.
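Once links start landing in the table, plain SQL makes them easy to explore. Here’s a minimal sketch using an in-memory database and a few made-up rows in place of the real product_urls.db:

```python
import sqlite3

# Use an in-memory database for this sketch; the real script writes
# to "product_urls.db" with the same schema.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS product_urls (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        category TEXT,
        product_url TEXT
    )
""")

# Hypothetical sample rows standing in for scraped links.
rows = [
    ("women", "https://www.noon.com/uae-en/p1"),
    ("men", "https://www.noon.com/uae-en/p2"),
    ("women", "https://www.noon.com/uae-en/p3"),
]
cursor.executemany(
    "INSERT INTO product_urls (category, product_url) VALUES (?, ?)", rows
)
conn.commit()

# Count links per category -- the kind of filter that's easy with SQL.
cursor.execute(
    "SELECT category, COUNT(*) FROM product_urls GROUP BY category ORDER BY category"
)
print(cursor.fetchall())  # [('men', 1), ('women', 2)]
```

The same pattern works for filtering by category or de-duplicating links before the detail-scraping stage.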


Browser Initialization: Setting Up Playwright

# Function to initialize the browser and page
async def init_browser():
    """
    Initialize Playwright browser and page with custom headers to mimic a real user.
    
    Returns:
        tuple: A tuple containing (browser, page) objects for web automation.
        If initialization fails, returns (None, None).
    
    The function:
        1. Starts a Playwright instance
        2. Launches a non-headless Chromium browser (visible UI)
        3. Creates a new page with custom headers to avoid detection as a bot
    
    Exceptions are logged to the error log file.
    """
    try:
        playwright = await async_playwright().start()
        browser = await playwright.chromium.launch(headless=False)
        page = await browser.new_page()

        # Set extra HTTP headers for all requests
        await page.set_extra_http_headers({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
        })

        return browser, page
    except Exception as e:
        logging.error(f"Error initializing browser: {str(e)}")
        return None, None

Next, we move on to setting up the browser that will do the actual visiting and clicking around on the Noon website. This is done through a function called init_browser(), which uses Playwright to launch a browser we can control with code.


Playwright is especially helpful because it behaves just like a real web browser. That means it can load pages that rely on JavaScript—just like Noon’s product listings, where the content doesn’t appear right away but loads after the page finishes rendering.


Inside this function, we also customize the HTTP headers—specifically the User-Agent. This is a small detail that tells the website what kind of browser is visiting. By setting it to mimic a real user’s browser (like Chrome or Firefox), we reduce the chances of being blocked or flagged as a bot.


The function gives us back two things:

  • The browser – this is like the full window of the browser we launched.

  • The page – this is like a single tab where we’ll open and interact with different product pages.


Lastly, the function includes error handling, so if something goes wrong while setting up the browser, we’ll get a clear message in the logs. That way, we can troubleshoot easily without guessing what failed.


Resource Management: Properly Closing Browser Sessions

# Function to close the browser
async def close_browser(browser):
    """
    Safely close the Playwright browser instance.
    
    Args:
        browser: The Playwright browser instance to close.
    
    Exceptions during closing are logged to the error log file.
    """
    try:
        await browser.close()
    except Exception as e:
        logging.error(f"Error closing browser: {str(e)}")

Once we’re done scraping, we need to shut things down properly. That’s where the close_browser() function comes in—it closes the Playwright browser when we’ve finished collecting all the data.


While closing a browser might sound like a small task, it’s actually quite important. If we leave the browser open, it keeps using up memory and system resources, which can slow things down—especially if the script runs for a long time or visits hundreds of pages.


To handle this properly, the function uses a try-except block. This means it tries to close the browser as expected, but if something goes wrong, it catches the error and logs it. That way, we’re not left guessing why the script didn’t finish cleanly, and we can go back later to fix any issues if needed.


Managing resources like this is a smart habit to build. It keeps our script efficient, avoids system slowdowns, and makes sure everything runs smoothly even with large or long-running scraping tasks.
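The close-even-when-something-fails idea boils down to a try/finally block. Here’s a browser-free sketch, with a stand-in FakeBrowser class that is purely hypothetical and exists only to illustrate the pattern:

```python
import asyncio

# A stand-in "browser" to illustrate the cleanup pattern; these names
# are hypothetical, not part of the original script.
class FakeBrowser:
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

async def scrape_with_cleanup(browser):
    try:
        # ... scraping work happens here, and it might raise ...
        raise RuntimeError("page failed to load")
    finally:
        # finally runs whether or not the work above raised,
        # so the browser is always released.
        await browser.close()

browser = FakeBrowser()
try:
    asyncio.run(scrape_with_cleanup(browser))
except RuntimeError:
    pass

print(browser.closed)  # True -- the browser was closed despite the error
```

Even though the scraping step raised an exception, the cleanup still ran, which is exactly what we want from close_browser() in a long-running job.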


The Heart of the Operation: Scraping Category URLs

async def scrape_category_urls(page, category_url, category, cursor):
    """
    Scrape product URLs by navigating through all pages of a category.
    
    Args:
        page: Playwright page object for web interaction
        category_url (str): The URL of the category to scrape
        category (str): The name of the category being scraped (women, men, unisex)
        cursor: SQLite cursor for database operations
    
    Returns:
        list: A list of all product URLs scraped from all pages of the category
    
    The function:
        1. Navigates to the initial category page
        2. Extracts product URLs from the current page
        3. Saves URLs to the database
        4. Clicks the 'Next' button to navigate to the next page
        5. Repeats steps 2-4 until no more pages are available
    
    For each page, the function scrolls down to ensure all products are loaded,
    with a short pause between scrolls to allow content to load.
    """
    await page.goto(category_url, timeout=60000)
    pages = 1
    total_urls = []

    while True:
        print(f"\nScraping Page {pages}...")

        try:
            # Scroll down to load all products, with a short pause
            # between scrolls so lazy-loaded content has time to appear
            for _ in range(10):
                await page.evaluate(
                    "window.scrollTo(0, document.body.scrollHeight)"
                )
                await asyncio.sleep(0.5)
            content = await page.content()
            extracted_urls = extract_product_urls(content)

            if not extracted_urls:
                print("No products found on this page. Ending.")
                break

            save_urls_to_db(cursor, extracted_urls, category)
            total_urls.extend(extracted_urls)
            print(f"✅ Scraped and saved {len(extracted_urls)} URLs from page {pages}")
        except Exception as e:
            logging.error(f"Error scraping page {pages}: {str(e)}")
            break

        # Try to click the 'Next' button
        try:
            next_button = await page.query_selector('#catalog-page-container > div > div.ProductListDesktop_container__08z7c > div.ProductListDesktop_content__3KHXe > div.PlpPagination_paginationWrapper__1AFsm > div > ul > li.next > a')
            if next_button:
                await next_button.click()
                await page.wait_for_timeout(3000)  # wait for next page to load
                pages += 1
            else:
                print("🚫 No 'Next' button found. Reached last page.")
                break
        except Exception as e:
            print(f"❌ Exception while clicking next: {str(e)}")
            break

    return total_urls

Now let’s talk about the scrape_category_urls() function. This part of the script does the real legwork—it goes through a fragrance category on the Noon website and collects the product links one page at a time.


Here’s how it works behind the scenes:

First, the function opens the category page and scrolls to the bottom. This step is important because many websites, including Noon, load more products only when you scroll down. So this ensures we’re seeing everything.


Next, it pauses briefly to give the page enough time to finish loading. Then, it grabs the HTML content of the page and passes it to another function, which is in charge of pulling out the product URLs from that HTML.


Once we have the links, they’re saved into our SQLite database, which keeps everything neatly stored for later use.


The function then looks for a "Next" button on the page. If it finds one, it clicks it to move to the next page and repeats the process.


This loop continues automatically until there are no more pages left to visit. We don’t have to tell the script how many pages to expect—it just keeps going until it reaches the end.


What makes this function solid is that it includes error handling and shows live progress messages. So if something goes wrong, we’ll know exactly where and why—and we can come back and fix it without too much trouble.


Overall, this approach makes the scraper flexible, smart, and capable of handling changes on the site without hardcoding anything.
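Stripped of all the browser details, the loop above follows a simple pattern: extract links, look for a Next button, advance, repeat. A sketch with hypothetical fake pages standing in for the real site:

```python
# A browser-free sketch of the "keep going until there is no Next
# button" loop. The fake pages below are hypothetical stand-ins.
fake_pages = [
    {"urls": ["/p1", "/p2"], "has_next": True},
    {"urls": ["/p3"], "has_next": True},
    {"urls": ["/p4", "/p5"], "has_next": False},  # last page
]

def scrape_all_pages(pages):
    collected = []
    index = 0
    while True:
        page = pages[index]
        collected.extend(page["urls"])  # extract this page's links
        if not page["has_next"]:        # no Next button: we're done
            break
        index += 1                      # "click" Next
    return collected

print(scrape_all_pages(fake_pages))  # ['/p1', '/p2', '/p3', '/p4', '/p5']
```

Because the loop only stops when the Next button disappears, the same code works whether a category has two pages or two hundred.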


Parsing Product URLs: Extracting What Matters

# Function to extract product URLs from the page content
def extract_product_urls(content):
    """
    Extract product URLs from the HTML content using BeautifulSoup.
    
    Args:
        content (str): The HTML content of the page to parse
    
    Returns:
        list: A list of complete product URLs extracted from the page
    
    The function uses a CSS selector to find product link elements and 
    constructs full URLs by prepending the base domain.
    """
    soup = BeautifulSoup(content, "html.parser")
    product_links = soup.select("#catalog-page-container > div > div.ProductListDesktop_container__08z7c > div.ProductListDesktop_content__3KHXe > div.ProductListDesktop_layoutWrapper__Kiw3A > div.ProductBoxLinkHandler_linkWrapper__b0qZ9 > a.ProductBoxLinkHandler_productBoxLink__FPhjp")
    return [f"https://www.noon.com{link.get('href')}" for link in product_links if link.get('href')]

The extract_product_urls() function plays a key role in our scraping process. This is the part where we actually pull the product links out of the webpage and prepare them to be saved.


Here’s what it does, step by step:

First, it uses BeautifulSoup to read the HTML of the page. Think of this like scanning a document for certain words or phrases—we’re just telling the program what to look for in the HTML.


Next, it uses a CSS selector to find only the specific parts of the page that contain product links. This is important because it avoids picking up random or unrelated links—so we stay focused only on what we need.


Often, the links we get from the page are just partial URLs, like /product/xyz123, which by themselves aren’t usable. So the function adds the full Noon website URL to the front, turning it into a complete link like https://www.noon.com/product/xyz123.


This approach is simple but very effective. It keeps the process clean and fast by focusing only on the relevant elements—and nothing extra.


Even though this part of the code might seem small, it’s actually critical to the entire project. If the product links aren’t collected correctly here, then none of the later steps will work. So this function may be small, but it does a big job.
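The URL-joining step can also be done with the standard library’s urljoin, which handles edge cases like stray slashes more gracefully than string concatenation. A small sketch with a hypothetical href:

```python
from urllib.parse import urljoin

# Relative hrefs from the listing page become absolute product URLs.
base = "https://www.noon.com"
relative = "/uae-en/some-perfume/p/"  # hypothetical href from a product card

full_url = urljoin(base, relative)
print(full_url)  # https://www.noon.com/uae-en/some-perfume/p/
```

This is a variant of the f-string approach in extract_product_urls(), not a change to the script itself; either way, the goal is a complete, clickable link.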


Data Persistence: Storing URLs in the Database

# Function to save URLs to the SQLite database
def save_urls_to_db(cursor, urls, category):
    """
    Save the scraped URLs to the SQLite database.
    
    Args:
        cursor: SQLite cursor for database operations
        urls (list): List of product URLs to save
        category (str): The category name (women, men, unisex)
    
    The function inserts each URL with its category into the product_urls table
    and commits the transaction. Exceptions are logged to the error log file.
    """
    try:
        for url in urls:
            cursor.execute("INSERT INTO product_urls (category, product_url) VALUES (?, ?)", (category, url))
        cursor.connection.commit()
        print(f"Saved {len(urls)} URLs for category: {category}")
    except Exception as e:
        logging.error(f"Error saving URLs for category {category}: {str(e)}")

Once we’ve collected the product links, the next step is to save them somewhere safe—and that’s exactly what the save_urls_to_db() function does. It stores each product URL in our SQLite database, along with its category.


Here’s how it works:

The function loops through the list of URLs, and for each one, it adds a new row to the database. Along with the link itself, it also saves the fragrance category (like men's or women’s), so we always know where the link came from.


If something goes wrong during this process—like a connection issue or a duplicate entry—the function catches the error and writes it to the log. This way, the script doesn’t crash, and we can look back later to see what happened.


Using a database has some clear advantages over simply saving the links to a text file or keeping them in memory. For example:

  • It organizes the data in a way that’s easy to search, filter, or sort.

  • It keeps the data safe—even if the script stops halfway, nothing is lost.

  • It lets us build smarter features later, like skipping links we’ve already saved so we don’t waste time re-scraping.


All in all, this function helps us stay organized, efficient, and ready for the next stage in our project.
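One way to implement that “skip links we’ve already saved” idea is a UNIQUE constraint plus INSERT OR IGNORE. Note this is a variant of the table above, not what the script currently does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# A variant of the product_urls table with a UNIQUE constraint, so
# re-inserting a link we already have is silently skipped.
cursor.execute("""
    CREATE TABLE product_urls (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        category TEXT,
        product_url TEXT UNIQUE
    )
""")

url = "https://www.noon.com/uae-en/p1"
for _ in range(3):  # the same link scraped on three different runs
    cursor.execute(
        "INSERT OR IGNORE INTO product_urls (category, product_url) VALUES (?, ?)",
        ("women", url),
    )
conn.commit()

cursor.execute("SELECT COUNT(*) FROM product_urls")
print(cursor.fetchone()[0])  # 1 -- the duplicates were ignored
```

With this in place, re-running the link harvester never bloats the table with repeats.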


Orchestration: Managing the Scraping Process

# Function to handle the scraping of one category
async def scrape_category(category, category_url, cursor):
    """
    Handle the complete scraping process for a single category.
    
    Args:
        category (str): The category name (women, men, unisex)
        category_url (str): The URL for the category's product listing page
        cursor: SQLite cursor for database operations
    
    The function:
        1. Initializes a browser and page
        2. Scrapes all product URLs from the category
        3. Closes the browser when finished
    
    If browser initialization fails, the error is logged.
    """
    browser, page = await init_browser()
    if browser and page:
        print(f"Starting to scrape {category}...")
        await scrape_category_urls(page, category_url, category, cursor)
        await close_browser(browser)
    else:
        logging.error(f"Failed to initialize browser for {category}")

Now let’s look at one of the key functions that manages the actual scraping process—the scrape_category() function.


This function is designed to handle everything needed to scrape one fragrance category, from start to finish. Here’s what it does:

  • First, it starts the browser using Playwright, so we’re ready to navigate the website.

  • Then, it runs the scraping steps to collect product URLs from that specific category.

  • Once that’s done, it closes the browser properly to free up memory and resources.


The beauty of this function is how neat and reusable it is. If you only want to scrape, say, women’s perfumes or just one specific category, you can simply call this function without touching the rest of the code. It keeps things modular and avoids the need for major changes when switching between tasks.


So whether you’re scraping one section or planning to cover the entire site, this function keeps your workflow clean and flexible.

# Main function to manage all categories
async def main():
    """Orchestrate the entire scraping process for all fragrance categories.
    
    The function:
        1. Sets up the database connection
        2. Defines the category URLs to scrape
        3. Iterates through each category and scrapes its product URLs
        4. Closes the database connection when finished
    
    Category URLs are constructed with filters for:
        - women's fragrances
        - men's fragrances
        - unisex fragrances
    
    """
    conn, cursor = setup_db()
    CATEGORY_URLS = {
        "women": "https://www.noon.com/uae-en/beauty/fragrance/?f[is_fbn][]=1&f[fragrance_department][]=women&sort[by]=popularity&sort[dir]=desc&limit=50&page=1&isCarouselView=false",
        "men": "https://www.noon.com/uae-en/beauty/fragrance/?f[is_fbn][]=1&f[fragrance_department][]=men&sort[by]=popularity&sort[dir]=desc&limit=50&page=1&isCarouselView=false",
        "unisex": "https://www.noon.com/uae-en/beauty/fragrance/?f[is_fbn][]=1&f[fragrance_department][]=unisex&sort[by]=popularity&sort[dir]=desc&limit=50&page=1&isCarouselView=false"
    }

    # Loop through categories and scrape URLs
    for category, category_url in CATEGORY_URLS.items():
        await scrape_category(category, category_url, cursor)

    # Close the SQLite connection
    conn.close()

The main() function is the final piece of the puzzle—it’s what actually starts the whole scraping process.


Here’s what it takes care of:

  • It first connects to the database, so we’re ready to store the links we collect.

  • Then, it defines the fragrance categories we want to scrape. Each category (like men’s or women’s) is paired with its corresponding URL. This setup makes it super easy to update later—just add a new entry to the list if Noon adds a new category.

  • It then loops through each category and calls the scraping function to collect product links.

  • Once everything is done, it closes the database to make sure all data is saved properly and resources are released.


This function keeps things tidy and gives us a clear starting point for the entire script. And because we’re using asynchronous programming, there’s room for improvement too. In the future, you could easily modify the script to scrape multiple categories at the same time—making the whole process even faster and more efficient.
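As a taste of that improvement, asyncio.gather can run several category scrapes concurrently. Here’s a toy sketch with dummy coroutines standing in for scrape_category(); in the real script, each concurrent task would also need its own browser page:

```python
import asyncio

# Dummy per-category tasks standing in for scrape_category(); the
# sleep simulates waiting on the network.
async def scrape_one(category):
    await asyncio.sleep(0.01)
    return f"done: {category}"

async def main():
    # gather() runs all three category scrapes concurrently
    # instead of one after another.
    results = await asyncio.gather(
        scrape_one("women"),
        scrape_one("men"),
        scrape_one("unisex"),
    )
    return results

print(asyncio.run(main()))  # ['done: women', 'done: men', 'done: unisex']
```

gather() returns results in the order the tasks were passed in, so the output stays predictable even though the waiting overlaps.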


Putting It All Together: The Script Execution

if __name__ == "__main__":
    # Entry point: Run the main async function
    asyncio.run(main())

This line ensures that the main() function is executed only when the script is run directly, rather than imported from a different file. It initiates the asynchronous scraping process by calling asyncio.run().


Using asyncio also allows the script to accomplish other tasks while waiting on pages to load or waiting on network responses to return, thus making the scraping process much more efficient.


Deep Data Extraction


In the second section of our web scraping project, we are going to go a step further than simply collecting the product URLs.


Now it’s time to extract detailed product information from each of the product URLs we previously collected. While the first script was all about finding and saving links, this one will actually visit each product page and collect some useful data, such as: product title, pricing, ratings, descriptions, and specifications.


This step in the web scraping operation is usually known as the "data enrichment" part of the project. We’re transforming a set of basic URLs into a dataset with rich, structured data that can be used for price comparisons, trend spotting, market analysis, and everything in between.


Let's dive into how the code works to achieve this outcome.


Foundational Setup: Imports and Logging

import asyncio
import sqlite3
import logging
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import random
import json

logging.basicConfig(filename="scraping.log", level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

At the very start of our script, as usual, we import all the Python libraries we need to make everything work smoothly.


We bring in:

  • asyncio – to handle asynchronous tasks, allowing our script to do multiple things at once.

  • sqlite3 – to manage our local database, where we read the saved product URLs and store the product details we scrape.

  • playwright and BeautifulSoup – these are the core tools we use to load webpages and extract the information we care about.

  • random – to help us rotate user agents, which makes our scraper look more like a real person browsing the site and reduces the chance of being blocked.

  • json – to store each product’s specifications as a single structured text value in the database.


But one of the most important additions to our script is logging.


We’ve set up logging so that every message it prints includes a timestamp. This helps us track the script’s progress, figure out where something might go wrong, and even see how long each part of the process takes.


When you’re scraping a lot of data, like we did here, having logging is essential. It’s like having a behind-the-scenes logbook that tells you exactly what happened, when it happened, and what might need fixing.


With all of this in place, our script isn’t just functional—it’s stable, reliable, and ready for real-world use. Whether you’re working on a side project or building something for production, these small details make a big difference.


User Agent Rotation: Avoiding Detection

def load_user_agents(path="user_agents.txt"):
    """
    Load a list of user agents from a text file.
    
    Args:
        path (str): Path to the text file containing user agents, one per line
        
    Returns:
        list: A list of user agent strings
        
    The user agents will be used to randomize the browser's identity for each request,
    helping to avoid detection and blocking by the website's anti-scraping measures.
    """
    with open(path, "r") as f:
        return [line.strip() for line in f if line.strip()]

def get_random_user_agent(user_agents):
    """
    Select a random user agent from the provided list.
    
    Args:
        user_agents (list): List of user agent strings
        
    Returns:
        str: A randomly selected user agent string
    """
    return random.choice(user_agents)

One valuable improvement in this script is the use of user agent rotation. This small but powerful change helps the scraper behave more like a real human browsing the web—rather than a robot making repeated requests.


Here’s how it works: every time the script loads a new page, it randomly selects a user agent from a list. A user agent is basically a short message your browser sends to a website saying, “Hi, I’m Chrome on Windows” or “I’m Safari on an iPhone.” Without rotation, the scraper would always look like the same browser and device, which makes it easy for websites to notice unusual patterns.


To make rotation happen, the script uses two simple functions:

  • One that reads user agents from a file (like "user_agents.txt")

  • And another that picks one at random whenever a new request is made


This approach helps the scraper stay under the radar. Many websites track requests from the same user agent, and if they see too many too quickly, they might block or slow down access. By changing how the scraper “introduces itself” on each page, we make it seem like different people are visiting the site—which is much closer to normal traffic.
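Put together, loading and rotating looks like this. Here the file is simulated with io.StringIO, and the user-agent strings are truncated placeholders rather than real ones:

```python
import io
import random

# Simulate user_agents.txt with an in-memory file; the strings are
# truncated placeholders, not real user agents.
fake_file = io.StringIO(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...\n"
    "\n"  # blank lines are skipped, just like in load_user_agents()
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...\n"
)

user_agents = [line.strip() for line in fake_file if line.strip()]
print(len(user_agents))  # 2 -- the blank line was dropped

# Pick a fresh identity for each request.
chosen = random.choice(user_agents)
print(chosen in user_agents)  # True
```

Calling random.choice() once per page load is all it takes to make consecutive requests introduce themselves differently.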


Database Management: Tracking Progress and Storing Results

def connect_db():
    """
    Connect to the SQLite database containing product URLs.
    
    Returns:
        Connection: SQLite database connection object
    """
    return sqlite3.connect("product_urls.db")

def ensure_scraped_column(cursor):
    """
    Ensure the product_urls_old table has a 'scraped' column to track progress.
    
    Args:
        cursor: SQLite cursor for database operations
        
    The function checks if the 'scraped' column exists and adds it if not.
    The column stores a boolean flag (0/1) indicating whether a URL has been processed.
    """
    cursor.execute("PRAGMA table_info(product_urls_old)")
    columns = [col[1] for col in cursor.fetchall()]
    if "scraped" not in columns:
        cursor.execute("ALTER TABLE product_urls_old ADD COLUMN scraped INTEGER DEFAULT 0")

def fetch_unscraped_urls(cursor):
    """
    Fetch all product URLs that have not been scraped yet.
    
    Args:
        cursor: SQLite cursor for database operations
        
    Returns:
        list: A list of tuples containing (url, category) for unscraped products
    """
    cursor.execute("SELECT url, category FROM product_urls_old WHERE scraped = 0")
    return cursor.fetchall()

def save_product_data(cursor, data):
    """
    Save extracted product data to the database.
    
    Args:
        cursor: SQLite cursor for database operations
        data (tuple): Product data tuple containing:
            - url: Product URL (primary key)
            - category: Product category (women, men, unisex)
            - name: Product name
            - rating: Product rating score
            - total_rating: Number of ratings/reviews
            - sale_price: Current sale price
            - price: Original price
            - discount: Discount percentage
            - description: Product description
            - specifications: JSON string of product specifications
    
    The function creates the product_data table if it doesn't exist and
    inserts or updates the product data using the URL as the primary key.
    """
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS product_data (
            url TEXT PRIMARY KEY,
            category TEXT,
            name TEXT,
            rating TEXT,
            total_rating TEXT,
            sale_price TEXT,
            price TEXT,
            discount TEXT, 
            description TEXT,
            specifications TEXT
        )
    ''')
    cursor.execute('''
        INSERT OR REPLACE INTO product_data (url, category, name, rating, total_rating, sale_price, price, discount, description, specifications)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    ''', data)

def mark_url_as_scraped(cursor, url):
    """
    Mark a URL as having been scraped in the database.
    
    Args:
        cursor: SQLite cursor for database operations
        url (str): The URL to mark as scraped
    
    This prevents re-scraping the same URLs if the script is run multiple times.
    """
    cursor.execute("UPDATE product_urls_old SET scraped = 1 WHERE url = ?", (url,))

This script takes a much smarter and more thoughtful approach when it comes to handling the database. It’s no longer just collecting links and hoping for the best—it actually keeps track of what’s been scraped and what still needs to be done.


To make this happen, the script introduces several helpful functions:

  • connect_db() sets up a connection to our database.

  • ensure_scraped_column() checks that we have a way to mark whether a URL has been scraped or not.

  • fetch_unscraped_urls() pulls out only the links we haven’t visited yet.

  • save_product_data() saves the detailed product information we scrape.

  • And mark_url_as_scraped() updates the database to show that we’ve already collected data from a particular link.


These functions work together to build a scraping process that’s smart and efficient. If something goes wrong—like your internet cuts out or the script crashes—it doesn’t mean starting over. The script can simply resume from where it left off, saving time and preventing duplicate work.
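The implementations of these helpers appear earlier in the script, but the resume logic is easy to sketch. The following is a minimal, hypothetical version of fetch_unscraped_urls(), assuming the product_urls_old table (the one mark_url_as_scraped() updates) has url, category, and scraped columns; the demo uses an in-memory database standing in for the real one.

```python
import sqlite3

def fetch_unscraped_urls(cursor):
    """Return only the (url, category) pairs not yet marked as scraped."""
    cursor.execute(
        "SELECT url, category FROM product_urls_old WHERE scraped = 0"
    )
    return cursor.fetchall()

# Demo: an in-memory database standing in for the real one
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE product_urls_old "
    "(url TEXT PRIMARY KEY, category TEXT, scraped INTEGER DEFAULT 0)"
)
cur.executemany(
    "INSERT INTO product_urls_old (url, category, scraped) VALUES (?, ?, ?)",
    [("https://example.com/p1", "women", 0),
     ("https://example.com/p2", "men", 1)],
)
print(fetch_unscraped_urls(cur))  # only p1; p2 is already marked scraped
```

Because the query filters on scraped = 0, a crashed run simply picks up the remaining rows on the next start.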


The database itself has also been improved. There’s now a new table called product_data, where we store everything we extract from each product page. This includes basic details like the product URL and category, as well as more specific info like the product name, price and discount, customer rating, description, and technical specifications.



Some of the more detailed info—like product specs—is stored in JSON format, which keeps the data tidy and flexible. Since different products can have different types of specs, this format makes it easier to handle that variety.


To avoid duplicates or errors, the script uses a special SQL command: INSERT OR REPLACE. This means if we happen to scrape a product again, it simply updates the existing data instead of creating a messy duplicate. The result? A clean, reliable dataset that’s easy to maintain and ready for deeper analysis later.
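You can see the upsert behaviour in isolation with a tiny in-memory example (the table and values here are simplified stand-ins, not the full schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE product_data (url TEXT PRIMARY KEY, name TEXT, sale_price TEXT)"
)

# First scrape inserts the row...
cur.execute("INSERT OR REPLACE INTO product_data VALUES (?, ?, ?)",
            ("https://example.com/p1", "Oud Mood", "AED 99.00"))
# ...re-scraping the same URL replaces it instead of creating a duplicate
cur.execute("INSERT OR REPLACE INTO product_data VALUES (?, ?, ?)",
            ("https://example.com/p1", "Oud Mood", "AED 79.00"))

cur.execute("SELECT COUNT(*), sale_price FROM product_data")
print(cur.fetchone())  # (1, 'AED 79.00')
```

Because url is the PRIMARY KEY, the second insert overwrites the first row rather than adding a second one, which is exactly the behaviour the scraper relies on.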


This kind of setup makes the script feel more like a serious data pipeline—something that’s robust enough for large scraping projects and smart enough to handle real-world interruptions without falling apart.


Parsing Functions: Extracting Structured Data

def parse_name(soup):
    """
    Extract the product name from the page HTML.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML
        
    Returns:
        str or None: The product name if found, None otherwise
    
    Uses a CSS selector to locate the product name heading on the page.
    """
    tag = soup.select_one("#catalog-page-container > div > div.ProductDetailsDesktop_fullWrapper__DGQAo.ProductDetailsDesktop_noGap__qQjap > div:nth-child(2) > div > div.ProductDetailsDesktop_primaryDetails__6r9u9 > div.ProductDetailsDesktop_coreCtr__ZVN_b > div > h1")  
    return tag.get_text(strip=True) if tag else None

def parse_rating(soup):
    """
    Extract the product rating score from the page HTML.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML
        
    Returns:
        str or None: The product rating (e.g., "4.5") if found, None otherwise
    """
    tag = soup.select_one("#catalog-page-container > div > div.ProductDetailsDesktop_fullWrapper__DGQAo.ProductDetailsDesktop_noGap__qQjap > div:nth-child(2) > div > div.ProductDetailsDesktop_primaryDetails__6r9u9 > div.ProductDetailsDesktop_coreCtr__ZVN_b > div > div.CoreDetails_offerContainer__BNBZp > div.CoreDetails_offerItemCtr__ONued > div.CoreDetails_ratingsAndVariantsCtr__rLNx8 > a > div > div.RatingPreviewStar_starsCtr__cXGit > span")
    return tag.get_text(strip=True) if tag else None

def parse_no_rating(soup):
    """
    Extract the total number of ratings/reviews from the page HTML.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML
        
    Returns:
        str or None: The number of ratings (e.g., "128") if found, None otherwise
    """
    tag = soup.select_one("#catalog-page-container > div > div.ProductDetailsDesktop_fullWrapper__DGQAo.ProductDetailsDesktop_noGap__qQjap > div:nth-child(2) > div > div.ProductDetailsDesktop_primaryDetails__6r9u9 > div.ProductDetailsDesktop_coreCtr__ZVN_b > div > div.CoreDetails_offerContainer__BNBZp > div.CoreDetails_offerItemCtr__ONued > div.CoreDetails_ratingsAndVariantsCtr__rLNx8 > a > div > div.RatingPreviewStar_ratingsCountCtr__VHpPi > div > span")
    return tag.get_text(strip=True) if tag else None

def parse_saleprice(soup):
    """
    Extract the current sale price from the page HTML.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML
        
    Returns:
        str or None: The sale price (e.g., "AED 329.00") if found, None otherwise
    """
    tag = soup.select_one("#catalog-page-container > div > div.ProductDetailsDesktop_fullWrapper__DGQAo.ProductDetailsDesktop_noGap__qQjap > div:nth-child(2) > div > div.ProductDetailsDesktop_primaryDetails__6r9u9 > div.ProductDetailsDesktop_coreCtr__ZVN_b > div > div.CoreDetails_offerContainer__BNBZp > div.CoreDetails_offerItemCtr__ONued > div.CoreDetails_priceCtr__ZfUY9 > div > div > div > div > span.PriceOffer_priceNowText__08sYH")
    return tag.get_text(strip=True) if tag else None

def parse_price(soup):
    """
    Extract the original price from the page HTML.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML
        
    Returns:
        str or None: The original price (e.g., "AED 455.00") if found, None otherwise
    """
    tag = soup.select_one("#catalog-page-container > div > div.ProductDetailsDesktop_fullWrapper__DGQAo.ProductDetailsDesktop_noGap__qQjap > div:nth-child(2) > div > div.ProductDetailsDesktop_primaryDetails__6r9u9 > div.ProductDetailsDesktop_coreCtr__ZVN_b > div > div.CoreDetails_offerContainer__BNBZp > div.CoreDetails_offerItemCtr__ONued > div.CoreDetails_priceCtr__ZfUY9 > div > div > div.PriceOffer_oldAndNewPricesCtr__yhHvc > div.PriceOffer_priceWasCtr__qwKoN > div > span:nth-child(2)")
    return tag.get_text(strip=True) if tag else None

def parse_discount(soup):
    """
    Extract the discount percentage from the page HTML.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML
        
    Returns:
        str or None: The discount percentage (e.g., "28% OFF") if found, None otherwise
    """
    tag = soup.select_one("#catalog-page-container > div > div.ProductDetailsDesktop_fullWrapper__DGQAo.ProductDetailsDesktop_noGap__qQjap > div:nth-child(2) > div > div.ProductDetailsDesktop_primaryDetails__6r9u9 > div.ProductDetailsDesktop_coreCtr__ZVN_b > div > div.CoreDetails_offerContainer__BNBZp > div.CoreDetails_offerItemCtr__ONued > div.CoreDetails_priceCtr__ZfUY9 > div > div > div.PriceOffer_savingPriceCtr__DRd7p > div.PriceOffer_priceSaving__ajbD4 > span")
    return tag.get_text(strip=True) if tag else None

def parse_description(soup):
    """
    Extract the product description from the page HTML.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML
        
    Returns:
        str or None: The product description text if found, None otherwise
    """
    tag = soup.select_one("#catalog-page-container > div > div:nth-child(2) > div:nth-child(1) > section > div > div > div.OverviewTab_container__2ewCs > div > div.OverviewTab_overviewDescriptionCtr__d5ELj")  # Adjust selector as needed
    return tag.get_text(strip=True) if tag else None

One of the things that makes this script easy to understand and maintain is how it handles parsing—that is, pulling specific bits of information from each product page.


Instead of writing one long block of code to grab everything at once, the script is neatly divided into small, focused functions. Each one has a clear name and a single job.

Each of these functions uses a CSS selector to locate exactly the part of the page it needs. This setup is based on a solid understanding of how Noon’s website is structured, so the code can move quickly and accurately through each page.


All of these functions follow the same simple pattern: they search for the element, pull out the text if it exists, and if not, they return None. This keeps things clean and predictable, and it avoids unnecessary errors if a certain detail isn’t available on the page.


This approach has two major benefits:

  1. It’s easy to update. If Noon changes the layout of their site, you only need to tweak the one function related to that change—without having to rewrite your whole script.

  2. It’s resilient. If a product page is missing some info, the scraper doesn’t crash. That field is just left empty, and the script keeps going, smoothly collecting the rest of the data.


In short, this modular design makes the scraper more reliable and much easier to maintain, especially when working with a large number of products or handling sites that occasionally change their structure.
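Since every parser repeats the same select-then-guard pattern, it could even be factored into one shared helper. The sketch below uses a hypothetical select_text() function (not in the original script) and a toy HTML snippet to show the None-on-miss behaviour:

```python
from bs4 import BeautifulSoup

def select_text(soup, selector):
    """Return the stripped text of the first match, or None if absent."""
    tag = soup.select_one(selector)
    return tag.get_text(strip=True) if tag else None

html = '<div><h1 class="title">Oud Mood EDP 100ml</h1></div>'
soup = BeautifulSoup(html, "html.parser")

print(select_text(soup, "h1.title"))      # Oud Mood EDP 100ml
print(select_text(soup, "span.missing"))  # None: a missing field never crashes
```

With a helper like this, each field parser shrinks to a one-line call with its own CSS selector, so a layout change on Noon’s side still only means updating one selector string.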


def parse_product_specifications(soup):
    """
    Extract product specifications from the page HTML and format as JSON.
    
    Args:
        soup (BeautifulSoup): BeautifulSoup object containing parsed HTML
        
    Returns:
        str: A JSON string containing the product specifications as key-value pairs
    
    The function:
        1. Locates the specifications table on the product page
        2. Extracts each specification row (key-value pair)
        3. Builds a dictionary of specifications
        4. Converts the dictionary to a JSON string with indentation
        
    If no specifications are found or an error occurs, an empty JSON object is returned.
    All exceptions are logged for troubleshooting.
    """
    try:
        specifications_section = soup.select(
            "#catalog-page-container > div > div:nth-child(2) > div:nth-child(1) > section > div > div > div.SpecificationsTab_container__uBaMs > div > div > table > tbody > tr"
        )

        if not specifications_section:
            logging.warning("No product specifications found in the given HTML.")
            return json.dumps({})

        specifications = {}

        for tr in specifications_section:
            try:
                tds = tr.find_all("td")
                if len(tds) >= 2:
                    key = tds[0].get_text(strip=True)
                    value = tds[1].get_text(strip=True)
                    specifications[key] = value
                else:
                    logging.warning("Missing <td> elements in <tr> while parsing specifications.")
            except Exception as e:
                logging.error(f"Error processing a <tr> in specifications: {e}")
                continue

        return json.dumps(specifications, indent=4)

    except Exception as e:
        logging.error(f"Error parsing product specifications: {e}")
        return json.dumps({})

The parse_product_specifications() function is a little different from the other parsing functions in the script—and that’s because it deals with more complex, detailed data.


Instead of pulling out just one piece of text, this function goes through a full table of product specifications. These are the technical details about a product—like size, material, scent family, and more. Each row in the table usually has a feature name (like “Brand” or “Weight”) and a value that goes with it.


The function loops through each row and builds a dictionary, where each feature becomes a key, and its value is stored alongside it. This creates a structured collection of all the technical details for that product.


Once that dictionary is complete, it’s turned into a JSON string—a flexible format that works really well for storing this kind of semi-structured data. The script then saves this JSON string in the database.


This method has two big advantages:

  • It preserves the structure of the information, so we can easily analyze or display it later.

  • It’s tolerant to missing or messy data. If one part of the specs table has an issue, the rest of the function still works just fine—the scraper won’t crash.


Because different products often have different sets of specifications, this flexible approach is a perfect fit. It handles all that variation smoothly and ensures we don’t lose valuable data, even when things aren’t perfectly formatted.
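The payoff comes at analysis time: the stored JSON string round-trips back into a dictionary with json.loads(). The spec names and values below are made up for illustration:

```python
import json

# A specifications string, as parse_product_specifications() would save it
specs_json = json.dumps(
    {"Brand": "Armaf", "Volume": "100 ml", "Scent Family": "Woody"},
    indent=4,
)

# Later, during analysis, the JSON becomes a dict again
specs = json.loads(specs_json)
print(specs.get("Volume"))         # 100 ml
print(specs.get("Concentration"))  # None: absent keys stay harmless
```

Using dict.get() for lookups means products that lack a given spec simply yield None instead of raising a KeyError, which mirrors the scraper’s tolerant design.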


Core Scraping Logic: Browser Automation and Extraction

async def scrape_product_page(url, category, user_agents):
    """
    Scrape a single product page to extract all product data.
    
    Args:
        url (str): URL of the product page to scrape
        category (str): Category of the product (women, men, unisex)
        user_agents (list): List of user agent strings to choose from
        
    Returns:
        tuple or None: A tuple containing all extracted product data if successful,
                      None if an error occurs
                      
    The function:
        1. Selects a random user agent for the browser
        2. Launches a browser with the chosen user agent
        3. Navigates to the product URL
        4. Extracts all product information using the parsing functions
        5. Returns the compiled data as a tuple
        
    If any error occurs during scraping, it is logged and None is returned.
    """
    try:
        ua = get_random_user_agent(user_agents)
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)  # headless=False opens a visible window; set True for unattended runs
            context = await browser.new_context(user_agent=ua)
            page = await context.new_page()
            await page.goto(url, timeout=60000)
            await page.wait_for_timeout(3000)

            content = await page.content()
            soup = BeautifulSoup(content, "html.parser")

            name = parse_name(soup)
            rating = parse_rating(soup)
            total_rating = parse_no_rating(soup)
            sale_price = parse_saleprice(soup)
            price = parse_price(soup)
            discount = parse_discount(soup)
            description = parse_description(soup)
            specifications = parse_product_specifications(soup)

            await browser.close()
            return (url, category, name, rating, total_rating, sale_price, price, discount, description, specifications)

    except Exception as e:
        logging.error(f"Failed scraping URL: {url} - {str(e)}")
        return None

The scrape_product_page() function is really the core of the entire scraping process—it’s where everything comes together.


The function begins by launching a browser using Playwright. Before visiting the page, it picks a random user agent from our list. This small trick helps the scraper act more like a real user and reduces the chances of being blocked by the website.
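get_random_user_agent() is defined earlier in the script; its job is almost certainly a one-liner around random.choice. A minimal sketch, with placeholder user-agent strings:

```python
import random

def get_random_user_agent(user_agents):
    """Pick one user-agent string at random for this browser session."""
    return random.choice(user_agents)

# Placeholder strings; the real list is loaded by load_user_agents()
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
ua = get_random_user_agent(user_agents)
print(ua in user_agents)  # True
```

Rotating the string per session means consecutive requests don’t all present an identical browser fingerprint.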


Once the browser is up, it navigates to the product’s URL and waits for the page to fully load. This includes all the content that’s generated by JavaScript—something many modern websites, like Noon, rely on heavily. Waiting ensures we don’t miss any important details.


After the page is ready, the HTML is captured and sent to BeautifulSoup, which parses it and passes it through the parsing functions we created earlier. These functions pull out all the key product information: name, price, ratings, description, and technical specifications.


When all the data is collected, the browser is closed properly to free up system memory—especially important when scraping many products in a row.


This function is powerful because it combines the browser automation of Playwright (for handling dynamic content) with the simplicity of BeautifulSoup (for extracting clean data). On top of that, it includes error handling, so even if one product fails to load or parse correctly, the script keeps going smoothly.
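The script simply skips a URL that fails, but a natural extension is to retry transient failures before giving up. This is a hedged sketch of an optional retry-with-backoff wrapper (with_retries is a hypothetical name, not part of the original script) that could wrap any async scrape call:

```python
import asyncio

async def with_retries(coro_fn, *args, attempts=3, base_delay=1.0):
    """Retry an async callable a few times, doubling the wait each attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return await coro_fn(*args)
        except Exception:
            if attempt == attempts:
                return None  # give up, mirroring scrape_product_page's behaviour
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))

# Demo: a coroutine that fails twice, then succeeds
calls = {"n": 0}
async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient network error")
    return "ok"

print(asyncio.run(with_retries(flaky, base_delay=0.01)))  # ok
```

In main(), the call could become `await with_retries(scrape_product_page, url, category, user_agents)`, so a single timeout no longer costs you the product.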


Orchestration: The Main Function and Program Flow

async def main():
    """
    Main function to orchestrate the product data scraping process.
    
    The function:
        1. Loads user agents for browser fingerprint randomization
        2. Connects to the database and ensures required columns exist
        3. Fetches all unscraped product URLs
        4. Scrapes data from each product URL
        5. Saves successful results to the database
        6. Marks URLs as scraped to avoid reprocessing
        
    Progress is logged throughout the process, with success and error messages
    to aid in monitoring and troubleshooting.
    """
    user_agents = load_user_agents()
    conn = connect_db()
    cursor = conn.cursor()
    ensure_scraped_column(cursor)
    conn.commit()

    urls = fetch_unscraped_urls(cursor)
    logging.info(f"Total unscraped URLs: {len(urls)}")

    for url, category in urls:
        logging.info(f"Scraping: {url}")
        product_data = await scrape_product_page(url, category, user_agents)
        if product_data:
            save_product_data(cursor, product_data)
            mark_url_as_scraped(cursor, url)
            conn.commit()
            logging.info(f"✅ Saved data for: {url}")
        else:
            logging.warning(f"⚠️ Skipped URL due to error: {url}")

    conn.close()
    logging.info("🎉 Done scraping all products.")

if __name__ == "__main__":
    # Entry point: Run the main async function
    asyncio.run(main())

The main() function is the central hub of the entire scraping process. It’s where everything gets started and tied together.


It begins by loading the list of user agents and connecting to the database. This setup step makes sure the database is ready with the right tables and structure to store all the product details we’re about to collect. Then, it pulls a list of product URLs that haven’t been scraped yet, so we only work on fresh data. This avoids wasting time on duplicates or re-scraping old products.


From there, the function goes through each URL one by one. For each product page, it:

  1. Calls the scrape_product_page() function to collect the data

  2. Saves the product info to the database

  3. Marks that URL as scraped, so it won’t be touched again next time


This step-by-step process may sound simple, but it’s very intentional. By processing the URLs one at a time, the script avoids overloading the website with too many requests. It’s a respectful and responsible approach to scraping.
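Beyond the 3-second page wait, you could make the pacing even gentler with a randomized pause between products. This is an optional addition, not part of the original script; polite_pause is a hypothetical helper:

```python
import asyncio
import random

async def polite_pause(min_s=2.0, max_s=5.0):
    """Sleep a random interval so requests don't arrive at a fixed rhythm."""
    delay = random.uniform(min_s, max_s)
    await asyncio.sleep(delay)
    return delay

# In main(), this could run after each product: await polite_pause()
# (tiny bounds here just to keep the demo fast)
delay = asyncio.run(polite_pause(0.01, 0.02))
print(0.01 <= delay <= 0.02)  # True
```

Randomizing the gap avoids the perfectly regular request cadence that rate-limiting systems look for, while keeping overall load on the site low.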


Throughout the run, the function uses logging messages to show what’s happening in real time. These logs include emojis—like ✅ checkmarks for successful scrapes and ⚠️ warnings when something goes wrong. This makes the updates more readable and even a little fun, which is especially nice when you're scraping hundreds of pages and want to keep track without reading dry messages.


Conclusion


Together, these two scripts form a smart and efficient scraping system—one script collects the product links, and the other dives into each link to gather detailed product information.


What makes this setup strong isn’t just the tools it uses, but the way it’s built. By rotating user agents, the script avoids getting flagged or blocked by the website. It’s also designed with error handling, so if one product page fails, the rest of the process keeps running without a hitch. All the data gets stored neatly in a database, and the script keeps track of what’s already been scraped—saving time and avoiding duplicate work.


Another great aspect is how it’s designed to respect the website. It doesn’t overload the servers by making too many requests at once, which makes it more sustainable for long-term use.


In the end, this system collects rich, well-structured data that can be used for all kinds of insights—like analyzing market trends, tracking pricing strategies, or simply exploring what kinds of fragrance products are being offered. Whether you’re a beginner exploring web scraping or someone working on a more advanced project, this setup gives you a strong foundation for collecting and working with real-world data.


FAQ SECTION


1) Is it legal to scrape data from Noon?

Web scraping legality depends on how the data is collected and used. Publicly available product information can generally be collected for research or competitive analysis, but you should always review the website’s terms of service and robots.txt file. Avoid scraping personal data and follow ethical scraping practices.


2) Why use Playwright instead of BeautifulSoup alone?

BeautifulSoup is great for parsing static HTML. However, Noon loads product listings dynamically using JavaScript. Playwright automates a real browser, allowing the page to fully render before extracting data, making it ideal for dynamic ecommerce websites.


3) What data can be extracted from Noon’s fragrance store?

You can extract product names, brand names, prices, ratings, number of reviews, product descriptions, availability status, and category information. This data can be used for price tracking, brand analysis, and market research.


4) Why store scraped URLs in a SQLite database?

Using SQLite helps organize and preserve scraped data efficiently. It prevents data loss, supports filtering by category, and makes it easier to scale the scraping workflow for future analysis without re-scraping the same pages.


5) How can scraped fragrance data be used for business insights?

Fragrance data can reveal pricing trends, brand popularity, discount patterns, rating distribution, and assortment gaps. Businesses can use these insights for competitive intelligence, pricing optimization, and product strategy decisions.


AUTHOR


I’m Shahana, a Data Engineer at Datahut, where I design and develop scalable data pipelines that transform messy, complex web content into clean, structured datasets—especially for e-commerce, pricing intelligence, and product tracking.


In this blog, I shared a step-by-step walkthrough of a real-world scraping project where we collected detailed product data from Noon’s fragrance category. Using tools like Playwright, BeautifulSoup, and SQLite, we built a robust solution that handles dynamic content, rotates user agents to avoid detection, and stores the data in a clean, organized format for analysis.


At Datahut, we focus on building practical, responsible web scraping systems that are ready for real-world challenges—whether it’s managing large-scale data collection or ensuring the scraper can recover gracefully from interruptions.


If your team is looking to automate product data collection in the fragrance space or beyond, reach out to us through the chat widget on the right. We’d love to help you build a solution that fits your goals.

Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
