How to Scrape Boutiqaat for Actionable Perfume Market Insights?
- Anusha P O
- Jul 16
- 38 min read
Updated: Jul 18

Have you ever thought about how companies manage to track thousands of products, prices, and even trends in multiple online stores all at once? That's where web scraping comes in: the automated way of collecting data from websites, comparable to a personal assistant that works around the clock. Web scraping helps business aggregators retrieve product information such as details, prices, and reviews much faster than the time it would take doing it manually.
Boutiqaat is one of the companies transforming the beauty and lifestyle market in the Middle East. With Arabic fragrances, niche perfumes, and scents from almost every corner of the world, its product list seems endless. A catalog that large is impossible to sort and analyze by hand, which is exactly the problem automated data collection solves.
For this project, we utilized up-to-date technologies to scrape data from the website Boutiqaat, focusing on three categories of interest: Arabic Fragrances, International Fragrances, and Niche Fragrances.
Automatically collecting perfume listings and storing them in a tidy, logical format boosts customers’ and businesses’ awareness. Customers can browse through the myriad of options available, and retailers can utilize accurate information to make better business decisions.
Whether you're a curious techie, a data analyst, or a brand strategist, this guide breaks down how we built a reliable, efficient scraping system to collect perfume data from Boutiqaat—step by step, and in a way that’s easy to understand.
About Boutiqaat
Boutiqaat is not simply an eCommerce platform; it is a beauty and lifestyle destination that has transformed shopping behavior for Middle Eastern consumers. Founded in Kuwait, Boutiqaat combines luxury goods from regional traders with the influence of celebrities and beauty influencers from the region. It offers an extensive selection of makeup, skincare, and fragrances from both regional and international labels, and it leans heavily on trust and authenticity, which turns shopping into a distinctive experience.
In this project, we aimed to obtain product data from three primary fragrance categories within the women's section: Arabic Fragrances, Niche Perfumes, and an all-encompassing Fragrances category. This data matters for both consumers and businesses: customers get a clearer picture of the available options, trends, and pricing, while businesses can recalibrate their offerings and marketing approaches based on the data. By collecting and structuring it, we close the gap between the available supply and actual market demand.
Automated and Intelligent Data Collection
We collected Boutiqaat's data in two phases: gathering product page links, then extracting the data behind each link and preparing it for further analysis. We focused on three major fragrance categories from the women’s section—Arabic Fragrances, International Fragrances, and Niche Perfumes.
Step 1: Gathering Product Links
We began by collecting all the product links available under each of the fragrance categories. Like many modern shopping websites, Boutiqaat uses infinite scrolling: products load automatically as you scroll, so not everything appears on the page at once. This calls for a more specialized approach. We built a scraper with Playwright, which lets us control a real web browser and scroll down the page as a human would. No advanced masking techniques were employed; it was all about timing with Playwright.
The scraper is programmed to scroll down the page slowly, wait for new items to load, and harvest all relevant product links. We also taught it to deal with the annoying pop-ups that obscure the page: it checks for them and closes them if they appear. This keeps the scraping process streamlined and efficient.
We saved each product link and the category it belongs to in a SQLite database rather than dumping everything into a single file. This facilitates organization and makes it simple to retrieve data at a later time. To keep things tidy and effective, we even made sure the scraper doesn't save duplicate links.
By the end of this phase, we had gathered a substantial collection of clean, working product URLs from each of the three fragrance sections and were ready for the next stage of the data journey.
Step 2: Data Extraction from Product Links
Once the product URLs are stored in the SQLite database, the detailed information for each product still has to be scraped. Each product page must be visited individually to extract its details. Browser automation tools like Playwright help here: they render pages the way a real browser does, which reduces the chance of being flagged and ensures the pages are fully rendered before any information is gathered.
For each product, the script gathers the item name, current and old price, review count, description, brand name, discount offered, specifications such as size and SKU, and availability for purchase. This information is stored in a MongoDB database, which makes it easy to access and analyze later.
To improve reliability, every scraped product is also appended as a single stripped-down line to a JSONL backup file. If the script freezes, crashes, or halts mid-way, no data is lost.
Together, these approaches make it possible to collect, document, and store richly detailed product data reliably.
Data Cleaning
With the product URLs already in the SQLite database, the next step is to pull the details from each product page. A Playwright script mimics normal user behavior while navigating the site, which helps avoid detection and ensures every page is fully loaded. For each product, it collects the name, brand, current price, old price, discount, review count, description, characteristics such as size or SKU, stock availability, and more. Everything is stored in a MongoDB database, which makes the data easier to handle and analyze later. To make the pipeline more resilient, the script also writes a local backup line to a JSONL file for every product, so even if the machine crashes without warning or the scraping process halts mid-way, the data is preserved. This combination gives us reliable storage during scraping and captures an extensive product dataset.
Powerful Tools and Libraries for Smarter Data Extraction
The data extraction procedure uses a combination of Python libraries and tools, each serving a different purpose, from gathering information and automating the browser to storing data and handling errors. Below is a plain-language overview of each.
sqlite3: SQLite is a lightweight, file-based database that runs inside Python with no additional setup or server required. It works like a neatly organized digital notebook that keeps a record of URLs and tells you which links have already been processed. Because it saves everything in a structured way, it prevents you from redoing the same work, which matters when a scrape covers thousands of pages.
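To make that concrete, here is a minimal, self-contained sketch of the pattern this project relies on; the file name and the example URL are purely illustrative, while the table layout mirrors the script shown later.
import sqlite3

# Minimal sketch of the SQLite pattern used in this project:
# a small table of links with a UNIQUE constraint to avoid duplicates.
conn = sqlite3.connect("example_links.db")          # illustrative file name
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS links (id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE, category TEXT)")
cur.execute("INSERT OR IGNORE INTO links (url, category) VALUES (?, ?)",
            ("https://www.boutiqaat.com/example/p/", "Arabic Fragrances"))   # illustrative URL
conn.commit()
conn.close()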
logging: The logging module is the quiet record-keeper of the project. It silently notes every step the script takes: scraping a link, saving data, hitting an error, and the traceback for that error. Instead of scattering print() statements, tidy logs are preserved in the project directory, which makes it easy to trace what caused a failure and is especially valuable during overnight or large-scale scraping sessions.
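A minimal configuration, matching the pattern used later in the scripts (the log file name here is illustrative):
import logging

# Write INFO-and-above messages to a log file with a timestamp and severity level.
logging.basicConfig(
    filename="example_scraper.log",          # illustrative file name
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logging.info("Scraper started")
logging.warning("Popup could not be closed, continuing anyway")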
playwright.sync_api: Playwright is the engine driving the browser automation. It opens pages, scrolls, clicks buttons, and more, just as a human would. Modern websites that render content with JavaScript demand such capabilities. In our case, the simpler synchronous API (sync_playwright) is used, so the code stays straightforward by executing one operation at a time.
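A bare-bones illustration of the synchronous API, a sketch rather than the project's full scraper:
from playwright.sync_api import sync_playwright

# Open a page with a real (headless) browser, wait for it to settle, and read its HTML.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.boutiqaat.com", timeout=60000)
    page.wait_for_load_state("networkidle")
    html = page.content()          # fully rendered HTML, ready for parsing
    browser.close()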
time: The time module offers only basic functionality, but it is an integral part of polite scraping. Pausing with time.sleep() mimics the natural rhythm of a person browsing, which further decreases the chances of getting flagged as a bot.
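For example, a short pause between scroll steps; the three-second delay simply mirrors the value used later in the scraper, and the loop body is a placeholder for whatever action triggers new content.
import time

# Pause between actions so the scraper behaves more like a person browsing.
for scroll in range(5):
    # ... scroll or request the next batch of products here ...
    time.sleep(3)      # wait a few seconds before the next action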
BeautifulSoup: BeautifulSoup is a library designed for parsing HTML and pulling out the data you want. After Playwright has loaded a webpage, BeautifulSoup takes the rendered content and systematically extracts clean, structured text such as product names, prices, or descriptions. It simplifies the task of navigating complex webpage layouts.
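A small sketch of the parsing step; the HTML snippet and class names below are invented purely for illustration and are not Boutiqaat's real markup.
from bs4 import BeautifulSoup

# Illustrative HTML only; real selectors come from inspecting the actual page.
html = '<div class="card"><h1 class="name">Example Perfume</h1><span class="price">KWD 25</span></div>'
soup = BeautifulSoup(html, "html.parser")

name = soup.select_one("h1.name").get_text(strip=True)
price = soup.select_one("span.price").get_text(strip=True)
print(name, "-", price)    # Example Perfume - KWD 25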
json: This module saves data in a standard, compact structure that is easy to share, back up, or read. Here it is used to create .jsonl (JSON Lines) files, where every product is stored on its own line. This is ideal for backups, debugging, or simply sharing data with others.
urllib.parse: URLs are sometimes long, messy, or inconsistent. The urllib.parse module is used to parse and rebuild URLs in a consistent format, which lets us drop duplicate path segments so the same link is not stored twice.
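For instance, splitting a URL into its parts and rebuilding it without query strings or fragments; the URL here is illustrative.
from urllib.parse import urlparse, urlunparse

# Break a URL into components, then rebuild it without query string or fragment.
parsed = urlparse("https://www.boutiqaat.com/en-kw/women/some-perfume/p/?utm_source=ad#top")
clean = urlunparse((parsed.scheme, parsed.netloc, parsed.path, "", "", ""))
print(clean)    # https://www.boutiqaat.com/en-kw/women/some-perfume/p/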
pymongo: pymongo connects the scraper to MongoDB, a flexible NoSQL database. Unlike SQLite, which organizes information in rows and tables, MongoDB stores documents in a JSON-like structure. This suits product data well, since not every item has the same fields.
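A minimal sketch of storing one document, assuming a MongoDB server is running locally; the connection string, database, and collection names mirror the constants defined later, while the product dictionary is illustrative.
from pymongo import MongoClient

# Connect to a local MongoDB server and insert one product document.
client = MongoClient("mongodb://localhost:27017/")
collection = client["boutiqaat"]["products"]

product = {"url": "https://www.boutiqaat.com/example/p/",    # illustrative values
           "product_name": "Example Perfume",
           "price": "KWD 25.000"}
collection.insert_one(product)
print(collection.count_documents({}))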
Combined, these tools form a robust infrastructure for responsible, professional-grade data extraction, cleaning, storage, and backup. Each library has a distinct purpose, and together they make the whole system more resilient and flexible.
STEP 1: Scraping Product URLs from Women's Perfumes on Boutiqaat
Importing Libraries
import sqlite3
import logging
from playwright.sync_api import sync_playwright
import time
The scraping of Boutiqaat's women's section starts by combining a small set of libraries, each with a distinct role in making the process efficient, reliable, and scalable. At its core, sqlite3 provides a lightweight, file-based database for tracking and storing product URLs, which lets the scraper resume from where it left off instead of repeating work, a real benefit on larger projects. The logging module keeps an organized record of progress, warnings, successes, and errors, which simplifies debugging and monitoring during long scraping sessions. Real-time browser control comes from playwright.sync_api, which allows multi-step interaction with JavaScript-heavy websites such as Boutiqaat that do not render fully as static HTML; this is essential for scraping modern e-commerce platforms. Finally, Python's built-in time module adds pauses where needed, simulating natural browsing behavior and reducing the chance of being flagged. Together, these libraries provide a cohesive environment for extracting data from dynamic websites with lower risk and higher efficiency.
Keeping Track of Progress with Logging
# Setup Logging
logging.basicConfig(
    filename="boutiqaat_url_scraper.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

"""
Configures the basic logging system for the application.

Logging Setup:
-------------
The logging configuration creates a detailed record of the script's operation:
- All logs are written to the file 'boutiqaat_url_scraper.log'
- Only messages with INFO level priority or higher are recorded (INFO, WARNING, ERROR, CRITICAL)
- Each log entry includes:
    * Timestamp: When the event occurred
    * Level: How important the message is (INFO, WARNING, ERROR, etc.)
    * Message: Description of what happened
"""
Logging is a significant part of this project because it makes the scraper efficient and its problems traceable. The script uses Python's built-in logging module to monitor the scraping procedure, recording the scraper's operational activity in detail. Log output goes to a file named boutiqaat_url_scraper.log, which is created automatically if it does not exist. The logging system records all messages at or above the INFO level, including warnings, errors, and critical failures. Each log entry carries a timestamp and is tagged with the event's severity, so developers know exactly when and where something happened and can debug easily. This is particularly helpful when scraping hundreds of product pages, because it shows progress and pinpoints which pages failed or were skipped. Rather than guessing where the scraper left off, the logs serve as a timeline of the script's activity, a virtual paper trail that makes debugging and monitoring much easier. In much the same way that the database tracks data, logging tracks the health and activity of the scraper.
Database and Website Configuration
# Database Configuration
DB_NAME = "/home/anusha/Desktop/DATAHUT/Boutiqaat/Data/boutiqaat_full_urls.db"
BASE_URL = "https://www.boutiqaat.com"

"""
Database and Website Configuration:

DB_NAME: The full path to the SQLite database file where scraped URLs will be stored.
         This database acts as persistent storage for all collected product links.
BASE_URL: The root domain of the Boutiqaat website. This is used to convert
          relative URLs to absolute URLs when necessary.
"""
The configuration section of the script sets the foundation for storing and organizing the data extracted from the Boutiqaat Women’s section. The DB_NAME variable points to the full file path of the SQLite database — a lightweight and file-based database system. In this case, the database is named boutiqaat_full_urls.db and is located in a folder named Data on the user's desktop. This database acts as a secure container for holding all the URLs of the women's perfume products scraped from the website.
Next, the BASE_URL is defined as "https://www.boutiqaat.com", which serves as the main address of the website. It plays an essential role in combining partial links (called relative URLs) from the site with this base address to form complete product URLs (absolute URLs). This setup ensures that even if the site only gives part of a link, the script can convert it into a full, usable web address. Altogether, these settings are crucial for organizing and guiding the scraping process smoothly from the very beginning.
Setting Up User-Agent Headers for Safe and Seamless Scraping
# User-Agent Headers
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}

"""
Browser Headers:

HEADERS: Contains the User-Agent string that identifies our web browser to the website.
         This helps the script appear as a regular web browser rather than an automated tool.
         The User-Agent mimics Chrome on Linux to ensure compatibility with the website.
         Proper headers help avoid being blocked by the website's security measures.
"""
When scraping information from websites such as Boutiqaat's women's section, we should ensure that our script does not look suspicious or bot-like. That is where browser headers come in, specifically the HEADERS dictionary in our script. It contains a User-Agent, a small piece of data that tells the website what type of browser is making the request.
In this instance, the User-Agent mimics Google Chrome on a Linux machine, a combination many real users have. Adding it helps our scraping script blend in and behave like a typical visitor browsing the site. Without it, the site might recognize that a bot is visiting and block or limit access to the pages.
Building the Foundation: Creating a Database for Perfume Links
# Database setup
def create_db():
    """
    Initialize the SQLite database with a 'links' table.

    This function creates a new SQLite database (if it doesn't exist) with a table
    structured to store product URLs and their associated categories. It includes:
    - id: An auto-incrementing primary key
    - url: A unique text field to store product URLs
    - category: A text field to store the category of each product

    If the database or table already exists, this function will not modify them.

    Returns:
        None

    Raises:
        Exception: Logs any error that occurs during database creation
    """
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS links (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT UNIQUE,
                category TEXT
            )
        """)
        # - Creates the database file if it doesn't exist
        # - Won't overwrite existing data if the table already exists
        # - 'UNIQUE' ensures no duplicate URLs can be stored
        # - Remember to close the connection when done
        conn.commit()
        conn.close()
        logging.info("Database initialized.")
    except Exception as e:
        logging.error(f"Database creation failed: {e}")
We should have a way to save links to the product page of each perfume safely before we start data collection on women's perfumes from the Boutiqaat website. All of this part of the script is for that purpose only. It creates a small but efficient database using SQLite, which is a Python library that allows you to create and use databases without having to download anything else.
The function defined here is create_db(), and its job is to specify the database structure. Inside it, we create a table named links, where each row will hold one perfume product URL and its category (e.g., "Niche Perfumes" or "Arabic Fragrances"). Each record is assigned an automatically incremented id number, which keeps every link distinct.
Even if this function is executed more than once, it will not disturb existing data: it first checks whether the table already exists and only creates it if it doesn't. This makes the code both reusable and safe. If a problem occurs during this operation (for example, if the database connection fails), the script records what went wrong through the logging system, which makes it straightforward to determine where things went awry.
In brief, this section of code goes quietly about laying the groundwork for your scraping efforts — a spot to house and keep tabs on each perfume link prior to further data extraction.
Saving Product URLs Without Duplicates
# Save URL to database
def insert_link_to_db(link, category):
    """
    Insert a product URL and its category into the database.

    This function attempts to add a new URL to the database. The UNIQUE constraint
    on the 'url' field prevents duplicate entries. If a URL already exists in the
    database, it will be skipped and logged accordingly.

    Args:
        link (str): The complete product URL to be stored
        category (str): The category name associated with the product
    """
    try:
        conn = sqlite3.connect(DB_NAME)
        cursor = conn.cursor()
        cursor.execute("INSERT OR IGNORE INTO links (url, category) VALUES (?, ?)", (link, category))
        if cursor.rowcount == 0:
            logging.info(f"Duplicate URL skipped: {link}")
        else:
            logging.info(f"Inserted URL: {link}")
        conn.commit()
        conn.close()
    except Exception as e:
        logging.error(f"Error inserting link: {link} → {e}")
When web scraping women's perfume products on the Boutiqaat website, you should store each product URL in some sort of organized way so you'll be able to come back and scrape in-depth information later. This part of the code does that—it saves each perfume's URL along with its category ("Niche Perfume" or "Arabic Fragrance") into a database.
The function declared here is insert_link_to_db(link, category). It adds perfume product URLs to the SQLite database, a clean central place to keep all the links. Whenever the script finds a new product link, it calls this function to store it.
But you don't need to save the same link more than once. To avoid duplication, the database enforces a rule: only unique URLs are allowed. So if the script attempts to add an existing URL, it silently skips it and writes a message to the log file noting the duplicate. If the link is new, it is saved and a success message is logged.
It also employs something referred to as logging to monitor what's going on—whether the link was saved successfully or skipped because it already existed. If something fails, such as a database connection error, it logs that as well, so it's easier to debug later.
In short, this function ensures that all of the perfume product links are saved cleanly, without repetition, and any issues encountered along the way are logged for examination. This is an important step because it sets the foundation for the second stage of your project—extracting the real perfume data from each of these saved links.
Smart Scrolling to Gather Perfume URLs from Boutiqaat
def fetch_page_content(url, category, max_scrolls=300):
    """
    Visit a category page and extract all product URLs by scrolling through the page.

    This function uses Playwright to automate a browser session that:
    1. Navigates to the specified category URL.
    2. Repeatedly scrolls down to trigger lazy loading of products.
    3. Extracts product URLs from anchor tags and ensures they are unique.
    4. Stores these unique URLs in the database.
    5. Detects when no new content is loaded after scrolling and stops scrolling.

    It also includes mechanisms to:
    - Close any popup dialogs that might interfere with scraping.
    - Track URLs that have already been processed to avoid duplicates.
    - Stop scrolling if no new content is added after several attempts.

    Args:
        url (str): The category page URL to scrape. This is the page that contains the product links.
        category (str): The category of products being scraped (e.g., 'Niche Perfumes', 'Arabic Fragrances').
        max_scrolls (int, optional): The maximum number of scroll operations to perform. Defaults to 300.
            This ensures that the scraping process doesn't run indefinitely if there are a large number of products.

    Returns:
        None: This function doesn't return any values. It performs scraping and data insertion directly.
    """
    try:
        with sync_playwright() as p:
            # Launch a visible browser for debugging purposes.
            # Keep headless=False while debugging; switch to headless=True for faster, invisible runs.
            browser = p.chromium.launch(headless=False)
            context = browser.new_context(user_agent=HEADERS["User-Agent"])
            page = context.new_page()

            logging.info(f"Navigating to {url}")
            # Navigate with a 60-second timeout, wait until the page is fully loaded,
            # then allow some time for it to stabilize
            page.goto(url, timeout=60000)
            page.wait_for_load_state("networkidle")
            time.sleep(3)

            # Close popup if visible: check the common close-button selectors;
            # if no popup is found or it couldn't be closed, continue
            try:
                if page.is_visible("button.close"):
                    page.click("button.close")
                elif page.is_visible(".popup-close"):
                    page.click(".popup-close")
            except Exception:
                pass

            seen_urls = set()        # Track URLs we've already processed
            same_scroll_count = 0    # Count scrolls that load no new content
            last_height = 0          # Track page height to detect when scrolling stops adding content

            # Scroll up to `max_scrolls` times
            for scroll in range(max_scrolls):
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)")  # Scroll to the bottom
                time.sleep(3)                                                    # Wait for new content to load
                current_height = page.evaluate("document.body.scrollHeight")     # Get the new page height

                # Find all product links by matching the 'href' attribute and log how many were found
                anchors = page.query_selector_all("a[href*='/p/']")
                logging.info(f"Scroll #{scroll+1}: {len(anchors)} links found")

                # Loop through all found anchor tags (product links)
                for a in anchors:
                    try:
                        href = a.get_attribute("href")
                        # Filter out invalid links, complete relative URLs,
                        # and only process URLs we haven't seen before
                        if href and href.endswith("/p/") and "/product/" not in href:
                            full_url = BASE_URL + href if href.startswith("/") else href
                            if full_url not in seen_urls:
                                seen_urls.add(full_url)
                                insert_link_to_db(full_url, category)
                    except Exception as link_error:
                        logging.warning(f"Failed to process link: {link_error}")

                # Stop when the page height has not changed for several consecutive scrolls
                if current_height == last_height:
                    same_scroll_count += 1
                    if same_scroll_count >= 5:
                        logging.info("No more new content, scrolling ended.")
                        break
                else:
                    same_scroll_count = 0
                last_height = current_height

            browser.close()
    except Exception as e:
        # Log any errors that occur during scraping
        logging.error(f"Exception while scraping {url} → {e}")
This method, fetch_page_content, is designed to navigate to a perfume category page on the Boutiqaat site (e.g., "Niche Perfumes" or "Arabic Fragrances") and automatically extract the URLs for individual perfume items. Since all the items are not loaded simultaneously, the page must be scrolled several times—just as a human user would—to cause new items to become visible.
The primary intention is to scroll down and retrieve all the links of the products, recognize every product on the page, and store that link into a database where it could be analyzed subsequently.
Controlling the Browser with Playwright: The function employs a library named Playwright, which is a robot that opens and operates a browser (Chrome/Firefox) like a human. It can scroll down, press buttons, and wait for pages to load.
The browser is launched in visible mode (headless=False), which is perfect for debugging: you actually see the script in action in real time as it clicks, scrolls, and fetches links. Once you are satisfied that everything runs correctly, you can switch to headless mode so it executes faster and behind the scenes.
Dealing with Popups and Page Loading: A few websites display popups that may obstruct the screen (such as offers or cookie alerts). The script tries to detect and close such popups automatically. It waits for the page to load fully, giving a couple of seconds to settle before it starts scrolling and scraping data.
This prevents the scraper from being interrupted by unrelated distractions that could prevent it from accessing the perfume products.
Smart Scrolling to Load More Products: The product pages use a technique called lazy loading, so all items are not shown at once. Instead, when you scroll, more items are shown. The function mimics this behavior by:
Scrolling to the bottom of the page
Waiting a few seconds for more content to be loaded
Checking how far it scrolled by comparing the page height
Repeating this cycle up to 300 times (or until nothing new loads)
If it realizes that no new content loads for five scrolls consecutively, it stops scrolling—because that likely indicates that everything has loaded.
Gathering and Saving Product Links: Whenever new products load, the function searches for certain links (those with "/p/" in their URL, which typically indicates it's a product page). It ensures:
The link is correct and not a duplicate
It hasn't been saved before
It gets converted to a full URL (if necessary)
It gets stored in a database to be analyzed later
A set named seen_urls is utilized to keep track of which links have already been gathered, so there's no redundancy.
Logging Every Event That Happens: At every step, the function records what it does. It logs when:
A page is navigated
Every scroll is finished and the number of links encountered
A link is saved successfully or skipped as a duplicate.
There has been an error.
If something goes wrong (like the page not loading or the structure not being the same), the script doesn't fail silently; instead, it logs the exact error so that it can be easily fixed later.
In Short
This tool is a complete automation process that:
Traverses a perfume category on Boutiqaat
Scrolls to load all items available
Detects and closes popups
Collects and stores all perfume product links
Keeps everything tidy and logged
It's built to run smoothly, prevent duplicates, and handle errors gracefully—making it a trustworthy tool for bulk data collection from an e-commerce website like Boutiqaat.
Starting the Scraper: Organizing Perfume Data Collection from Boutiqaat
# Main function
def main():
    """
    Main entry point of the scraper.

    This function performs the following tasks:
    1. Calls `create_db()` to initialize or ensure the SQLite database is set up correctly.
    2. Defines a dictionary of fragrance category URLs and their corresponding labels:
       - "Niche Fragrances"
       - "International Fragrances"
       - "Arabic Fragrances"
    3. Iterates over each URL-category pair:
       - Logs the start of the scraping process for each category.
       - Calls `fetch_page_content(url, category)` to scrape and process product data from that category.

    This setup helps organize and automate scraping for multiple product categories on the Boutiqaat website.
    """
    create_db()
    urls = {
        "https://www.boutiqaat.com/en-kw/women/fragrances/niche-perfumes-1/l/": "Niche Fragrances",
        "https://www.boutiqaat.com/en-kw/women/fragrances-1/c/": "International Fragrances",
        "https://www.boutiqaat.com/en-kw/women/arabic-fragrances-1/c/": "Arabic Fragrances"
    }
    for url, category in urls.items():
        logging.info(f"Starting category: {category}")
        fetch_page_content(url, category)
The main() function is the starting point and control center of the web scraping script. It is where the whole data collection process begins and gets coordinated in a structured manner.
Step-by-Step Breakdown
1. Creating a Database
This function first calls create_db(). This guarantees that the perfume database you intend to create is prepared. A database is created if one doesn't already exist. If it is present, it guarantees that everything is set up properly. This keeps your data organized and accessible in the future.
2. Fragrance Categories & Their URLs
Then, a dictionary called urls is created. This dictionary contains the top categories of women's perfumes you wish to scrape from the Boutiqaat website. It contains:
Niche Fragrances (more exclusive or luxury fragrances)
International Fragrances (common international brands)
Arabic Fragrances (perfumes in traditional Arabian styles)
Each category has a particular web address (URL) that leads to the products under it.
3. Looping Through Each Category
Then, the method goes through every category individually. For each category of perfume:
A message such as "Starting category: Arabic Fragrances" is logged using the logging tool, which is useful for following progress or debugging if something goes wrong.
It then calls a function called fetch_page_content(url, category). This function is responsible for visiting the webpage of the category, loading all the products, and fetching the data.
Entry Point to Execute the Script
if __name__ == "__main__":
    main()

"""
Entry point when run as script:
- Executes main scraping function
- Handles any top-level errors
"""
Here, main() is the central function that initiates the whole process of collecting perfume links from the Boutiqaat site. When you execute this script, e.g., by running it through the terminal, this line ensures that main() starts and everything is set in motion: visiting each category page, scrolling through the listings, and writing the product links into the database.
If this file were imported elsewhere (e.g. in some other program), Python would not automatically execute main(). This is helpful when your script may be reused in a larger project.
STEP 2: Extracting Complete Product Information from Each Link
Importing Libraries
import sqlite3
import logging
import json
from pymongo import MongoClient, errors
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright
from urllib.parse import urlparse, urlunparse
This set of libraries works together to make scraping Boutiqaat's women's section straightforward and reliable. playwright.sync_api automates the browser to load product pages just as a person would, clicking, scrolling, and waiting for content to render, which matters most for pages that depend heavily on JavaScript. BeautifulSoup parses the rendered page to pull out clean data such as names, prices, and descriptions of each scent. sqlite3 maintains a local, file-based database of product URLs and whether they have been processed, so the same product is never scraped twice. pymongo stores the results in MongoDB, a flexible NoSQL database well suited to large and loosely structured datasets. logging keeps a record of everything happening in the background, allowing developers to see where problems occur and how far the scraper has progressed. Lastly, urllib.parse cleans up and normalizes URLs before they are used. Together, these form a simple yet powerful toolkit for extracting and cleaning product data.
Key Configuration Setup for Structured Data Collection
# Constants
BASE_URL = "https://www.boutiqaat.com"
DB_PATH = "/home/anusha/Desktop/DATAHUT/Boutiqaat/boutiqaat_full_urls.db"
BACKUP_FILE = "boutiqaat_data_backup_.jsonl"
MONGO_URI = "mongodb://localhost:27017/"
MONGO_DB = "boutiqaat"
MONGO_COLLECTION = "products"

"""
Application configuration settings:
- BASE_URL: The root URL of the website being scraped
- DB_PATH: Location of SQLite database file
- MONGO_URI: MongoDB connection string
- MONGO_DB: MongoDB database name
- MONGO_COLLECTION: The collection name ("products") where all scraped items will be saved
- BACKUP_FILE: Backup location for MongoDB data
"""
This section of the code defines the underlying settings for where and how the data scraped from the Boutiqaat women's perfume categories will be stored and processed. These constants act like a map, pointing to the tools and files the scraper will use along the way.
BASE_URL specifies the root website address the scraper works from, in this case Boutiqaat. DB_PATH tells the script where to find the SQLite database that holds the gathered product links, a light database stored in a local file.
BACKUP_FILE defines where a copy of the extracted data will be saved in a .jsonl format—this acts as a safety net in case anything goes wrong. For long-term storage and scalability, data is also saved in a MongoDB collection, which is set up using MONGO_URI, MONGO_DB, and MONGO_COLLECTION. These settings help the scraper stay organized, efficient, and fail-safe by ensuring data is always backed up and easily accessible for further analysis or review.
Tracking Progress with Logging for Reliable Scraping
# Setup Logging
logging.basicConfig(
    filename="boutiqaat_data_scraper.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

"""
Configures the logging system for the Boutiqaat data scraper.

This setup configures logging to output messages to a file named `boutiqaat_data_scraper.log`.
The log messages will include:
- The timestamp when the log entry was created.
- The log level (e.g., INFO, WARNING, ERROR).
- The actual log message.

This is essential for tracking the script's progress, debugging issues, and auditing scraping activities.

Parameters:
-----------
- filename (str): The name of the log file where log entries are saved. In this case, `boutiqaat_data_scraper.log`.
- level (int): The logging level that determines which log messages are captured.
  `logging.INFO` captures informational messages and above (INFO, WARNING, ERROR, CRITICAL).
- format (str): The format for log messages. It includes the timestamp (`asctime`),
  log level (`levelname`), and the log message itself (`message`).

Logging Levels:
---------------
- `INFO`: Captures general information about the scraping process (e.g., progress updates, data extraction successes).
- `WARNING`: Used for warnings (e.g., when a product is skipped due to missing data).
- `ERROR`: Used for error messages (e.g., when an exception occurs during scraping or data extraction).
"""
This section of the script sets up a logging system to track everything that happens when scraping. Instead of flooding your console with messages, logs are neatly written to a file called boutiqaat_data_scraper.log. This log is akin to a diary for your scraper—it keeps track of important actions, successes, warnings, and even errors that occur when scraping data from the Boutiqaat site.
The logging format records exactly when something happened, the level of the message (e.g., INFO for regular updates or ERROR when something breaks), and a simple description of the event. Setting the logging level to INFO means all the significant updates, warnings, and errors are recorded without drowning in technical chatter. This simplifies debugging and also gives a clean history of the scraping session, which is particularly helpful when dealing with large amounts of data or running automated scrapers for extended periods. It's a tiny setup with a huge effect on having control and visibility over your scraping process.
Seamless Data Storage with MongoDB Integration
# MongoDB connection
# Creates a connection to the MongoDB database for storing scraped product data
mongo_client = MongoClient(MONGO_URI)                         # Connect to MongoDB server
mongo_collection = mongo_client[MONGO_DB][MONGO_COLLECTION]   # Access specific collection
This section of the script establishes a connection to MongoDB, a flexible NoSQL database used to store all the product information scraped from the Boutiqaat website. While conventional databases store data in rigid tables, MongoDB stores documents in a JSON-like format, which is ideal for web data that often varies in structure. The script connects to the MongoDB server with MongoClient, using the previously defined MONGO_URI, and then accesses the specific database and collection where the product information will be stored. In this way, all the gathered perfume details, such as brand names, prices, and descriptions, are kept in an organized and accessible format. MongoDB makes it easier to handle large amounts of data effectively, especially modern web content that does not follow a fixed schema.
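One detail worth noting: MongoDB only raises DuplicateKeyError for a repeated url if a unique index exists on that field (by default only _id is unique). The scripts shown here do not include that step, so the snippet below is a hypothetical one-time addition you could run after connecting, reusing the same connection string, database, and collection names as the constants above.
from pymongo import MongoClient, ASCENDING

mongo_client = MongoClient("mongodb://localhost:27017/")   # same URI as MONGO_URI above
products = mongo_client["boutiqaat"]["products"]           # same DB/collection as in the constants

# Create a unique index on "url" so inserting the same product twice raises
# DuplicateKeyError instead of silently storing a duplicate document.
products.create_index([("url", ASCENDING)], unique=True)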
Cleaning Up URLs for Consistent Data Storage
def clean_url(url):
    """
    Cleans the input URL by removing duplicate segments in the URL path.

    This is helpful in ensuring uniformity and avoiding multiple entries for the same
    product due to slight differences in the URL structure.

    Args:
        url (str): The original product page URL.

    Returns:
        str: A sanitized URL with duplicate segments removed.
    """
    parsed = urlparse(url)
    path_parts = []
    seen = set()
    for part in parsed.path.split("/"):
        if part and part not in seen:
            path_parts.append(part)
            seen.add(part)
    cleaned_path = "/" + "/".join(path_parts)
    return urlunparse((parsed.scheme, parsed.netloc, cleaned_path, '', '', ''))
This section of the script is intended to normalize product URLs using a function called clean_url. When scraping product pages in Boutiqaat, the same product might be listed with slightly varying URLs due to redundant or unnecessary elements in the web address. These small differences can cause duplicate records in the database if not cleaned, i.e., dirty or inconsistent data.
To avoid this, the clean_url function uses Python's urlparse and urlunparse functions to split the URL into its components. It then walks through the path portion of the URL, removing any duplicate segments while keeping their order. The result is a clean, streamlined version of the original URL, so every product ends up with only one unique, consistent link in the database. It is a small step, but it makes a big difference to the quality and integrity of the scraped data, particularly when you are collecting hundreds or thousands of product links.
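As a quick illustration, here is roughly what the function does to a hypothetical URL with a repeated path segment; the example URL is made up for demonstration and is not taken from the site.
# Hypothetical example of clean_url in action; the URL below is illustrative only.
raw = "https://www.boutiqaat.com/en-kw/women/women/some-perfume/p/"
print(clean_url(raw))
# -> https://www.boutiqaat.com/en-kw/women/some-perfume/p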
Setting Up the SQLite Database to Manage Product URLs
def setup_database():
    """
    Initializes and sets up the SQLite database to store product URLs and their processing status.

    - Creates a `links` table if it doesn't already exist.
    - The table contains:
        - id: Primary key
        - url: The product URL (must be unique)
        - category: The category to which the product belongs
        - processed: An integer flag (0 or 1) to track if data has been scraped
    - Adds the `processed` column if it's missing.

    This database helps in managing progress across multiple runs.
    """
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS links (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE,
            category TEXT,
            processed INTEGER DEFAULT 0
        )
    """)
    try:
        cursor.execute("ALTER TABLE links ADD COLUMN processed INTEGER DEFAULT 0")
    except sqlite3.OperationalError:
        pass
    conn.commit()
    conn.close()
The function setup_database is very crucial in data organization and maintenance of the scrape process. When we scrape product pages, we would need something that will be able to remember the URLs we have already processed and save other information like the category of the product. This function precisely does this by initializing a database based on SQLite, a minimalist and lightweight database system.
This is the way the function operates: it opens the SQLite database at the path specified by DB_PATH. Once connected, the script creates a table named links if it does not already exist. This is the table we use to store the URLs of the products we scrape, along with other data such as the product's category and a flag called processed. The processed flag starts at 0, indicating the product has not yet been scraped. After creating the table, the function attempts to add the processed column in case the table was created earlier without it, ignoring the error if the column is already there, so the schema stays up to date.
The purpose of using a database like SQLite is to keep track of what URLs are yet to be scraped and what URLs have been scraped. It makes the scraping more efficient, particularly when you need to execute the script several times or when you are scraping a huge number of product pages. The function ensures that there will be no data loss and you can track the progress of your scraping by keeping product URLs stored and organized.
This setup is done for the ease of suspending and resuming the scraping operation at will without losing track of which URLs need to be scraped or re-scraped again. It makes the process both efficient and correct throughout the entire web scraping procedure.
Fetching Unprocessed Product Links for Scraping
def fetch_unprocessed_links():
    """
    Fetches all product URLs from the database that have not yet been processed.

    Returns:
        list of tuples: Each tuple contains:
            - id (int): ID of the database row
            - url (str): Product page URL
            - category (str): Product category
    """
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    cursor.execute("SELECT id, url, category FROM links WHERE processed = 0")
    results = cursor.fetchall()
    conn.close()
    return results
The fetch_unprocessed_links function loads all product URLs that have not yet been processed. This matters in a scraping job that spans multiple runs or sessions, because it means the script only works on new, unvisited URLs instead of repeating ones it has already handled.
It works as follows: the function first connects to the SQLite database in which all the product URLs are stored. It then runs a SQL query that retrieves every URL whose processed flag is still 0, meaning it has not been scraped yet. The query also returns the row's id (to keep track of the record) and the category the product belongs to, which lets the scraper organize its work by product type.
Once the data is fetched, the function stores it as a list of tuples containing id, url, and the category of the item. This list of unprocessed links is then passed on to the script, which can now use the data to scrape the corresponding product pages for further information.
The importance of this function is that it allows the scraper to pick up where it stopped. If the scraping process was cut short, or is simply being resumed, previously scraped URLs are not scraped again, conserving time and resources. In short, it keeps the process manageable by operating only on the URLs that still need attention, making the whole web scraping workflow more structured and effective.
Marking Product Links as Processed to Avoid Re-Scraping
def mark_as_processed(link_id):
    """
    Marks a product link as processed in the SQLite database, so it won't be scraped again.

    Args:
        link_id (int): The ID of the link to be marked as processed.
    """
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    cursor.execute("UPDATE links SET processed = 1 WHERE id = ?", (link_id,))
    conn.commit()
    conn.close()
The mark_as_processed function ensures that each product link is scraped only once. After a link has been scraped and its data saved, the function records that progress by marking the link as processed in the database.
This is how it works: when you call the function, it takes a single link_id as an argument. This link_id is the individual product link just processed. The function establishes a connection to the SQLite database where all the product URLs are located. It then employs an SQL query to mark the processed field of the specific link_id as 1, indicating that the link has been processed. This mark informs the scraper that this link has been scraped and will not be scraped again.
Once the update is complete, the modifications are committed to the database, and the connection is closed. It is a tidy but effective mechanism that maintains the scraper clean and only unprocessed links are scraped when the scraper is executed the next session. With this mechanism, the script can maintain progress, avoid re-scraping already processed data, and enhance performance overall, especially with lengthy scraping sessions or when the scraper is executed repeatedly.
Scraping Product Pages Using Playwright for JavaScript Rendering
def scrape_html(url):
    """
    Uses Playwright to render a JavaScript-powered product page and retrieve its HTML content.

    Args:
        url (str): The product page URL to load.

    Returns:
        str or None: The full HTML content of the page if successful, otherwise None.

    Logs:
        Errors in loading or rendering the page are logged for later review.
    """
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, timeout=60000)
            page.wait_for_selector("h1.product-name-h1", timeout=10000)
            html = page.content()
            browser.close()
            return html
    except Exception as e:
        logging.error(f"Failed to load page {url}: {e}")
        return None
The scrape_html function is used for product pages that load their content via JavaScript. Simple HTTP-based scraping would miss such content, because the information we need is not present in the initial HTML source. Playwright, a browser automation library, bridges this gap by simulating a real user's browsing session and fully rendering JavaScript-heavy pages.
The process starts by launching the Playwright browser in headless mode, i.e., without displaying a browser window. The product page URL is passed in, and Playwright loads the page. To make sure the page is ready to scrape, it waits until a specific element, the product name (h1.product-name-h1), appears on the page. This confirms that rendering has finished and all the content of interest has loaded.
Once the page is loaded completely, the function retrieves the page's HTML by using page.content(). This is the raw page structure with all the product information that would be parsed subsequently. The browser instance is then closed, freeing up resources.
If any error occurs, such as a timeout or a page load failure, the function catches the exception, logs an error message recording which URL caused it, and returns None, which makes later debugging easier.
This approach is effective when sites load information such as product descriptions, images, and prices dynamically with JavaScript, making it a dependable way to scrape data from sophisticated online shopping sites.
Extracting Structured Product Data from HTML
def extract_product_data(html, url, category):
    """
    Extracts structured product details from the HTML of a Boutiqaat product page.

    It parses the HTML using BeautifulSoup and safely retrieves:
    - Product name
    - Brand name
    - Current price
    - Old price (if available)
    - Discount percentage
    - Review count
    - Full product description
    - Specifications
    - Availability status (in stock / notify me)

    Args:
        html (str): The rendered HTML content of the product page.
        url (str): The original URL of the product page.
        category (str): The product's category (used for tagging).

    Returns:
        dict: A dictionary with all extracted and cleaned product information.
    """
    soup = BeautifulSoup(html, 'html.parser')

    def get_price_with_kwd(selector):
        tag = soup.select_one(selector)
        return tag.get_text(strip=True) if tag else "N/A"

    def safe_select_text(selector):
        tag = soup.select_one(selector)
        return tag.get_text(strip=True) if tag else "N/A"

    def extract_description():
        description_tag = soup.select_one("div.content-color")
        if description_tag:
            paragraphs = description_tag.find_all("p")
            return "\n".join(p.get_text(separator=" ", strip=True) for p in paragraphs if p.get_text(strip=True))
        return "N/A"

    availability_raw = safe_select_text("div.pro-details-add-to-cart a")
    if "Buy Now" in availability_raw:
        availability = "yes"
    elif "Notify Me" in availability_raw:
        availability = "no"
    else:
        availability = "unknown"

    return {
        "url": url,
        "category": category,
        "product_name": safe_select_text("h1.product-name-h1"),
        "brand_name": safe_select_text("a.brand-title strong"),
        "price": get_price_with_kwd("div.pro-details-price.discount span.new-price"),
        "old_price": get_price_with_kwd("div.pro-details-price.discount span.old-price"),
        "discount_percentage": safe_select_text("div.pro-details-price.discount span.discount-price"),
        "review_count": safe_select_text("div.product-review-order span"),
        "description": extract_description(),
        "specifications": safe_select_text("li.heading-tag-sku-h1 span.attr-level-val"),
        "availability": availability
    }
The extract_product_data function is built to extract key details about a product from the HTML content of a product page on Boutiqaat. Once the page is loaded and its HTML content is retrieved, this function takes over to parse and organize the information in a structured format.
The function begins by using BeautifulSoup, a Python library designed to parse HTML. It safely navigates the HTML content, looking for specific elements that contain important product details such as the product's name, brand, price, description, availability, and more.
To extract the prices, the function looks for elements containing the current price, old price (if available), and discount percentage. It uses a helper function called get_price_with_kwd, which retrieves the price text from the HTML and handles cases where the price might not be available, returning "N/A" in such cases.
Next, the safe_select_text helper function is used to safely extract text from various HTML elements, such as the product name, brand name, and review count. If a certain element is not found, the function ensures that it doesn't break the program by returning a default value ("N/A").
The product description is handled separately. The function looks for a specific section of the page that holds the detailed description. If found, it extracts all paragraphs and joins them together into a single text block, ensuring the description is properly formatted.
The function also checks the product's availability by looking for indicators like "Buy Now" or "Notify Me" in the HTML. This helps determine if the product is in stock or out of stock, providing a simple "yes," "no," or "unknown" status.
Finally, the function returns all this collected data as a dictionary, which includes:
Product name
Brand name
Current price
Old price (if available)
Discount percentage
Review count
Full product description
Specifications
Availability status
By organizing the extracted data into a dictionary, this function ensures that all product details are cleanly structured, making it easier to analyze, store, and later use in any application or database. This approach is crucial for handling large amounts of product data efficiently, especially in e-commerce scraping.
Storing Data Efficiently with JSON Lines (JSONL)
def append_to_jsonl(data, file_path):
    """
    Appends a single dictionary entry to a JSON Lines (JSONL) file.

    JSONL format stores one JSON object per line and is efficient for streaming
    or incremental backups.

    Args:
        data (dict): The data dictionary to append.
        file_path (str): The full path to the backup file.

    Logs:
        - Success messages when data is appended.
        - Errors if the file cannot be written to.
    """
    try:
        with open(file_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(data, ensure_ascii=False) + "\n")
        logging.info("✅ Backup entry written to JSONL.")
    except Exception as e:
        logging.error(f"❌ Failed to write to JSONL backup: {e}")
The append_to_jsonl function is responsible for saving the scraped product data to a local file in a reliable and organized manner. Instead of saving all data at once or using a bulky format, it takes a more efficient route—appending one entry at a time to a JSON Lines (JSONL) file.
JSONL is a simple and powerful file format where each line is a standalone JSON object. This makes it perfect for logging and storing large amounts of structured data incrementally. For instance, if your scraper collects thousands of product entries, you don’t have to keep all of them in memory or risk losing everything if the script crashes. Instead, each product's data is written line-by-line, one at a time.
The function receives two inputs:
data: A Python dictionary containing the product's details (like name, price, availability, etc.).
file_path: The full path where the JSONL file will be saved or updated.
Inside the function, the file is opened in append mode within a with block, which ensures it is closed properly after writing. The json.dumps() method converts the dictionary into a JSON-formatted string, and ensure_ascii=False ensures that non-English characters (like Arabic or accented letters) are preserved. Each JSON object is then written to the file followed by a newline character, making it easy to parse later.
To maintain reliability and traceability, the function logs messages throughout the process. If the entry is successfully written, a confirmation is logged. If something goes wrong (like the file being locked or missing permissions), it catches the exception and logs an error message with a clear description.
This design not only keeps the data backed up incrementally but also allows the scraper to resume seamlessly in case of an interruption—an essential feature when working with large e-commerce websites like Boutiqaat. JSONL files are also compatible with many data processing tools, making them a great choice for long-term data storage and analysis.
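Because the backup is plain JSON Lines, it is easy to load back for inspection or analysis. Here is a minimal sketch, assuming the backup file name defined in the constants above:
import json

# Read the JSONL backup back into a list of dictionaries for inspection or analysis.
products = []
with open("boutiqaat_data_backup_.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:                          # skip any blank lines
            products.append(json.loads(line))

print(f"{len(products)} products restored from backup")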
The Heart of the Scraper: Coordinating the Workflow with main()
def main():
    """
    Main entry point for running the Boutiqaat product scraper.

    This function coordinates the entire scraping process, from setting up the
    database to processing product links and saving the scraped data to MongoDB
    and a local JSONL file.

    Workflow:
    ---------
    1. Database Setup:
       Initializes the SQLite database by setting up the required table and
       columns if they do not already exist.

    2. Fetch Unprocessed Links:
       Retrieves all unprocessed product links from the SQLite database. This
       ensures that only new products are scraped.

    3. Scraping Loop:
       For each unprocessed product link:
       - Cleans the URL to standardize it and remove duplicates.
       - Loads the product page using Playwright, waits for the page to fully
         render, and retrieves the HTML content.
       - Extracts product details (such as name, price, brand, description,
         availability) using BeautifulSoup.
       - Saves the extracted data to a local JSONL backup file.
       - Attempts to insert the data into a MongoDB collection for persistence.
       - Marks the URL as processed in the SQLite database to avoid scraping
         the same link again in the future.

    4. Logging:
       Throughout the process, logs are generated to keep track of progress,
       successes, and errors:
       - Successes are logged when a product is processed and saved successfully.
       - Warnings are logged for duplicates in MongoDB or missing HTML content.
       - Errors are logged for any failures during scraping or data extraction.

    The function ensures that the scraper runs smoothly, continues processing
    links, and avoids duplicate scraping. It is executed when the script is run
    directly and serves as the main driver for the entire scraping workflow.

    Error Handling:
    ---------------
    - If the page cannot be loaded (e.g., due to timeouts or missing elements),
      a warning is logged and the link is skipped.
    - If data extraction fails, the error is logged and the scraper moves on to
      the next link.
    - Duplicate entries in MongoDB are detected, and a warning is logged without
      attempting to insert them again.
    """
    setup_database()
    unprocessed_links = fetch_unprocessed_links()
    logging.info(f"Found {len(unprocessed_links)} unprocessed product links.")

    for link_id, url, category in unprocessed_links:
        logging.info(f"Processing: {url}")
        url = clean_url(url)
        html = scrape_html(url)

        if not html:
            logging.warning(f"Skipping (no HTML): {url}")
            continue

        try:
            extracted = extract_product_data(html, url, category)

            # Save to JSONL backup (no id)
            append_to_jsonl(extracted, BACKUP_FILE)

            # Insert into MongoDB (no id)
            try:
                mongo_collection.insert_one(extracted)
            except errors.DuplicateKeyError:
                logging.warning(f"Duplicate MongoDB entry for URL: {url}, skipping insert.")

            mark_as_processed(link_id)
            logging.info(f"Success: {url}")
        except Exception as e:
            logging.error(f"Error processing {url}: {e}")
            continue
At the center of this entire scraping project lies the main() function. This function acts as the command center—responsible for coordinating every moving part of the scraper, from setting up the database to saving product data into both local files and a cloud database. Let’s break down what it does in an intuitive, step-by-step way.
1. Setting Up the Environment
The first step is a call to setup_database(), which prepares the SQLite database if it is not ready yet. On the very first run, the required links table is created; on later runs, the existing table is simply reused. Because already-processed links are tracked here, the scraper can work through hundreds or thousands of product pages without repeating itself.
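For readers who want to picture it, here is a minimal sketch of such a setup function. The table schema and file name are assumptions for illustration; the actual script may differ:

import sqlite3

DB_FILE = "boutiqaat_links.db"   # hypothetical database file name

def setup_database():
    """Create the links table on first run; leave it untouched if it already exists."""
    with sqlite3.connect(DB_FILE) as conn:
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS links (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT UNIQUE,
                category TEXT,
                processed INTEGER DEFAULT 0
            )
            """
        )
        conn.commit()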
2. Obtaining Product Links That Have Not Been Processed
Next, the script fetches the remaining work with fetch_unprocessed_links(). As the name suggests, this filtering ensures only new or previously missed links are scraped, avoiding unnecessary redundancy. Each record contains the link's database ID, the product URL, and its category (for example, "Arabic Fragrances" or "Niche Perfumes").
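Under the same assumed schema, fetch_unprocessed_links() could be little more than a SELECT query. This sketch is illustrative rather than the project's exact code:

import sqlite3

DB_FILE = "boutiqaat_links.db"   # same hypothetical file as in the setup sketch

def fetch_unprocessed_links():
    """Return (id, url, category) tuples for every link not yet marked as processed."""
    with sqlite3.connect(DB_FILE) as conn:
        cursor = conn.execute("SELECT id, url, category FROM links WHERE processed = 0")
        return cursor.fetchall()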
3. Looping Through Each Product Page
The main logic is embedded in a for loop that goes through each unprocessed link. This is how each part works within the loop:
URL Cleaning: Before doing anything, the URL is cleaned using clean_url() to ensure consistency and eliminate potential duplicates due to URL formatting issues.
HTML Rendering with Playwright: The script loads the complete webpage content using the scrape_html() function. Boutiqaat's pages are JavaScript-heavy, so a tool like Playwright is essential to capture the full HTML once the page has finished rendering (hedged sketches of both clean_url() and scrape_html() appear after this list).
Skip if HTML is Missing: If for some reason the page does not load or display correctly (e.g., due to slow internet or server issues), the script logs a warning and skips over such a link. This maintains efficiency and prevents unnecessary crashes.
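To give a feel for these two helpers, here is a hedged sketch of what clean_url() and scrape_html() could look like. The timeout, wait strategy, and URL-cleaning rules are assumptions for illustration, not the project's exact implementation:

import logging
from urllib.parse import urlsplit, urlunsplit
from playwright.sync_api import sync_playwright

def clean_url(url: str) -> str:
    """Drop query strings and fragments so each product maps to one canonical URL."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def scrape_html(url: str, timeout_ms: int = 60000):
    """Render the page with Playwright and return its full HTML, or None on failure."""
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms, wait_until="networkidle")
            html = page.content()
            browser.close()
            return html
    except Exception as e:
        logging.error(f"Failed to render {url}: {e}")
        return None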
4. Extracting and Storing Product Data
Once the HTML is in hand, the loop calls extract_product_data(), which pulls out the product name, price, brand, availability status, and description. This function uses BeautifulSoup to parse the HTML and extract the pertinent information.
Local Backup with JSONL: Whether or not MongoDB is reachable, append_to_jsonl() is always called to save the record to a .jsonl file. This local backup ensures that no data is lost even when there is a connectivity problem.
Storing in MongoDB: After the backup, the record is also inserted into a MongoDB collection, which serves as the main database for the structured product catalog. Each document carries a unique key, so if the product already exists the insert raises a duplicate-key error; the script catches it, logs a warning, and moves on without inserting the product again. No crashes, no clutter.
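For the duplicate-key check to fire, the MongoDB collection needs a unique index on an identifying field such as the product URL. The sketch below shows one way this could be set up with pymongo; the connection URI, database, and collection names are placeholders:

import logging
from pymongo import MongoClient, errors

# Placeholder connection details; the real URI and names will differ.
client = MongoClient("mongodb://localhost:27017")
mongo_collection = client["boutiqaat"]["perfumes"]

# A unique index on "url" makes MongoDB reject repeats with DuplicateKeyError.
mongo_collection.create_index("url", unique=True)

def save_product(doc: dict):
    """Insert a product document, logging duplicates instead of crashing."""
    try:
        mongo_collection.insert_one(doc)
    except errors.DuplicateKeyError:
        logging.warning(f"Duplicate entry skipped for {doc.get('url')}")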
5. Marking URLs as Processed
Once all the steps for a product succeed, mark_as_processed() is called to update its status in the SQLite database. This step may look minor, but it guarantees that each product is scraped only once over the lifetime of the program unless its flag is deliberately reset.
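A minimal sketch of such a status update, using the same assumed SQLite schema as the earlier sketches:

import sqlite3

DB_FILE = "boutiqaat_links.db"   # same hypothetical database file as above

def mark_as_processed(link_id: int):
    """Flag a link as processed so it is never scraped again."""
    with sqlite3.connect(DB_FILE) as conn:
        conn.execute("UPDATE links SET processed = 1 WHERE id = ?", (link_id,))
        conn.commit()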
6. Logging for Transparency
Throughout the operation, precise log messages are created. They consist of successful scrapes, skipped links, duplicate detections, and any unanticipated errors. Such live feedback is invaluable when you're debugging or watching over lengthy scraping sessions.
Entry Point to Execute the Script
# Entry point of the script
if __name__ == "__main__":
    """
    Entry point when run as script:
    - Executes main scraping function
    - Handles any top-level errors
    """
    main()
The final piece in our Boutiqaat scraper is the entry point, invoked when the script is run directly. This is handled by a dedicated Python clause: if __name__ == "__main__":
While this may seem a bit cryptic to beginners, it goes a long way toward keeping your code modular and clean. In essence, this block guarantees that the scraper only executes when the file is run independently, not when it is imported as a module into another program. Within this block, we invoke the main() function, which serves as the core of the whole scraping process. By doing so, we initiate everything in order: from database setup and product page scraping, to data saving and progress logging. This is a Python best practice because it lets all other functions, such as HTML scraping or data extraction, be reused or tested independently without triggering the complete scraping process. It's a simple yet effective way to keep your script tidy, reusable, and production-ready.
Conclusion
In a world where thousands of products are listed online, doing everything by hand just isn’t practical. That’s why we built a smart and simple tool that automatically scrolls through Boutiqaat’s perfume pages, finds every product, and saves their links for us. It works just like a human browsing the site—scrolling down, closing popups, and picking out the right links—but does it faster, without getting tired.
This automation helps us collect a full list of perfumes in one go, making it easier to study trends, compare prices, or build features like search tools and dashboards. It is the equivalent of having a digital assistant that handles the tedious work, freeing up our time and energy for analyzing the data and drawing actionable insights.
If you are new to coding and simply want to understand how online data collection works, this project serves as a clear example of effective and impactful automation.
Libraries and Versions
Name: pymongo
Version: 4.10.1
Name: playwright
Version: 1.48.0
Name: beautifulsoup4
Version: 4.13.3
FAQ Section
1. Is it legal to scrape data from Boutiqaat?
Web scraping publicly accessible data from Boutiqaat may be legally permissible for personal or research purposes. However, it’s crucial to review their Terms of Service and ensure compliance with ethical scraping practices, such as respecting robots.txt.
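As a practical starting point, Python's built-in urllib.robotparser can check whether a path is allowed before you fetch it. The path in this sketch is only an example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.boutiqaat.com/robots.txt")
rp.read()

# Replace the example path below with the page you actually intend to fetch.
allowed = rp.can_fetch("*", "https://www.boutiqaat.com/en/some-category-page")
print("Allowed to fetch:", allowed)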
2. What kind of perfume-related data can I scrape from Boutiqaat?
You can extract product names, prices, discounts, brand information, bottle sizes, customer ratings, reviews, fragrance notes, gender segmentation, and availability status to analyze perfume market trends.
3. Which tools or technologies are best for scraping Boutiqaat?
Python libraries like BeautifulSoup, Scrapy, or Selenium work well. For scalable or automated scraping, you may consider using Playwright, Puppeteer, or a no-code tool like n8n.
4. How can scraped data from Boutiqaat help in market analysis?
The data can uncover pricing trends, top-selling fragrances, consumer preferences by brand or scent type, gender-based product distribution, and identify emerging perfume brands in the GCC market.
5. What are the risks of scraping Boutiqaat manually?
Manual scraping can trigger anti-bot protections, lead to IP bans, or result in outdated insights due to slow updates. Automated tools or scraping services can reduce these risks and provide cleaner, timely data.