How to Scrape Tata CLiQ for Reliable Personal Care Product Insights?
- Anusha P O
- Jul 11
- 41 min read
Updated: Nov 5
Let's say you had a really smart assistant who could go into a website and collect everything important for you (products, prices, sales) without you ever having to copy and paste anything. That is essentially what web scraping is: a robot that works through a website and picks out the parts you care about, quickly and consistently. It is especially useful in retail, where businesses can monitor changing prices, promotions, and best-selling items and put themselves in a better position to make good decisions. For us as shoppers, it makes finding the best deals on our preferred brands much faster. When done right, web scraping does not harm the website or break any of its rules; it simply works quietly in the background.
About Tata CLiQ Fashion
Tata CLiQ Fashion isn't just a shopping website -- it's more like a trusted friend for your online shopping needs across clothing, beauty, and lifestyle products. It was launched in 2016 by the Tata group, one of India's most trusted business houses (the same group behind brands like Tanishq and Tata Motors). What makes Tata CLiQ Fashion special is its focus on authenticity: it sources products from more than 6,000 branded stores, so you know everything you purchase is genuine. Whether you are in the market for clothes, accessories, skincare, or things for your home, there is plenty of choice.
This project focuses on the "personal care" portion of their catalogue -- the lotions, face washes, and so on. The aim is to save shoppers time by showing what people are actually happy to spend their money on, and to give businesses insight into buying behaviour and pricing trends. When both sides of the transaction understand each other better, online shopping becomes smoother and more enjoyable for everyone.
Effortless Data Extraction
Our web scraping process consists of two primary steps: the URL Collection Phase and Data Extraction from Product Links.
URL Collection Phase
We began by scraping all the product URLs from the Personal Care category of the Tata CLiQ website. The site doesn't load all the products on the screen at once; it loads them as the user scrolls. So we used an open-source tool called Playwright Stealth that makes our script behave like a real user browsing the web, which keeps the website from blocking us. As the tool browsed through the category pages, it captured all of the product URLs while scrolling the screen. Then we encountered a problem: a popup asking users to subscribe kept blocking everything on the screen, so we added code to have the scraper recognize and close the popup before doing anything else. Also, instead of a "next page" button, Tata CLiQ has a "Show More Products" button that loads more products; we made sure our tool clicked that button every time it collected links so that no products would be missed.
Data Extraction from Product Links
After saving all the product links to a simple database (SQLite), the script visits each link, one by one, to gather more information, much like opening each product page in a browser to check its name, price, description, size, color, and whether it's in stock. It uses a tool called Playwright, which lets the script scroll and click like a real human so websites do not block it. Instead of saving the information to a file and uploading it, the data goes directly into another database, MongoDB, which keeps things fast and seamless. It also saves a backup as CSV files (similar to saving an Excel sheet) on the computer, just to be safe, so nothing gets lost. The process is simple, stays under the radar, and still captures all the essential information.
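For readers curious what that CSV backup might look like in practice, here is a minimal sketch, not the project's actual export script: the output file name is an assumption, while the database and collection names match the ones configured later in this article. It simply dumps the MongoDB collection to a CSV file with pandas.

import pandas as pd
import pymongo

# Connect to the same local MongoDB instance the scraper writes to.
client = pymongo.MongoClient("mongodb://localhost:27017/")
collection = client["tata_cliq_db_original"]["products"]

# Pull every stored product document and write it out as a CSV backup.
df = pd.DataFrame(list(collection.find()))
df.to_csv("products_backup.csv", index=False)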
Data cleaning
Once the data is collected, the next (and very important) step is cleaning it so that it makes sense and can actually be used. Think of a pile of messy notes: they need to be sorted and tidied before you can study from them. Raw data typically contains a great deal of mess in the form of duplicates, formatting issues, and values that don't match up. One great tool for this is OpenRefine. It is a sort of smart editing app for your data: you can identify and eliminate duplicates, standardize attributes, and fix inconsistencies, all with a few clicks. Working with OpenRefine is like organizing a messy spreadsheet under a microscope. For more complex problems, such as unwanted HTML tags left in your text or turning a column of strings into proper numbers or dates, Python and its pandas library save the day. Pandas lets you make these intricate edits, like having an intelligent assistant working in the background to sort your data. Cleaning might not seem like the most exciting step in the process, but it is essential if you want your final results to be of any value.
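As a rough illustration of the kind of pandas cleanup described above (a sketch only: the column names match the fields scraped later in this article, while the CSV file name follows the backup sketch in the previous section and the exact price format is an assumption):

import pandas as pd

df = pd.read_csv("products_backup.csv")

# Drop exact duplicate rows and rows that have no product name at all.
df = df.drop_duplicates().dropna(subset=["product_name"])

# Strip any leftover HTML tags from the description text.
df["product_description"] = df["product_description"].str.replace(r"<[^>]+>", "", regex=True)

# Turn price strings such as "₹1,299" into plain numbers for analysis.
df["price"] = (
    df["price"].astype(str)
    .str.replace(r"[^0-9.]", "", regex=True)
    .replace("", pd.NA)
    .astype("Float64")
)

df.to_csv("products_cleaned.csv", index=False)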
Powerful Tools and Libraries for Smarter Data Extraction
The code relies on several fundamental Python libraries, each playing a distinct role in making web scraping effective, effortless, and hard to detect. Below is a brief, plain-language overview of each one, so that even readers who are not well versed in the subject can follow along.
Asyncio: The first on the list is asyncio, Python's built-in library for asynchronous programming – simply put, it enables multitasking. It acts as a ‘smart traffic controller’ that keeps the various parts of your code running smoothly without bringing the whole program to a standstill while it waits for one task to be completed. Loading a webpage takes time, and you do not want the rest of the program to sit idle while that happens. By utilizing asyncio, your script can ask for many webpages to be processed at the same time, thus saving a lot of time (a minimal sketch of this idea appears just after this list of libraries).
logging : Keeping track of the flow of the program is done through logging. Picture this: you’ve left a scraper running for hours, and later discover that an error midway caused the scraper to stop. Without sufficient logs, you would have no idea what went wrong. By noting key activities, errors, and important messages, logging assists in error resolution. Instead of cluttering the console with print statements, logs are saved systematically so they can be reviewed whenever necessary.
Playwright: Most of the code works on Playwright, a high-level web automation library. Unlike bare-bones scraping libraries that do nothing more than fetch webpage text, Playwright allows true interaction with web pages—just as a human does. It will scroll, press buttons, submit forms, and even handle pop-ups. It is useful in scraping websites whose data are dynamically loaded using JavaScript. Both async_playwright and sync_playwright are present in the code, with the former allowing multiple tasks to run at the same time to save time and the latter keeping it simple by running one task at a time. Either of these can be used based on the type of scraping task.
Playwright_stealth : Many sites actively try to detect and prevent scrapers, and this is where playwright_stealth comes in handy. Sites look for bot-like behavior in how the browser and the page interact; if the interaction is unnatural or too fast, the site will deny access. playwright_stealth makes the interactions appear more human-like and hides the telltale signs that Playwright is controlling the browser, reducing the chances of being blocked.
SQLite : Then, sqlite3 offers a method of storing and handling data in a light database. SQLite is a file-based, self-contained database system that does not need an independent server, so it is a great option for small to medium projects. When you scrape data, you have to have somewhere to put it so you can analyze it later, and SQLite is there to assist you in doing that effectively. Rather than storing all the information in memory, which may make your program slow or crash, SQLite puts it in a structured format so that it can be easily retrieved when required.
Time: Another critical library in action is time, primarily utilized to introduce delays when necessary. Sites have rate limits and anti-bot functionality, so sending requests too frequently may trigger alarms. A plain time.sleep() call slows things down just enough to make scraping seem more natural and prevent security features from being triggered.
Pymongo : For storing the scraped data, pymongo is used to interact with MongoDB, a NoSQL database. SQLite saves information in tables (like an Excel spreadsheet), while MongoDB saves information in a more loosely structured, JSON-like format. That makes it perfect for handling large amounts of unstructured or semi-structured data, such as web-scraped data. When thousands of entries need to be saved without the need to care about strict table structures, MongoDB is a great choice.
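Here is the small asyncio sketch promised above. It is not part of the scraper itself, just an illustration of how asynchronous code lets several slow tasks (simulated here with asyncio.sleep) run concurrently instead of one after another:

import asyncio

async def fetch(url):
    # Stand-in for a slow network call; in the real scraper this is a page visit.
    await asyncio.sleep(1)
    return f"fetched {url}"

async def demo():
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    # All three "requests" run concurrently, so this takes about 1 second, not 3.
    results = await asyncio.gather(*(fetch(u) for u in urls))
    print(results)

asyncio.run(demo())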
STEP 1: Product URL Scraping from the Tata CLiQ Personal Care Category
Importing Libraries
import asyncio
import sqlite3
import logging
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

This configuration brings together strong tools to ensure web scraping is efficient and effective. asyncio enables multiple tasks to be performed at the same time, so data extraction is fast. sqlite3 offers a lightweight database to save scraped URLs in an orderly fashion. logging makes it possible to trace errors and execution status to ensure smoothness. async_playwright handles web interaction, including sophisticated sites that rely on JavaScript. stealth_async helps conceal the scraper from bot detection so it can run without being blocked. Together, these libraries form a solid foundation for effective data extraction, proper data handling, and addressing the common challenges faced in web scraping.
Setting Up Logging for Tracking and Debugging
# Logging setup
logging.basicConfig(
filename="Log/tatacliq_original.log",
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
)
"""
Configures the basic logging system for the application.
What this does:
- Saves all log messages to 'tatacliq_original.log' in the Log folder
- Records messages of level INFO and above (ignores DEBUG messages)
- Formats each log entry with: [Timestamp] - [Log Level] - [Message]
Note:
- The log file will be created automatically if it doesn't exist
- Existing log file will be appended to (not overwritten)
- Useful levels: DEBUG (detailed), INFO (normal), WARNING (problems), ERROR (failures)
"""This script is equipped with a logging system for monitoring the scraping process and it is simple to monitor, identify issues, and debug. It stores all log messages in a file "tatacliq_original.log" in the Log directory. The level of logging is INFO, and hence only required messages (i.e., status of script, warnings, errors) will be logged—no information at the debug level is logged to keep the log brief. Each log message follows a standard format: it has a timestamp (when the event happened), the log level (INFO, WARNING, ERROR), and a message (what happened). This makes all important events during scraping well-documented, a clean record of execution left behind. The log file is updated in real-time, with new entries being added instead of overwriting the old ones. This makes it possible to track script performance over time, diagnose failure, and seamless data extraction.
Setting Up the Database: Creating a Reliable Storage for Product URLs
# Database setup
def setup_database():
"""
Sets up and prepares the SQLite database for storing product URLs.
What this function does:
1. Creates a connection to 'Tata_cliq_original.db' in the Data folder
2. Creates a table named 'product_urls' if it doesn't already exist
3. The table has two columns:
- id: Auto-numbered unique identifier (automatically increases)
- url: Web address of the product (must be unique, no duplicates allowed)
4. Saves (commits) these changes
5. Returns the active database connection
"""
conn = sqlite3.connect('Data/Tata_cliq_original.db')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS product_urls (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT NOT NULL UNIQUE
)
''')
"""
- Creates the database file if it doesn't exist
- Won't overwrite existing data if table already exists
- 'UNIQUE' ensures no duplicate URLs can be stored
- Remember to close connection when done
"""
conn.commit()
return conn

The function setup_database() establishes the database for storing product URLs scraped from the Tata CLiQ Fashion site. It begins by creating a connection to an SQLite database file, Tata_cliq_original.db, located in the Data directory; SQLite automatically creates the file if it does not already exist. Inside this database, a product_urls table is created (if it has not been created already). The table contains two significant columns: id, a unique identifier that auto-increments for every new entry, and url, which holds product web addresses, with the UNIQUE constraint guaranteeing that no duplicate links are stored. After setting up the structure, the function commits (saves) these changes and returns the active database connection. This provides a structured way to hold product links without redundancy and makes the scraped data easy to track and retrieve later. It is advisable to close the database connection after use to release system resources and avoid complications.
Saving Product URLs Without Duplicates
# Save URL to database
def save_url_to_db(conn, url):
"""
Saves a product URL to the database, preventing duplicates.
What this does:
- Takes a database connection and a URL
- Tries to add the URL to the 'product_urls' table
- If URL already exists, logs a warning and skips it
- If URL is new, saves it permanently to the database
Parameters:
- conn: An open database connection (from setup_database())
- url: The product URL to save (as a string)
What happens if URL exists:
- Logs: "Duplicate URL skipped: [url]"
- Doesn't crash or stop the program
- Just moves on without adding duplicate
"""
cursor = conn.cursor()
try:
cursor.execute('INSERT INTO product_urls (url) VALUES (?)', (url,))
conn.commit()
except sqlite3.IntegrityError:
logging.warning(f"Duplicate URL skipped: {url}")With the method save_url_to_db, product URLs can be added to the database without violating uniqueness constraints and resulting in corruption of data due to repetitive entries. Two arguments are passed to the method which are the connection to the database (conn) and the product URL (url) and its attempts to insert the URL into the "product_urls" within the database. If the URL does exist within the database, the program does not crash, but rather captures a warning “Duplicate URL skipped: [url]”, enabling it to continue executing smoothly. The function begins asynchronously creating a cursor for the database in order to run SQL commands. Next, he attempts to insert the new URL into the database with an SQL command INSERT. Upon successful capture of the new URL, it would be stored permanently in the database upon committing the transaction. If the captured URL does exist, then an sqlite3. IntegrityError is raised because the database already has the URL stored and unique entries are restricted. The absence of the system crashing is made possible via exception handling of this error, with a log statement created for future examination. As a result the rest of the database is unaffected which makes it clean and effective making data storing reliable.
Extracting Product URLs from Tata Cliq Efficiently
# Scrape URLs from the website
async def scrape_tatacliq():
"""
Main function to scrape product URLs from Tata CLiQ website.
What this function does:
1. Sets up database connection to store URLs
2. Opens a hidden browser (WebKit) to visit the website
3. Uses stealth techniques to avoid being blocked
4. Handles popups that might interfere with scraping
5. Collects all product page URLs while scrolling
6. Saves unique URLs to database
7. Shows progress in logs
Step-by-Step Process:
- Starts browser in visible mode (headless=False for debugging)
- Goes to personal care products page
- Tries to close any pop ups within 5 seconds
- Scrolls down to load more products
- Finds all product links on page
- Saves new URLs and counts them
Special Features:
- Remembers already seen URLs to avoid duplicates
- Logs every 10 URLs saved
- Uses 'network idle' wait to ensure page fully loads
- Gracefully handles popup errors without crashing
Note:
- Runs asynchronously (needs 'await' when called)
- Requires Playwright and stealth plugins
- Creates 'tatacliq_original.log' for progress tracking
- Needs internet connection to work
"""
conn = setup_database()
base_url = "https://www.tatacliq.com/personal-care/c-msh1236?&icid2=nav:regu:audnav:m1236:mulb:best:08:R3:clp:bx:010"
async with async_playwright() as p:
# Launch browser in non-headless mode for debugging
browser = await p.webkit.launch(headless=False)
page = await browser.new_page()
# Apply stealth techniques to mimic human behavior and avoid detection
await stealth_async(page)
# Navigate to the base URL
logging.info(f"Navigating to {base_url}")
await page.goto(base_url)
await page.wait_for_load_state('networkidle')
# Handle popup overlay
try:
await page.wait_for_selector('#wzrk-cancel', timeout=5000)
close_button = await page.query_selector('#wzrk-cancel')
if close_button:
is_disabled = await close_button.is_disabled()
if is_disabled:
await page.evaluate("document.querySelector('#wzrk-cancel').removeAttribute('disabled')")
await close_button.click()
logging.info("Closed subscription popup.")
except Exception as e:
logging.warning(f"Popup handling failed: {e}")
unique_urls = set()
url_counter = 0
# Loop to load all products
while True:
# Scroll to bottom to trigger lazy loading
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(3)
# Extract all product links on current page
product_links = await page.query_selector_all('a.ProductModule__aTag')
new_urls_found = False
for link in product_links:
product_url = await link.get_attribute('href')
if product_url:
# Ensure proper URL formatting
full_url = product_url if product_url.startswith("http") else f"https://www.tatacliq.com{product_url}"
if full_url not in unique_urls:
save_url_to_db(conn, full_url)
unique_urls.add(full_url)
url_counter += 1
new_urls_found = True
if url_counter % 10 == 0:
logging.info(f"Saved {url_counter} URLs.")
# Exit condition: no new URLs found in current iteration
if not new_urls_found:
logging.info("No new URLs found. Scraping complete.")
break
# Try clicking "Show More Products"
show_more_button = await page.query_selector('text=Show More Products')
if not show_more_button:
logging.info("No more products to load.")
break
try:
"""
Attempts to load more products by:
1. Finding and scrolling to the 'Show More product' button
2. Clicking it to load additional items
3. Waiting 3 seconds for content to load
Error Handling:
- If button missing: Stops with log message
- If click fails: Logs error and stops
Behavior:
- Scrolls button into view first
- Adds brief delay after click
- Breaks loop on any failure
"""
await show_more_button.scroll_into_view_if_needed()
await show_more_button.click()
logging.info("Clicked 'Show More Products'.")
await asyncio.sleep(3)
except Exception as e:
logging.error(f"Error clicking 'Show More Products': {e}")
break
logging.info(f"Scraping complete. Total unique URLs saved: {url_counter}")
await browser.close()
conn.close()
The scrape_tatacliq function is the heart of the web scraping process for Tata CLiQ’s personal care products section. It automates visiting the website, gathering product page URLs, and saving them in the database, and it is built to do so efficiently and reliably.
What this function does:
Establish Database Connection – The first step in scraping requires the function to connect with an SQLite database where the extracted URLs will be stored.
Open Browser – The next step is to navigate to the Tata CLiQ website which requires opening a WebKit based browser that is not in headless mode (this means it is visible for debugging purposes).
Stealth Mode – The next step is to perform some human simulation movements so that the chances of getting blocked while scraping the site are reduced.
Popup Handling – Subscription popups that appear on entering the site are closed within 5 seconds so scraping is not interrupted.
Dynamic Product Scroll Down – To enhance user experience, Tata CLiQ loads products as the user scrolls down the page, so the function will automatically scroll down to load more products.
Capture Product Hyperlinks – The function locates product links on the page, making sure each link is properly formatted and unique before capturing it.
Skip Already Captured URLs – A URL is skipped if it has already been captured, which keeps the collection unique.
Logs Progress – The scraping process is tracked in a log file, with an entry for every 10 URLs saved.
Clicks “Show More Products” Button – The function searches for and clicks the “Show More Products” button, if it exists, and waits for the new items to load before continuing.
Gracefully Handles Errors – If an error occurs or the button is missing, the function logs the issue and stops without crashing.
Step-by-Step Execution.
The function navigates to the Tata CLiQ personal care products page using the appropriate URL.
Waits for the page to load fully and attempts to close any popup that may show up.
Enters a loop of scrolling down the page, extracting product URLs, and storing them in the database.
Continues the loop as long as new URLs are being found, and stops once none are detected.
Clicks the “Show More Products” button, if it exists, to load more products.
Logs the total number of unique URLs obtained, shuts the browser down, and closes the database connection when finished.
Because of this structure, the scraper can run without being monitored the whole time. Each step accounts for issues like popups, lazy loading, duplicates, and restrictions from the website, which makes scraping the site both easy and fast.
Kicking Off the Scraping Process: Running the Main Function
# Main function
async def main():
"""
This is the starting point of the web scraping script.
What this function does:
- Logs the message "Starting scraping with WebKit..." to indicate the process has begun.
- Calls the scrape_tatacliq() function to start scraping product URLs.
- Uses 'await' to ensure the scraping process completes before the script moves forward.
Why this is important:
- Helps track when the scraping starts in the log file.
- Ensures that the main script runs in an organized and structured manner.
- Uses asynchronous execution for efficient handling of web scraping tasks.
"""
logging.info("Starting scraping with WebKit...")
await scrape_tatacliq()

The main() function serves as the official starting point of the Tata CLiQ web scraping operation; in other words, it is akin to hitting the “Start” button on the scraper. When main() is called, it logs the message “Starting scraping with WebKit…” to the log file, which helps anyone tracking the process know exactly when scraping started. After writing this log message, the function calls scrape_tatacliq() with the await keyword, which means the script suspends execution until that function completes its work, so no further logic runs until the scraping task is done. This ensures that all the product URLs are collected and saved properly without rushing or skipping any steps.
The function is asynchronous, which is especially helpful for time-intensive work such as web scraping. Asynchronous execution improves efficiency by multitasking instead of getting stuck waiting on a single web page or simulated action. While main() is short, it serves the critical purpose of controlling the whole scraping run: it coordinates the start, waits until the scraping is actually done, and keeps everything orderly and systematic. This design pattern not only helps performance but also improves the readability and maintainability of the scraping script.
Executing the Web Scraper Seamlessly
# Run the scraper
asyncio.run(main())

The line asyncio.run(main()) initiates and manages the entire web scraping workflow. Because main() is asynchronous, it cannot be called like a traditional function; asyncio.run() starts the asynchronous event loop and manages it so that all asynchronous tasks, such as loading web pages, scrolling, and collecting product URLs, run to completion without interruption. This command is the final piece that puts the whole web scraping process into action, making sure the script runs smoothly from start to finish while handling multiple actions concurrently.
STEP 2: Scraping Comprehensive Product Details from Product Links
Importing Libraries
import sqlite3
import time
import logging
import pymongo
import subprocess
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

This set of libraries provides solid functionality for scraping data from the web and for managing that information. pymongo offers a convenient way to use MongoDB, a NoSQL database well suited to storing large volumes of semi-structured data. subprocess allows system commands to be run from Python, which is useful when automating things like opening programs or manipulating files. sync_playwright from playwright.sync_api makes it possible to navigate and interact with complex, JavaScript-heavy websites, simplifying web automation. Lastly, stealth_sync from playwright_stealth helps avoid bot detection by making the browser behave more like a human user. Together, these let the user scrape, store, and manipulate data with ease while taking the necessary precautions and staying flexible across different website layouts.
Essential Settings for Scraping and Data Storage
# Configuration
"""
Application configuration settings:
- DB_PATH: Location of SQLite database file
- TABLE_NAME: Name of the URLs table
- LOG_FILE: Path for the log file
- MONGO_COLLECTION_NAME: Name of the collection ("products") where all scraped items will be saved
- MONGO_EXPORT_PATH: Backup location for MongoDB data
"""
DB_PATH = "/home/anusha/Desktop/DATAHUT/Tata_Cliq_Fashion/Data/Tata_cliq_original.db"
TABLE_NAME = "product_urls"
LOG_FILE = "/home/anusha/Desktop/DATAHUT/Tata_Cliq_Fashion/Log/Data_scraper_original.log"
MONGO_DB_NAME = "tata_cliq_db_original"
MONGO_COLLECTION_NAME = "products"
MONGO_EXPORT_PATH = "/home/anusha/Desktop/DATAHUT/Tata_Cliq_Fashion/Data/mongodb_data_original.bson"

This part of the code specifies the key configuration settings that describe where data will be stored, logged, and backed up during scraping. These settings keep data collection organized and easy to manage. Each one is described below:
DB_PATH: This is the file path to the SQLite database. The SQLite database is a storage space for product URLs that were extracted during the scraping process to reference later.
TABLE_NAME: This is the name of the table (in the SQLite database) the product URLs will be stored in; in this case, it is "product_urls."
LOG_FILE: The log file will store information about actions, errors, or events that occur during the scraping process that can be used for debugging and monitoring the scraping process.
MONGO_DB_NAME: Name of the MongoDB database where all extracted data will be saved. MongoDB is a flexible NoSQL database that can efficiently store large and complex datasets.
MONGO_COLLECTION_NAME: Name of the collection (here, "products") inside that database where each scraped item will be stored.
MONGO_EXPORT_PATH: This is the file path where the MongoDB data will be backed up to. Backing up data allows your scraped data to become even safer, and allows restoring in case of failure.
All of these settings guarantee a structured data storage and logging flow, which ultimately produces a better experience for the scraper by making it more efficient, reliable, and easier to manage.
Setting Up Logging for Tracking and Debugging
# Logging Setup
"""
Configure logging system to:
- Write logs to specified file
- Record INFO level and above
- Include timestamp, log level and message
"""
logging.basicConfig(filename=LOG_FILE, level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s")This code provides a basic logging system that we can use to log and track important events that occurred during the web scraping process in order to provide smooth operation and facilitate easier debugging. All log messages are written to a given file (LOG_FILE) and we can therefore review them later instead of printing them (as we have done with the print statements). It uses a standard commit log level of INFO which means only meaningful updates, warnings and errors will be logged and unnecessary debug information can be filtered out. Each log item will shows the time stamp (to track exactly when the event occurred), the log level (to indicate whether it is an INFO or ERROR), and the message (to detail what event occurred). This makes it easier to maintain transparency, of course, but is also important to help with identifying and diagnosing problems quickly as information is core to achieving this, and lastly to also give assurance that the scraper runs smoothly over time.
Connecting to MongoDB for Storing Scraped Data
# MongoDB Setup
"""
Initializes MongoDB connection:
- Connects to local MongoDB instance
- Selects database and collection
- Will be used for storing product data
"""
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client[MONGO_DB_NAME]
collection = db[MONGO_COLLECTION_NAME]

The purpose of this code is to link the script to the MongoDB database that will store the scraped product data. It begins by creating a client connection via pymongo.MongoClient("mongodb://localhost:27017/"), which tells the script to connect to the MongoDB server running locally on the machine at port 27017 (MongoDB's default port). If the server is running, this connection allows the script to interact with the database. Next, db = client[MONGO_DB_NAME] selects the specific database where the scraped product information will be stored, making it easier to write and retrieve data in an organized way. The last line, collection = db[MONGO_COLLECTION_NAME], selects a collection from that database; a collection is similar in concept to a table and will hold the many entries of product information. Whereas SQL databases employ a rigid schema, MongoDB uses a flexible, document-based structure, which suits the varied product details scraped here and makes retrieval and analysis easier. Because the script connects to the database and collection at the start of the scraping process, everything scraped during execution has a ready place to be saved.
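Once that connection exists, the same collection object can be used to inspect the stored data directly. A couple of illustrative queries (not part of the scraper itself):

# How many products have been saved so far?
print(collection.count_documents({}))

# Peek at one stored document, showing only a few fields.
print(collection.find_one({}, {"product_name": 1, "price": 1, "brand_name": 1}))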
Database Query Executor for SQLite Operations
def execute_query(query, params=None, fetch=False):
"""Helper function to execute SQLite queries.
Args:
query: SQL query string
params: Optional parameters for query
fetch: Whether to return results
Returns:
Query results if fetch=True, else None
Handles:
- Opening/closing connections
- Parameterized queries
- Commit operations
"""
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(query, params or ())
result = cursor.fetchall() if fetch else None
conn.commit()
conn.close()
return result

The execute_query function is a helper for interacting with the SQLite database, and it keeps the script efficient by avoiding repeated connection and execution boilerplate every time an SQL statement needs to run. It begins by opening a connection to the SQLite database at DB_PATH, so every interaction targets the correct database file, and creates a cursor for executing SQL commands. The function takes three parameters: query, the SQL command to execute; params, optional parameters that are passed separately to guard against SQL injection; and fetch, which indicates whether results should be returned (useful for SELECT statements). If fetch is True, the queried rows are returned; otherwise the SQL simply runs without returning anything. Finally, the function commits any changes and closes the connection to release resources. Centralizing database access in this helper keeps the script clean and readable while handling connection management, parameterized queries, and commits safely and consistently.
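As a quick usage example (assuming the Step 1 database is already populated), the helper can run a simple count query and return its result:

# fetch=True returns a list of tuples, e.g. [(1234,)]
total = execute_query(f"SELECT COUNT(*) FROM {TABLE_NAME}", fetch=True)
print(f"URLs collected so far: {total[0][0]}")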
Database Setup for Tracking Scraping Progress
def setup_database():
"""
Initializes and prepares the database structure for URL tracking.
Purpose:
- Ensures the database table has a 'processed' column to track which URLs have been scraped
- Runs automatically at script startup to maintain data integrity
What it actually does:
1. Checks existing table structure for columns
2. If 'processed' column doesn't exist:
- Adds new column with default value 0 (unprocessed)
- Logs this modification
3. If column exists:
- Simply confirms its presence in logs
Database Schema Modification:
- Adds column: processed (INTEGER)
- Default value: 0 (False/Unprocessed)
- Values:
0 = URL not yet processed
1 = URL successfully scraped
Why this matters:
- Prevents duplicate scraping of the same URL
- Enables restart capability if script stops midway
- Maintains scraping progress tracking
"""
logging.info("Setting up the database...")
columns = execute_query("PRAGMA table_info(product_urls)", fetch=True)
column_names = [col[1] for col in columns]
if "processed" not in column_names:
execute_query("ALTER TABLE product_urls ADD COLUMN processed INTEGER DEFAULT 0;")
logging.info("Added 'processed' column to track processed URLs.")
else:
logging.info("'processed' column already exists.")The setup_database function is a critical component of the scraping pipeline, as it will ensure that we set up a structure that can easily accommodate tracking scraping progress. This function is run automatically when the script runs, so every URL that is stored in the database will provide a status indicator to demonstrate whether it has been scraped before or not. The function begins by logging that the database is currently in setup; this demonstrates what is currently occurring within the execution flow. The function will then call the existing structure of the product_urls table using an SQL command to check the PRAGMA table_info(product_urls); this will return the metadata for all columns in the table. The function first retrieves the rows from the column names and checks if the processed column is already included. The processed column is integral to knowing what URLs have been scraped successfully, so that we can understand when we are scraping duplicate URLs, which can reduce efficiency as well.
If the column does not exist, the function modifies the database schema with an ALTER TABLE command, adding a new integer column called processed with a default value of 0. A value of 0 means the URL has not been processed yet, while a value of 1 is assigned once the URL has been scraped without issues. This allows the scraper to restart from where it left off after an unplanned interruption rather than starting over. After adding the column, the function logs that the modification was made. If the column already exists, it simply logs that fact and makes no changes. This way of tracking progress improves the overall performance and reliability of the scraping process: it provides a systematic record of what has been done, avoids re-scraping data unnecessarily, and makes resuming after interruptions straightforward.
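If you ever want to force a complete re-scrape, the same helper makes it a one-liner; this is a hypothetical maintenance command, not part of the original script:

# Clear every flag so the next run treats all URLs as unprocessed.
execute_query(f"UPDATE {TABLE_NAME} SET processed = 0;")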
Removing Duplicate URLs to Keep Data Clean and Unique
def remove_duplicates():
"""
Removes duplicate URLs from the database to ensure only unique records are retained.
Purpose:
- This function eliminates duplicate entries in the database while preserving the earliest recorded instance of each unique URL.
- Helps maintain data integrity and efficiency by reducing redundant entries.
How It Works:
1. Logs the start of the duplicate removal process.
2. Executes an SQL query that identifies and deletes duplicate URLs.
- The query groups records by the `url` column.
- It retains only the entry with the smallest `rowid` (earliest inserted record).
- All other occurrences of the same URL are deleted.
3. Logs a confirmation message once the process is complete.
SQL Query Explanation:
- The subquery `SELECT MIN(rowid) FROM {TABLE_NAME} GROUP BY url` finds the smallest `rowid` for each unique URL.
- The main `DELETE` query removes all rows where the `rowid` is NOT in the list of earliest `rowid` values.
- This effectively deletes all duplicate URLs while keeping only the first recorded instance.
Logging:
- The function logs both the start and completion of the duplicate removal process.
- This helps track database maintenance activities and ensures visibility into cleanup operations.
"""
logging.info("Removing duplicate URLs from the database...")
execute_query(f"""
DELETE FROM {TABLE_NAME}
WHERE rowid NOT IN (SELECT MIN(rowid) FROM {TABLE_NAME} GROUP BY url)
""")
logging.info("Duplicates removed successfully.")The remove_duplicates function helps to clean the database from duplicate product URLs and keeps only unique entries available. Duplicate URLs can happen as a result of multiple scraping sessions, a bad network connection, or simply because a URL was added again, resulting in processing the same data multiple times, and/or needlessly consuming space in memory that handling the data requires. The remove_duplicates function will start by logging an information log message that says it is going to start the process of duplicate removal. Following the information log, it runs an SQL query that will systematically identify and remove duplicate records, while leaving the first instance of each unique URL in the database.This process is accomplished using the rowid column which is unique to each row within a SQLite table. The SQL group by clause groups all identical URLs together and the select MIN(rowid) effectively removes each character added to the database but keeps the first character added to the database. In this manner, the duplicate removal function is able to keep the product data valid and remove the extra duplicates once the first instance has been maintained. After the query finishes running the duplicates are removed successfully, an information log is logged to notify duplicates are removed successfully. This process will help keep the database clean and efficient, and it reduces storage overhead while also preventing the scraper from repeatedly scraping the same URLs again and again. By running the function regularly, you can maintain the dataset as streamlined, improving the accuracy and reliability of the web scraping process overall.
Fetching unprocessed URLs for Scraping
def get_unprocessed_urls():
"""
Retrieve unprocessed product URLs from the SQLite database, ordered by ID
Returns:
List of (id, url) tuples ordered by ID
Logs:
- Count of found URLs
Filters:
- Only URLs with processed=0
- Ordered by ascending ID
"""
logging.info("Fetching unprocessed URLs ordered by ID...")
result = execute_query(f"SELECT id, url FROM {TABLE_NAME} WHERE processed = 0 ORDER BY id ASC", fetch=True)
logging.info(f"Found {len(result)} unprocessed URLs.")
return result

The get_unprocessed_urls function retrieves the product URLs that the scraper has not yet processed, which lets the script avoid reprocessing URLs from earlier sessions and resume exactly where it left off. It first logs that it is fetching unprocessed URLs, which helps audit the scraping run. It then executes an SQL SELECT query that pulls every row from the product_urls table whose processed column is 0, meaning the URL has not been scraped yet. The rows are ordered by ascending id so processing happens in an organized, predictable order. The function logs how many unprocessed URLs were found, which helps track progress, and returns them as a list of (id, url) tuples so other parts of the script can loop through and process each URL efficiently. By excluding URLs that have already been handled, it avoids duplicate requests and improves overall performance.
Marking URLs as Processed to Prevent Duplicate Scraping
def mark_url_processed(url):
"""
Mark a URL as processed in the database
Updates a URL's status to 'processed' in the database to prevent re-scraping.
Functionality:
- Sets the 'processed' flag (1) for a specific URL in the database
- Provides feedback via logs about success/failure
- Ensures proper database connection handling
Parameters:
url (str): The exact product URL to mark as processed
Database Operation:
- Executes: UPDATE product_urls SET processed = 1 WHERE url = [provided_url]
- Uses parameterized queries to prevent SQL injection
- Commits changes immediately
Logging Behavior:
- Success: "Successfully marked as processed: [url]"
- Failure: "Failed to mark as processed: [url]" (if URL not found)
Error Handling:
- Implicit: Fails gracefully if URL doesn't exist (rowcount = 0)
- Explicit: Closes database connection even if errors occur
"""
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(f"UPDATE {TABLE_NAME} SET processed = 1 WHERE url = ?", (url,))
conn.commit()
if cursor.rowcount > 0:
logging.info(f"Successfully marked as processed: {url}")
else:
logging.warning(f"Failed to mark as processed: {url}")
"""
Verification and logging of database update status.
What this checks:
- cursor.rowcount: Number of rows affected by the UPDATE query
- > 0 means the URL was found and marked successfully
- == 0 means no matching URL was found
Logging Behavior:
- Success: Logs INFO level message with the processed URL
- Failure: Logs WARNING level message with the problematic URL
"""
conn.close()
The mark_url_processed function plays a crucial role in making sure that a product URL gets scraped only once. It updates the database to denote that a URL has been "processed," which prevents it from being scraped again and keeps the overall process efficient. When the function is called, it first connects to the SQLite database and creates a cursor object for executing SQL commands. It then runs an UPDATE statement that sets the processed column to 1 for the specific URL; a value of 1 means the URL has been scraped and stored, so it will not be scraped again in future runs of the script. To improve security and avoid SQL injection, the function uses a parameterized query, ensuring the supplied url is treated as data rather than as part of the query structure. After the update is executed, the function commits the transaction, saving the change before the connection is closed.
The function then checks whether the update succeeded by testing cursor.rowcount, which returns the number of rows affected by the operation. If rowcount > 0, the URL was found in the database and has been marked as processed, and an INFO-level log message records this. If rowcount == 0, the URL was not found in the database. This does not necessarily indicate an error; it most often happens when a misspelled URL is passed in or the entry is new. In that case a WARNING-level message notes that the URL could not be marked as processed.
Finally, the function ensures the database connection is always closed, whether or not the operation succeeded. This matters for resource management, preventing memory leaks and database locks. With this update mechanism the scraping script can track its progress effectively and resume from where it left off if it is interrupted.
Extracting Text from Web Pages using a CSS selector
def extract_text(page, selector):
"""
Extract text from the page using a CSS selector, handling missing elements gracefully
Args:
page: Playwright page object
selector: CSS selector string
Returns:
Extracted text or "N/A" if not found
"""
try:
return page.locator(selector).text_content().strip()
except:
return "N/A"
The extract_text function retrieves text content from a webpage via a CSS selector without causing the script to fail if the element does not exist. It takes two parameters: page, the Playwright page object for the webpage currently being processed, and selector, a string defining the CSS selector of the target element. The function attempts to locate the element with Playwright's locator method and read its text content; on success it strips leading and trailing spaces and returns the cleaned text. If the element cannot be found, or extraction fails for any reason, the function returns "N/A" instead of raising an error and killing the script. This keeps the web scraping process stable, so a missing element on one page does not halt data collection.
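For example, using one of the selectors that appears in the product-details step further below, a missing element simply yields "N/A" instead of an exception:

mrp = extract_text(page, ".ProductDetailsMainCard__cancelPrice")
print(mrp)  # the listed MRP text, or "N/A" if the element is absent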
Extracting general_features from Tata CLiQ Pages
def extract_general_features(page):
"""
Extracts product features from a Tata CLiQ product page using Playwright.
This function:
- Finds all feature containers on the page
- For each container, extracts feature names (headers) and their corresponding values
- Returns a dictionary of {feature_name: feature_value} pairs
Parameters:
page (playwright.Page): The Playwright page object currently on a product page
Returns:
dict: A dictionary where:
- Keys are feature names (e.g., "Material", "Warranty")
- Values are the corresponding feature details
- Returns empty dict if no features found or error occurs
Error Handling:
- Catches and logs any exceptions during extraction
- Returns empty dict on error to allow graceful continuation
"""
try:
features = {}
elements = page.locator(".ProductFeatures__content").all()
for element in elements:
headers = element.locator(".ProductFeatures__header.ProductFeatures__description").all()
values = element.locator(".ProductFeatures__description").all()
if len(headers) > 0 and len(values) > 1:
key = headers[0].text_content().strip()
value = values[1].text_content().strip()
features[key] = value
return features
except Exception as e:
logging.error(f"Error extracting general features: {e}")
return {}

The extract_general_features function scrapes key product information from a Tata CLiQ product page using Playwright. When it runs, it examines the webpage for containers holding product features and extracts the relevant information from them (e.g., material, warranty, specifications). The function starts by declaring an empty dictionary, features, which will hold every detail collected. It then searches the page for all elements with the class .ProductFeatures__content, which contain the product features; these elements are collected into a list, and the function loops through each one to extract feature names and values.
For each feature container, the function will look for its corresponding headers (examples "Material" and "Warranty") and their related descriptions. Headers will be found using the CSS selector .ProductFeatures__header.ProductFeatures__description and values will be accessed through .ProductFeatures__description. The function checks, before it continues to match values to headers, that there is at least one header present and more than one value. The text that has been extracted will be cleaned of unnecessary white space and stored in the dictionary as key-value pairs.
Should an error occur during extraction (for example, the element is missing or the expected structure of the website changes), the function catches the exception and logs an appropriate error message. The scraping process does not terminate; the function returns an empty dictionary, allowing the script to continue execution without interruption. This robust error handling keeps the scraping script stable and reliable, so it won't fail merely because of a minor inconsistency on the webpage.
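An illustrative call might look like this; the actual feature names depend entirely on what the product page lists:

features = extract_general_features(page)
for name, value in features.items():
    print(f"{name}: {value}")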
Extracting Complete Product Details from Tata CLiQ Pages
def fetch_product_details(url, page):
"""
Extracts comprehensive product details from a Tata CLiQ product page.
This function:
- Navigates to the product page URL
- Waits for key elements to load
- Extracts multiple product attributes using CSS selectors
- Returns structured product data
- Handles errors gracefully with detailed logging
Parameters:
url (str): The complete product page URL to scrape
page (playwright.page): Playwright page object for browser automation
Returns:
dict: Structured product data with these fields:
- url (str): Product page URL
- product_name (str): Name of the product
- brand_name (str): Manufacturer/brand name
- brand_info (str): Additional brand information
- price (str): Current selling price
- mrp (str): Original maximum retail price
- discount (str): Discount percentage/amount
- rating_value (str): Numeric rating (e.g., "4.2")
- rating_count (str): Number of ratings
- review_count (str): Number of written reviews
- product_description (str): Full product description
- general_features (dict): Key-value pairs of product features
Returns None if scraping fails
Error Handling:
- Logs detailed error messages including the failed URL
- Returns None on failure to allow graceful error handling
- Uses generous timeouts (60-80 seconds) for slow-loading pages
Implementation Details:
- Uses a lambda helper 'extract()' for consistent element handling
- Returns "N/A" for missing fields rather than failing
- Combines both direct selector extracts and feature extraction
- Dependent on extract_general_features() for feature details
- Includes debug logging for tracking progress
Selector Notes:
- All selectors target specific Tata CLiQ DOM structures
- Uses :not(:empty) to avoid blank price elements
- Relies on itemprop attributes for rating metadata
"""
try:
logging.info(f"Scraping URL: {url}")
page.goto(url.strip(), timeout=80000)
page.wait_for_selector(".ProductDetailsMainCard__linkName > div:nth-child(1)", timeout=60000)
extract = lambda selector: page.locator(selector).text_content().strip() if page.locator(selector).count() > 0 else "N/A"
product = {
"url": url,
"product_name": extract(".ProductDetailsMainCard__linkName > div:nth-child(1)"),
"brand_name": extract("#pd-brand-name > span:nth-child(1)"),
"brand_info": extract("div.ProductDescriptionPage__detailsHolder:nth-child(1) > div:nth-child(1) > div:nth-child(4) > div:nth-child(2) > div:nth-child(1)"),
"price": extract(".ProductDetailsMainCard__price *:not(:empty)"),
"mrp": extract(".ProductDetailsMainCard__cancelPrice"),
"discount": extract(".ProductDetailsMainCard__discount"),
"rating_value": extract(".ProductDetailsMainCard__reviewElectronics[itemprop='ratingValue']"),
"rating_count": extract(".ProductDetailsMainCard__ratingLabel[itemprop='ratingCount']"),
"review_count": extract(".ProductDetailsMainCard__ratingLabel[itemprop='reviewCount']"),
"product_description": extract("div.ProductDescriptionPage__detailsHolder:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(2) > div:nth-child(1)"),
"general_features": extract_general_features(page)
}
logging.info(f"Successfully scraped: {url}")
return product
except Exception as e:
logging.error(f"Error scraping {url}: {e}")
return None

The fetch_product_details function extracts detailed product information from a Tata CLiQ product page using Playwright. Given a product URL, it navigates to that page, waits for the key elements to load, and then extracts the product title, brand, price details, discounts, ratings and reviews, and an in-depth product description. Additional product specifications are gathered through the separate extract_general_features() function. A notable feature is the extract helper lambda, which keeps data retrieval consistent and clean: it checks whether an element is present before reading its text, and if the element is missing it returns a default value of "N/A" rather than raising an error.
The procedure begins by logging the URL being processed, then navigates to the product page with a generous timeout to accommodate slow page loads. Once the main elements have appeared, product data is extracted using CSS selectors tailored to Tata CLiQ's page structure, and the results are collected into a structured product dictionary. Notable fields include product_name for the title, brand_name for the manufacturer, price and mrp for the current and original prices respectively, and discount for any available discount. Customer metrics such as rating_value, rating_count, and review_count give a picture of customer engagement.
To enhance reliability, the function includes several error-handling measures. If an issue is detected during execution, such as a missing element or an unexpected change to the website, it logs the error along with the failed URL and returns None in a controlled manner rather than halting execution, which lets the scraper move on to other URLs. Generous timeouts for page and element loading prevent data loss on slow networks, and the log entries written throughout let you follow the scraping as it happens and track down errors easily. By combining structured data extraction, robustness to failure, and progress logging, fetch_product_details offers a reliable way to collect product data from Tata CLiQ for storage and analysis.
Saving Scraped Product Data to MongoDB Without Duplicates
def save_to_mongodb(data):
"""
Safely saves scraped product data to MongoDB while preventing duplicates.
This function:
- Checks if the product URL already exists in MongoDB
- Transforms and stores new records with proper ID mapping
- Provides detailed logging of all operations
- Gracefully handles errors during database operations
Parameters:
data (dict): A dictionary containing product details with these required keys:
- id (int): The SQLite primary key (will become _id in MongoDB)
- url (str): The product URL (used for duplicate checking)
- Other product attributes (name, price, etc.)
Behavior:
1. Duplicate Check:
- Uses the URL field to check for existing records
- Skips insertion if URL already exists (idempotent operation)
2. Data Transformation:
- Moves SQLite 'id' → MongoDB '_id' field
- Removes the original 'id' field to avoid data duplication
3. Database Operations:
- Performs atomic insert if record is new
- Commits changes immediately
4. Logging:
- Success: "Saved to MongoDB: [url]"
- Duplicate: "Skipped duplicate: [url]"
- Errors: "Error saving to MongoDB: [error_details]"
Error Handling:
- Catches and logs all database exceptions
- Prevents crashes from duplicate key errors
- Maintains data consistency
Notes:
- Depends on global 'collection' MongoDB collection object
- Designed for use with Tata Cliq scraping pipeline
- Preserves original SQLite record relationships via _id
"""
try:
if collection.find_one({"url": data["url"]}) is None:
# Use the SQLite `id` as the `_id` in MongoDB
data["_id"] = data["id"]
# Remove the `id` field to avoid duplication
del data["id"]
# Insert the document into MongoDB
collection.insert_one(data)
logging.info(f"Saved to MongoDB: {data['url']}")
else:
logging.info(f"Skipped duplicate: {data['url']}")
except Exception as e:
logging.error(f"Error saving to MongoDB: {e}")
The save_to_mongodb function stores scraped product information in MongoDB without inserting duplicate records. When called, it first checks whether the product URL already exists in the collection; this duplicate check prevents the same product from being inserted twice, keeping the stored data free of redundancy and the database lean. The function expects the data as a dictionary containing the SQLite id, the url, and the remaining product attributes such as name and price. It maps the SQLite id field to MongoDB's _id field so every document has a uniform, unique identifier, and removes the original id key before insertion to avoid storing the same value twice. If the product URL is not yet in the database, the function inserts the new record and logs a success message; if the URL already exists, it logs that the record was skipped. The function is also resilient to errors: if anything goes wrong during the database operation, such as a connection failure, it catches the exception and logs an error message rather than crashing the script, which keeps the run smooth and the data intact. Combined with the detailed logging, this provides real-time feedback on database interactions and makes the scraping pipeline easier to monitor and debug. The function is tailored to the Tata Cliq scraping pipeline and stores structured product data without compromising the integrity of the dataset.
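To make the id-to-_id mapping concrete, here is a small, purely illustrative call; the field values and URL are invented for the example and are not real scraped data.
# Illustrative call to save_to_mongodb; every value below is made up for demonstration.
record = {
    "id": 42,                                            # SQLite primary key
    "url": "https://www.tatacliq.com/example-product",   # hypothetical product URL
    "product_name": "Hydrating Face Wash",
    "price": "Rs. 299",
}
save_to_mongodb(record)
# The document stored in MongoDB would look like:
# {"_id": 42, "url": "https://www.tatacliq.com/example-product",
#  "product_name": "Hydrating Face Wash", "price": "Rs. 299"}
# Calling save_to_mongodb again with the same URL logs "Skipped duplicate" and inserts nothing.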
Backing Up MongoDB Database to a Compressed File
def export_mongodb():
"""
Exports the entire MongoDB database to a compressed archive file using mongodump.
This function:
- Creates a gzipped backup of the specified MongoDB database
- Saves the backup to the predefined export path
- Provides detailed logging of the export process
- Handles errors gracefully with specific error logging
Workflow:
1. Initiates mongodump command with these parameters:
- --db: Specifies the database name (from MONGO_DB_NAME)
- --archive: Outputs to a single compressed file (MONGO_EXPORT_PATH)
- --gzip: Enables compression to reduce file size
2. Logs start/stop messages for tracking:
- "Exporting MongoDB database..." (when starting)
- "MongoDB export successful!" (on completion)
- Specific error messages if failed
Configuration Requirements:
- MONGO_DB_NAME: Must be set to a valid database name
- MONGO_EXPORT_PATH: Must be a writable file path with .bson extension
- mongodump must be installed and in system PATH
Error Handling:
- Catches subprocess.CalledProcessError specifically
- Logs detailed error message including the actual command failure
- Does not crash the application on failure
Notes:
- Requires MongoDB tools installed (mongodump specifically)
- Runs as a blocking operation (will pause script during export)
- Output file uses BSON format (MongoDB's binary JSON)
- Compression reduces file size significantly (--gzip flag)
- Preserves all collections in the database
"""
try:
logging.info("Exporting MongoDB database...")
subprocess.run(
["mongodump", "--db", MONGO_DB_NAME, "--archive=" + MONGO_EXPORT_PATH, "--gzip"],
check=True
)
logging.info("MongoDB export successful!")
except subprocess.CalledProcessError as e:
logging.error(f"MongoDB export failed: {e}")
The export_mongodb function exports the whole MongoDB database to a compressed archive so that all scraped product information is backed up and can be restored later if needed. It invokes mongodump, MongoDB's native backup tool, which writes the database to a single archive in BSON (binary JSON) format; the --gzip flag compresses the output considerably without sacrificing data integrity. When executed, the function first logs a message indicating that the export has started, then runs mongodump with the database name (MONGO_DB_NAME) and the output path (MONGO_EXPORT_PATH). The check=True parameter makes subprocess.run raise an error if the command fails, and on success the function logs that the export completed. If anything goes wrong, for example mongodump is not installed, the database name is misconfigured, or file permissions block the write, the function catches the exception and logs a detailed failure message; this does not crash the wider scraping process but still provides useful debugging information. For the function to work, mongodump must be installed and available on the system PATH, and the database name and export file location must be configured correctly. Because the backup is saved as a BSON archive, it can later be restored with MongoDB's mongorestore command. This step is critical for safeguarding the data and protecting the scraped information from loss due to system failures or crashes.
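Restoring such an archive would typically look like the sketch below; it mirrors the export call, and the archive path is an assumption that should match whatever MONGO_EXPORT_PATH was set to.
# Hedged sketch: restoring the gzipped archive produced by export_mongodb.
# The archive path is an assumption; reuse the value configured in MONGO_EXPORT_PATH.
import subprocess

subprocess.run(
    ["mongorestore", "--gzip", "--archive=tatacliq_backup.gz"],
    check=True,
)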
Scraping and Storing Product Data from URLs in MongoDB
def scrape_all():
"""
Scrape all product URLs stored in the database in order of ID and save to MongoDB progressively.
This main controller function:
1. Initializes and prepares the database
2. Cleans existing URL data
3. Processes all un-visited product pages systematically
4. Stores results in MongoDB
5. Exports the final dataset
Workflow Steps:
----------------------------
1. Database Setup:
- Ensures proper schema exists
- Removes duplicate URLs
- Retrieves unprocessed URLs ordered by ID
2. Browser Initialization:
- Launches headless WebKit browser
- Configures stealth mode to avoid detection
- Creates fresh browsing context
3. URL Processing:
- Processes URLs sequentially
- Validates URL format
- Extracts product details
- Handles failures gracefully
4. Data Management:
- Enriches data with SQLite ID
- Saves to MongoDB (with duplicate prevention)
- Updates URL status in SQLite
- Includes 2-second delay between requests
5. Completion:
- Exports MongoDB collection
- Cleans up resources
- Provides completion logging
Error Handling:
- Skips invalid URLs (logged as warnings)
- Continues on individual URL failures
- Preserves state between runs via 'processed' flags
- Comprehensive error logging at each stage
Configuration:
- Uses headless browser for efficiency
- 2-second delay between requests (adjustable)
- Timeout values inherited from helper functions
Notes:
- Maintains state via SQLite 'processed' flags
- Processes URLs in ID order for consistency
- Requires MongoDB and SQLite to be properly configured
- All actions are logged for progress tracking
"""
setup_database()
remove_duplicates()
url_entries = get_unprocessed_urls()
if not url_entries:
logging.info("No unprocessed URLs found. Exiting...")
return
with sync_playwright() as p:
browser = p.webkit.launch(headless=True)
context = browser.new_context()
page = context.new_page()
stealth_sync(page)
for product_id, url in url_entries:
if not url.startswith("http"):
logging.warning(f"Skipping invalid URL: {url}")
continue
try:
product = fetch_product_details(url, page)
if product:
# Add the SQLite `id` to the product data
product["id"] = product_id
# Save to MongoDB
save_to_mongodb(product)
# Mark the URL as processed in SQLite
mark_url_processed(url)
logging.info(f"Successfully processed and saved: {url}")
else:
logging.warning(f"Failed to scrape product details for: {url}")
except Exception as e:
logging.error(f"Error processing URL {url}: {e}")
time.sleep(2)
browser.close()
export_mongodb()
logging.info("Scraping process completed successfully.")
The scrape_all function is the primary controller of the complete data scraping process for Tata Cliq product pages. It retrieves the stored product URLs from the database one by one, fetches the necessary information for each, and saves it to MongoDB while keeping the run efficient and free of unhandled errors. The workflow is divided into several steps. First, it initializes the database by creating the appropriate schema and removing duplicate URLs, so that only unique links are processed. It then fetches all product URLs that have not yet been processed, ordered by their database id, which keeps runs consistent. If there are no unprocessed URLs, the function logs this and exits instead of wasting system resources.
Once the URLs are obtained, the function launches a headless WebKit browser through Playwright, which enables quick and discreet scraping; combined with stealth mode, this reduces the likelihood of being flagged by Tata Cliq's anti-scraping measures. It then creates a fresh browsing context, giving the session a clean, isolated environment from which data is fetched. URLs are handled one at a time so that no product page is missed. Before scraping, the function verifies that each URL is valid; if a URL is not properly formatted (e.g., missing "http"), it logs a warning and moves on to the next URL without crashing the entire process.
For each valid product URL, the function calls fetch_product_details, which extracts all relevant product information. If successful, it enriches the extracted data by adding the SQLite database ID before storing the structured information into MongoDB. The save_to_mongodb function is responsible for handling the database insertion while preventing duplicate entries. Once data is successfully stored, the function marks the URL as processed in the SQLite database using mark_url_processed, ensuring that the same URL is not re-scraped in future runs. If any issues occur during the scraping process, such as a page failing to load or missing data, the function catches the exception, logs an error message, and moves on to the next URL without interrupting the entire workflow.
To avoid overloading the website with requests and to mimic human-like behavior, the function includes a two-second delay between each request. This delay can be adjusted as needed. After processing all URLs, the function closes the browser to free up system resources and then calls export_mongodb to create a compressed backup of the MongoDB database.
This ensures that all scraped data is safely stored and can be restored later if needed. Finally, it logs a completion message indicating that the entire scraping process was executed successfully. Throughout the process, extensive logging is used at every step to track progress, identify issues, and ensure transparency.
This function is designed to be highly robust, preserving its state across multiple runs by using the processed flag in SQLite. This means that if the script is interrupted for any reason, it can resume from where it left off without having to start over. The scrape_all function is a crucial component of the Tata Cliq data scraping pipeline, integrating various helper functions to efficiently extract, store, and manage product data in an automated and structured manner.
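The SQLite helpers that scrape_all relies on (setup_database, get_unprocessed_urls, mark_url_processed) are defined earlier in the script; the sketch below shows one plausible shape for the processed-flag logic, assuming a urls table with id, url, and processed columns and a hypothetical database file name.
# Hedged sketch of the SQLite state-tracking helpers referenced above.
# The table name, column names, and database file are assumptions for illustration.
import sqlite3

DB_PATH = "tatacliq_urls.db"  # hypothetical database file

def get_unprocessed_urls():
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(
            "SELECT id, url FROM urls WHERE processed = 0 ORDER BY id"
        ).fetchall()
    return rows  # list of (id, url) tuples, matching how scrape_all unpacks them

def mark_url_processed(url):
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("UPDATE urls SET processed = 1 WHERE url = ?", (url,))
        conn.commit()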
Entry Point to Execute the Script
if __name__ == "__main__":
"""
Entry point when run as script:
- Executes main scraping function
- Handles any top-level errors
"""
scrape_all()
The if __name__ == "__main__": clause is the script's entry point and guarantees that the scraping operation runs only when the script is executed directly. This is a standard Python convention that prevents accidental execution if the script is imported as a module by another program. Inside this block, scrape_all() is invoked as the main controller of the Tata Cliq scraping process, coordinating all major operations such as database initialization, fetching of product information, and storage in MongoDB, so that the pipeline runs effectively from beginning to end.
Further, the entry point is designed to handle top-level errors gracefully so that unexpected failures do not crash the script; any failure is logged, which makes debugging easier, and later runs can simply resume from where the previous one stopped thanks to the processed flags kept in SQLite.
This structure keeps the script modular and reusable, making it easy to compose into larger data processing pipelines. In effect, the block ensures that whenever the script is run as a stand-alone application, the scraping operation starts automatically and runs in a contained manner.
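The docstring mentions handling top-level errors; one way to make that explicit, shown here only as a hedged variation on the block above rather than the author's exact code, is to wrap the call in a try/except (scrape_all and logging are assumed to come from the same script).
# Hedged variation: an entry point with explicit top-level error handling.
if __name__ == "__main__":
    try:
        scrape_all()
    except Exception as e:
        # Log the failure so the run leaves a trace instead of ending with a bare traceback.
        logging.error(f"Fatal error in scraping pipeline: {e}")
        raise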
Conclusion
This data pipeline for the Tata CLiQ Fashion website is a well-organized, fully automated product data harvesting process. The system employs Playwright for automated web browsing, SQLite for tracking progress, and MongoDB for data storage, making the collection reliable, scalable, and complete. Error handling, duplicate prevention, and logging at every stage keep the scraper efficient, robust, and accurate while reducing the risk of data loss. This approach also keeps the system easy to maintain and improve, so it can adapt to changing data scraping requirements. From market analysis and product analysis to competitor analysis, this scraping solution supports the accurate and dependable extraction of high-quality e-commerce data.
Libraries and Versions
Name: pymongo
Version: 4.10.1
Name: playwright
Version: 1.48.0
Name: playwright-stealth
Version: 1.0.6
FAQs
1. Is it legal to scrape Tata CLiQ for product data?
Scraping publicly available data is often legal for personal or research use, but it's essential to review Tata CLiQ’s Terms of Service. Always respect their robots.txt file and avoid violating copyright or usage policies.
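For example, robots.txt can be checked programmatically before crawling; the category URL below is purely illustrative and is not a statement about what Tata CLiQ's robots.txt actually allows.
# Hedged example: consulting robots.txt before crawling.
# The category URL passed to can_fetch is hypothetical.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.tatacliq.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.tatacliq.com/personal-care"))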
2. What kind of personal care data can I extract from Tata CLiQ?
You can scrape data such as product names, prices, discounts, ratings, reviews, ingredients, brand details, availability, and product descriptions.
3. Which tools are best for scraping Tata CLiQ?
Popular tools include Python libraries like BeautifulSoup, Scrapy, and Selenium. For large-scale scraping, tools like Playwright or Puppeteer can help with dynamic content rendering.
4. How can I ensure the scraped data is accurate and up to date?
Implement regular scraping intervals, data validation checks, and deduplication logic to ensure fresh and reliable insights from Tata CLiQ.
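One simple way to enforce the deduplication mentioned above at the database level, sketched here with assumed connection, database, and collection names, is a unique index on the URL field.
# Hedged sketch: enforcing URL uniqueness in MongoDB itself.
# The connection string, database name, and collection name are assumptions.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["tatacliq_scraper"]["personal_care_products"]
collection.create_index([("url", ASCENDING)], unique=True)
# With this index, insert_one raises DuplicateKeyError for a repeated URL,
# which can be caught and logged instead of silently storing a duplicate.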
5. What are some use cases for scraping Tata CLiQ’s personal care section?
Use cases include competitor analysis, pricing strategy development, market trend tracking, brand benchmarking, and building recommendation engines for e-commerce.