How to Scrape Eyewear Pricing Data from Noon Using Python
- Ambily Biju
- May 8
Introduction
Imagine being able to pull the latest eyewear prices, discounts, and seller details from Noon through web scraping. This can benefit data analysts tracking pricing shifts, or a company trying to get ahead of competitors by monitoring Noon's eyewear listings. It is almost like having real-time access to structured information.
This blog will teach you how to scrape eyeglass product listings on Noon, one of the leading online marketplaces in the Middle East. The tutorial covers all the necessary steps, from retrieving product links to extracting full details such as the current price, ratings, specifications, and seller information. By the end, you will understand how to retrieve and store data using Playwright and BeautifulSoup for further analysis.
What is Web Scraping?
Web scraping is the automated extraction of information from websites. It is frequently used in e-commerce, finance, research, and real estate to monitor prices, competitors, and other relevant information. Tools like Playwright and BeautifulSoup make it possible to extract structured data from complex, JavaScript-heavy websites and store it in a database for further analysis.
For our Noon eyewear project, we built a web scraping workflow to retrieve relevant products from Noon.com. The project involved:
Product Link Scraping – Using Playwright to dynamically load category pages, scroll, and extract product URLs efficiently.
Data Extraction – Using BeautifulSoup to parse product pages and retrieve key details such as model number, ratings, seller information, and pricing.
Database Management – Storing extracted links and product data in SQLite, ensuring data integrity and easy access.
Error Handling & Automation – Implementing retry mechanisms, managing failed URLs, and automating the entire process for continuous data collection.
By automating this workflow, we successfully built a scalable and structured approach to gathering product insights, which can be useful for e-commerce analytics, pricing strategies, and business intelligence.
Libraries and Tools Used in Noon Web Scraping
In the Noon web scraping project, different Python tools and libraries were used to extract structured information from web pages that are loaded dynamically. Each of them helped in getting the data, processing it, and even storing it.
Playwright handles browser automation, allowing the script to work with webpages that rely on JavaScript. Unlike conventional static scrapers, Playwright can scroll automatically, wait for elements to load, and handle dynamic content. In this project, Playwright retrieves product links and details by simulating scrolling and navigation, ensuring that all required information is rendered before extraction.
BeautifulSoup, from the bs4 package, parses the HTML content. It turns the raw page source captured by Playwright into a navigable structure so the script can retrieve product links, ratings, specifications, and other elements through tag- and class-based selectors.
The built-in sqlite3 module manages and stores the scraped data in an SQLite database. It tracks product links, maintains a processed-status flag for each URL, and prevents duplicate entries. Separate tables are created for product links, product data, and failed URLs.
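If you want to follow along, a typical setup looks like the sketch below. The package names are the standard PyPI ones, and the playwright install step downloads the Chromium build the scripts drive; sqlite3, asyncio, random, and time ship with Python and need no installation.
# Install the third-party packages (sketch; assumes Python 3.8+ and pip)
pip install playwright beautifulsoup4
# Download the Chromium browser binary used by Playwright
playwright install chromium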
STEP 1 : Product Link Scraping
Importing Necessary Libraries
import asyncio
import sqlite3
import random
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
Before we begin scraping data from the Noon website, we need to import several important libraries that will help us with different tasks in our script.
Database Setup
# Database setup
db_filename = "noon_eyeglasses.db"
To store the collected product links in an organized way, we need a database. In this script, we define a database file named "noon_eyeglasses.db" using the variable db_filename. This database will act as a storage system where we can save product links so that we don’t lose any data during the scraping process. Instead of keeping the links in a temporary file or a list in memory, using a database ensures that our data is safe even if the script stops running unexpectedly.
A database is especially useful for avoiding duplicate entries. If we run the script multiple times, we can check whether a link already exists before adding it again. This prevents unnecessary repetitions and keeps our data clean. By storing the links in a structured way, we can easily access and use them later for further processing, such as extracting detailed product information from each link.
Initializing the Database
def init_db():
"""
Initialize the SQLite database and create the
'product_links' table if it does not exist.
The table contains:
- 'id' (INTEGER, PRIMARY KEY, AUTOINCREMENT)
as a unique identifier.
- 'product_link' (TEXT, UNIQUE) to store
product URLs without duplicates.
This function:
- Connects to the SQLite database.
- Executes the table creation query.
- Commits the changes.
- Closes the database connection.
Returns:
- None
"""
conn = sqlite3.connect(db_filename)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS product_links (
id INTEGER PRIMARY KEY AUTOINCREMENT,
product_link TEXT UNIQUE
)
""")
conn.commit()
conn.close()
Before we start collecting product links, we need a structured way to store them. The init_db() function is responsible for setting up the database. It ensures that our database file, "noon_eyeglasses.db", has a table named "product_links", where we can store the URLs of products we scrape.
In this function, we first establish a connection to the SQLite database using sqlite3.connect(db_filename). If the database file does not exist, SQLite automatically creates it. Then, we create a cursor object, which allows us to execute SQL commands. We use the CREATE TABLE IF NOT EXISTS statement to create the product_links table. This table contains two columns:
id – A unique identifier for each entry, which is automatically incremented when a new link is added.
product_link – A text field where each product URL is stored. The UNIQUE constraint ensures that duplicate links are not added.
After executing the table creation command, we commit the changes to save them in the database and close the connection. This function ensures that the database structure is always in place before we start adding product links, preventing errors and maintaining data consistency.
Retrieving Category URLs
def get_category_urls():
"""
Read category URLs from a text file and return them
as a list after stripping whitespace.
Returns:
- list[str]: A list of cleaned category URLs.
The function:
- Opens 'data/category_urls.txt' in read mode.
- Strips leading/trailing spaces from each line.
- Filters out empty lines.
- Returns the cleaned URLs as a list.
"""
with open("data/category_urls.txt", "r") as f:
return [
line.strip() for line in f
if line.strip()
]
The get_category_urls() function is responsible for reading category page URLs from a text file and returning them as a list. These category URLs act as starting points for our scraping process, as each category page contains multiple product listings that we need to extract.
Inside the function, we open the file "data/category_urls.txt" in read mode. This file contains a list of category page links, each written on a new line. We then read each line, remove any extra spaces at the beginning or end using the strip() function, and ensure that empty lines are ignored. The cleaned URLs are collected into a list and returned.
This approach keeps the category URLs separate from our code, making it easier to update them without modifying the script. If we need to scrape different categories, we can simply edit the "category_urls.txt" file instead of making changes in the Python code.
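As a rough illustration, data/category_urls.txt might look like the lines below; the URLs here are placeholders and should be replaced with the Noon category pages you actually want to scrape. Note that scrape_category() later appends &page=N to each entry, so each base URL should already contain a query string (hence the ?limit=50 in these examples).
https://www.noon.com/uae-en/eyewear/eyewear-accessories/women?limit=50
https://www.noon.com/uae-en/eyewear/eyewear-accessories/men?limit=50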
Saving Product Links to the Database
def save_links_to_db(cursor, links):
"""
Insert new product links into the 'product_links' table.
Args:
- cursor (sqlite3.Cursor): Database cursor to execute
SQL commands.
- links (list[tuple[str]]): A list of tuples, where
each tuple contains a single product link.
Returns:
- None
The function:
- Uses executemany() to insert multiple links efficiently.
- Ignores duplicate entries using a UNIQUE constraint.
- Prints a message if a duplicate entry is encountered.
"""
try:
cursor.executemany(
"INSERT INTO product_links (product_link) VALUES (?)",
links)
except sqlite3.IntegrityError:
print("Skipping duplicate entry...")
The save_links_to_db() function is responsible for storing the collected product links in the database. Since we are dealing with multiple links at once, this function efficiently inserts them into the "product_links" table using a batch operation.
The function takes two inputs: cursor, which is a database cursor used to execute SQL commands, and links, which is a list of tuples, where each tuple contains a single product link. Using executemany(), the function attempts to insert all the links into the database at once. This method is faster and more efficient than inserting links one by one.
Since the "product_links" table has a UNIQUE constraint on the product_link column, the database will not allow duplicate entries. If an attempt is made to insert a link that already exists, SQLite raises an IntegrityError. The function catches this error and prints "Skipping duplicate entry..." instead of stopping the script. This ensures that the scraper continues running smoothly without interruptions, even if some links are already present in the database.
Extracting Product Links from the Webpage
def extract_product_links(soup):
"""
Extract product links from the page's HTML content
using BeautifulSoup.
Args:
- soup (BeautifulSoup): Parsed HTML content of the
webpage.
Returns:
- list[str]: A list of product URLs extracted from
the page.
The function:
- Finds all product containers using their CSS class.
- Extracts the 'href' attribute from anchor tags.
- Appends the full product URL to a list.
- Returns a list of extracted product links.
"""
product_divs = soup.find_all(
"div",
class_="sc-57fe1f38-0 eSrvHE"
)
links = []
for div in product_divs:
anchor = div.find("a", href=True)
if anchor:
link = "https://www.noon.com" + anchor["href"]
links.append(link)
return links
The extract_product_links() function is responsible for finding and extracting product links from a webpage's HTML content. Since the product links are embedded within the webpage structure, we use BeautifulSoup to process the HTML and locate the relevant information.
First, the function looks for all <div> elements that match a specific CSS class (sc-57fe1f38-0 eSrvHE). These <div> elements contain product details, including links to individual product pages. Inside each <div>, the function searches for an <a> (anchor) tag with an href attribute, which holds the actual link to the product.
Once the link is found, it is combined with "https://www.noon.com" to create a complete product URL. This is necessary because the href attribute usually contains only a partial link, and we need to add the main website domain to make it a valid URL. The extracted links are stored in a list and returned.
This function ensures that we collect all product links displayed on a given category page so that they can be further processed to extract detailed product information.
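One caveat: the class name sc-57fe1f38-0 eSrvHE is auto-generated and tends to change whenever Noon redeploys its frontend. A slightly more resilient sketch, based on the assumption that Noon product URLs contain a /p/ segment rather than on any class name, could look like this:
def extract_product_links_by_href(soup):
    """Collect product URLs by matching relative hrefs that contain '/p/'."""
    links = set()
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Assumption: product detail pages use relative URLs with a '/p/' segment
        if href.startswith("/") and "/p/" in href:
            links.add("https://www.noon.com" + href)
    return list(links)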
Scraping Product Links from a Category
async def scrape_category(page, cursor, base_url):
"""
Scrape product links from a given category,
iterating through multiple pages.
Args:
- page (playwright.async_api.Page): The Playwright
page instance used for browsing.
- cursor (sqlite3.Cursor): Database cursor to execute
SQL queries and store product links.
- base_url (str): The base URL of the category page
to be scraped.
Returns:
- None: The function saves the extracted links
directly into the database.
The function:
- Visits each page of the category sequentially.
- Waits for the page to fully load before scraping.
- Scrolls down multiple times to ensure all products load.
- Extracts product links using BeautifulSoup.
- Checks for duplicates before saving to the database.
- Stops scraping if no new links are found.
"""
page_number = 1
while True:
url = f"{base_url}&page={page_number}"
print(f"\nVisiting Page {page_number}: {url}")
await page.goto(
url,
wait_until="load",
timeout=200000
)
await page.wait_for_load_state(
"domcontentloaded"
)
# Scroll down to load all products
for _ in range(10):
await page.evaluate(
"window.scrollTo(0, document.body.scrollHeight)"
)
await asyncio.sleep(2)
content = await page.content()
soup = BeautifulSoup(
content,
"html.parser"
)
new_links = [
(link,) for link in extract_product_links(soup)
if not cursor.execute(
"SELECT COUNT(*) FROM product_links WHERE product_link=?",
(link,)
).fetchone()[0]
]
if new_links:
save_links_to_db(cursor, new_links)
cursor.connection.commit()
print(
f"Scraped {len(new_links)} new URLs "
f"from Page {page_number}"
)
else:
print(
f"No new URLs found on Page {page_number}, "
"moving to next category..."
)
break # Stop if no new URLs are found
page_number += 1
await asyncio.sleep(
random.uniform(5, 7) # Randomized delay (5 to 7 seconds)
)
The purpose of the scrape_category() function is to collect product links from a given category on the Noon website, iterating through its pages. Since products are spread across many pages, this function makes sure we gather links from every available page.
The function starts with page_number set to 1, the first page of the category. A loop then works through the pages in order: the page URL is built by appending the page parameter to the base category URL, and the script navigates there with Playwright and waits for it to load.
Like many modern websites, Noon loads additional products as you scroll down. To mimic this, the function repeatedly calls window.scrollTo(0, document.body.scrollHeight), ensuring all products on the page are rendered before scraping begins. A short delay between scrolls simulates real browsing and helps avoid detection.
Once the page is fully loaded, product links are extracted by parsing the HTML with BeautifulSoup. Links are saved to the database only after a duplicate check. If new links are found, they are stored and the script continues with the next page; if no new links turn up, the loop breaks and the scraper moves on to the next category.
Fetching Product Links from All Categories
async def fetch_product_links():
"""
Main function to scrape product links from all categories.
This function:
- Reads category URLs from a file.
- Launches a Chromium browser instance using Playwright.
- Sets custom HTTP headers for better request handling.
- Iterates through each category URL and scrapes product links.
- Stores extracted product links in an SQLite database.
- Ensures proper cleanup by closing database connections
and the browser after execution.
Returns:
- None (Results are stored in the database).
"""
category_urls = get_category_urls()
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
page = await browser.new_page()
await page.set_extra_http_headers({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
})
conn = sqlite3.connect(db_filename)
cursor = conn.cursor()
for base_url in category_urls:
print(f"\nScraping category: {base_url}")
await scrape_category(page, cursor, base_url)
conn.close()
await browser.close()
print("\nFinal scraping completed! All links saved to the database.")
The fetch_product_links() function serves as the overall entry point to scrape product links within various categories. The function controls the entire process of scraping by coordinating fetching category URLs, opening the web browser, and saving the scraped links in the database.
First, the function retrieves the list of category URLs by calling get_category_urls(), which reads them from a file. Next, Playwright launches a Chromium browser session so the script can interact with web pages much as a human user would. Custom HTTP headers, such as User-Agent, Accept-Language, and Accept-Encoding, are set to resemble a real browser and reduce the chance of the scraper being flagged as automated traffic.
For each category URL, the function calls scrape_category(), which performs the scraping for that category. As links are collected, they are stored in the SQLite database so the data stays structured and easy to access later.
After all categories are processed, the function closes the browser and the database connection to release resources, and finishes by printing a success message confirming that all product links have been scraped and saved.
Running the Scraping Process
# Initialize database
init_db()
# Run the async function
asyncio.run(fetch_product_links())
The last section of the script creates the database and subsequently calls the main scraping function in an asynchronous manner.
First, the init_db() function is executed to create the SQLite database and define the required table (product_links) to store the scraped product links. This way, the database is prepared before scraping begins.
Then, asyncio.run(fetch_product_links()) is invoked to execute the fetch_product_links() coroutine. Because fetch_product_links() is an async function, we rely on asyncio to manage asynchronous operations such as navigating web pages and waiting for content to load. Calling asyncio.run() kicks off the whole scraping process, which collects product links from all categories, stores them in the database, and closes resources safely once the operation is done.
This organization guarantees that the script runs smoothly, manages asynchronous operations correctly, and cleans up afterwards, offering a seamless and dependable scraping experience.
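One practical note: asyncio.run() cannot be called from inside an already-running event loop, which is the situation in Jupyter notebooks. If you are following along in a notebook, a common workaround is to await the coroutine directly instead:
# In a Jupyter notebook (which already runs an event loop), skip asyncio.run()
init_db()
await fetch_product_links()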
STEP 2 : Product Data Scraping From Product Links
Importing Libraries
import sqlite3
import random
import time
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
These imports cover database storage (sqlite3), randomized delays (random and time), browser automation through Playwright's synchronous API, and HTML parsing with BeautifulSoup.
Database Setup and Operations
# Database setup and operations
def connect_db():
"""
Establish a connection to the SQLite database.
This function:
- Connects to the 'noon_eyeglasses.db' database.
- Prints a message indicating the connection status.
- Returns a SQLite connection object.
Returns:
- sqlite3.Connection: A connection object to interact with the database.
"""
print("Connecting to the database...")
return sqlite3.connect("noon_eyeglasses.db")
The first step in this code is to establish a connection with the SQLite database where all the scraped product data will be stored. The function connect_db() connects to a database named "noon_eyeglasses.db". When the function is called, it prints a message to indicate that the connection process is starting. Once the connection is successfully made, it returns a connection object, which is used for further interactions with the database. This connection object allows the program to execute commands such as inserting new data or querying the existing data, helping us store and manage the product information efficiently.
Modifying the Product Links Table
def alter_product_links_table():
"""
Modify 'product_links' table to add the 'status' column
if it does not exist.
This function:
- Connects to the SQLite database.
- Retrieves the table schema using PRAGMA.
- Checks if the 'status' column is already present.
- If missing, adds 'status' (INTEGER, DEFAULT 0) to track
processing status.
- Commits the changes and closes the connection.
The 'status' column helps in managing scraping workflows
by indicating whether a product link has been processed.
"""
print(
"Checking if 'status' column exists in product_links table..."
)
conn = connect_db()
cursor = conn.cursor()
# Check if the 'status' column exists
cursor.execute("PRAGMA table_info(product_links)")
columns = cursor.fetchall()
# Check if the 'status' column is present in the table
if any(col[1] == 'status' for col in columns):
print(
"Column 'status' already exists. No changes made."
)
else:
# If the column doesn't exist, add it
print(
"Altering product_links table to add 'status' column..."
)
cursor.execute(
"ALTER TABLE product_links ADD COLUMN status INTEGER DEFAULT 0"
)
conn.commit()
print("Column 'status' added successfully.")
conn.close()
In this section, the function alter_product_links_table() checks whether the 'status' column exists in the product_links table. This 'status' column is crucial for tracking the progress of product links during the scraping process. The function starts by connecting to the SQLite database and retrieving the table schema, which contains details about the columns of the table. It then checks if the 'status' column is already present. If the column is found, the function simply prints a message stating that no changes are needed. However, if the column is missing, it proceeds to add the 'status' column with a default value of 0. This column will later be used to mark product links as processed (status 1) or unprocessed (status 0). Once the modification is done, the function commits the changes to the database and closes the connection, ensuring that the changes are saved and the system is ready for the next steps.
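To make the col[1] check less mysterious: PRAGMA table_info() returns one row per column, where index 0 is the column position, index 1 is the column name, and index 2 is the declared type. A quick way to inspect the schema while debugging might look like this:
import sqlite3

conn = sqlite3.connect("noon_eyeglasses.db")
for cid, name, col_type, notnull, default, pk in conn.execute(
    "PRAGMA table_info(product_links)"
):
    print(cid, name, col_type)
conn.close()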
Creating the Product Data Table
def create_noon_product_data_table():
"""
Create the 'noon_product_data' table if it does not exist.
This table stores extracted product details, including:
- 'url' (TEXT): Product page URL.
- 'page_title' (TEXT): Title of the product page.
- 'model_number' (TEXT): Product's model number.
- 'rating' (TEXT): Average customer rating.
- 'review_count' (TEXT): Number of customer reviews.
- 'sold_by' (TEXT): Name of the seller.
- 'seller_rating' (TEXT): Seller's overall rating.
- 'positive_review_percentage' (TEXT): Percentage of
positive reviews.
- 'specifications' (TEXT): Product specifications in text.
- 'sale_price' (TEXT): Discounted price of the product.
- 'discount_price' (TEXT): Original price before discount.
- 'savings' (TEXT): Amount saved due to discount.
The function:
- Connects to the database.
- Creates the table with necessary fields.
- Commits changes and closes the connection.
"""
print("Creating 'noon_product_data' table...")
conn = connect_db()
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS noon_product_data (
url TEXT,
page_title TEXT,
model_number TEXT,
rating TEXT,
review_count TEXT,
sold_by TEXT,
seller_rating TEXT,
positive_review_percentage TEXT,
specifications TEXT,
sale_price TEXT,
discount_price TEXT,
savings TEXT
)""")
conn.commit()
conn.close()
This section of the code defines the create_noon_product_data_table() function, which ensures that a table named noon_product_data exists in the database to store all the product details that will be scraped. The table is designed to hold essential information about each product, such as the product’s URL, title, model number, customer ratings, review count, seller details, specifications, prices, and the savings from discounts. If the table doesn’t already exist, the function creates it with the necessary fields to store this data in a structured way. After setting up the table, the function commits the changes to the database and then closes the connection, ensuring that everything is saved correctly for future use. This table will be crucial in organizing the scraped data for easy retrieval and analysis later on.
Creating the Failed URLs Table
def create_failed_urls_table():
"""
Create the 'failed_urls' table if it does not exist.
This table stores URLs that failed during scraping, along
with the failure reason for debugging.
Table Structure:
- 'failed_url' (TEXT): The URL that could not be scraped.
- 'reason' (TEXT): Description of the failure cause.
The function:
- Connects to the database.
- Creates the table if it is not already present.
- Commits the changes and closes the connection.
"""
print("Creating 'failed_urls' table...")
conn = connect_db()
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS failed_urls (
failed_url TEXT,
reason TEXT
)""")
conn.commit()
conn.close()
The create_failed_urls_table() function is responsible for setting up a table called failed_urls in the database to store information about URLs that could not be scraped. Sometimes, due to network issues, website changes, or other technical difficulties, certain URLs may fail during the scraping process. This table helps keep track of those failed attempts by storing the URL along with a description of the reason for the failure. The table includes two main columns: one for the URL (failed_url) and one for the failure reason (reason). If this table does not already exist, the function will create it, ensuring that any failed URLs are logged for future analysis or retrying. After creating the table, the function commits the changes to the database and closes the connection, ensuring that the system is updated and ready for further operations.
Retrieving URLs to Scrape
def get_urls_to_scrape():
"""
Retrieve product URLs that need to be scraped.
This function:
- Connects to the database.
- Queries the 'product_links' table for URLs where
the 'status' column is set to 0 (pending scraping).
- Fetches and returns a list of URLs.
Returns:
list: A list of product URLs that need scraping.
"""
print("Fetching URLs to scrape with status 0...")
conn = connect_db()
cursor = conn.cursor()
cursor.execute(
"SELECT product_link FROM product_links WHERE status = 0"
)
urls = cursor.fetchall()
conn.close()
print(f"Found {len(urls)} URLs to scrape.")
return [url[0] for url in urls]
The get_urls_to_scrape() function is designed to fetch the product URLs that need to be scraped from the database. It connects to the database and queries the product_links table for URLs where the 'status' column is set to 0, which indicates that these URLs are pending and have not been processed yet. The function retrieves these URLs and returns them in a list format. By doing so, it ensures that only the unprocessed URLs are selected for scraping. After fetching the URLs, the function closes the database connection and prints the number of URLs found. This helps in tracking how many URLs are waiting to be scraped in the current session.
Updating URL Status
def update_url_status(url, status):
"""
Update the scraping status of a product URL.
This function:
- Connects to the database.
- Updates the 'status' column in 'product_links'
for the given product URL.
- Commits the change and closes the connection.
Args:
url (str): The product URL to update.
status (int): The new status value.
Returns:
None
"""
print(f"Updating status of URL {url} to {status}...")
conn = connect_db()
cursor = conn.cursor()
cursor.execute(
"UPDATE product_links SET status = ? WHERE product_link = ?",
(status, url)
)
conn.commit()
conn.close()
The update_url_status() function is responsible for updating the scraping status of a specific product URL in the database. After scraping a product page, the status of that URL needs to be updated to reflect whether the scraping was successful or not. This function connects to the database, locates the product_links table, and updates the status column for the given URL with the new status value. The status is typically used to track the progress of scraping, where a status of 0 might indicate pending, and a status of 1 might indicate completed. Once the update is made, the function commits the changes to the database and then closes the connection. This ensures that the database always reflects the most current status of each URL.
Saving Failed URLs
def save_failed_url(url, reason):
"""
Save a failed URL along with the failure reason.
This function:
- Connects to the database.
- Inserts the failed URL and its reason into the
'failed_urls' table.
- Commits the change and closes the connection.
Args:
url (str): The product URL that failed to scrape.
reason (str): The reason for the failure.
Returns:
None
"""
print(f"Saving failed URL {url} with reason: {reason}...")
conn = connect_db()
cursor = conn.cursor()
cursor.execute(
"INSERT INTO failed_urls (failed_url, reason) VALUES (?, ?)",
(url, reason)
)
conn.commit()
conn.close()
The save_failed_url() function is used to store the URLs of products that could not be scraped, along with the reason for the failure. During the scraping process, sometimes a URL might fail due to various reasons such as network issues, changes in the website structure, or invalid links. This function ensures that these failed URLs are not ignored but instead saved for later review. It connects to the database and inserts the failed URL along with the reason into the failed_urls table. This helps in tracking and troubleshooting issues during the scraping process. Once the data is inserted, the function commits the changes to the database and closes the connection. This function helps maintain an accurate record of which URLs need to be revisited or debugged.
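Because a failed page never has its status set to 1 in product_links, simply re-running the scraper retries those URLs automatically. If you also want to empty the failure log before a fresh retry pass, a small hypothetical helper along these lines (not part of the original scripts) could be used:
import sqlite3

def reset_failed_urls(db_path="noon_eyeglasses.db"):
    """Hypothetical helper: clear the failure log before a new retry pass."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM failed_urls")
    count = cursor.fetchone()[0]
    cursor.execute("DELETE FROM failed_urls")
    conn.commit()
    conn.close()
    print(f"Cleared {count} logged failures.")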
Browser Setup and Scraping Functions
# Browser setup and scraping functions
def setup_browser():
"""
Launch the browser using Playwright and create a new page.
This function:
- Starts a Playwright session.
- Launches a Chromium browser instance.
- Creates a new browser page.
The browser runs in non-headless mode by default.
Set `headless=True` to run in the background.
Returns:
tuple: (playwright, browser, page)
- playwright: Playwright instance.
- browser: Launched browser instance.
- page: New browser page.
"""
print("Setting up the browser...")
playwright = sync_playwright().start()
browser = playwright.chromium.launch(headless=False) # Set to True for headless mode
page = browser.new_page()
return playwright, browser, page
The setup_browser() function is responsible for setting up the environment necessary for the web scraping process by launching a web browser. It uses the Playwright library to manage the browser and navigate through web pages. This function initiates a Playwright session, launches a Chromium browser, and creates a new page for interacting with websites. By default, the browser runs in a non-headless mode, meaning you can see the browser window as it performs actions. If you want the browser to run in the background without a visible window, you can change the headless parameter to True. The function then returns the Playwright instance, the launched browser, and the newly created page, which are necessary for loading and interacting with the website during scraping.
Setting Custom HTTP Headers for Web Scraping
def set_headers(page, url):
"""
Set HTTP headers for the request using a random user agent.
This function:
- Reads user agents from 'data/user_agents.txt'.
- Selects a random user agent for each request.
- Sets headers like 'user-agent', 'referer', and 'accept'.
Args:
page (Page): Playwright page instance.
url (str): The target URL for setting the referer.
Returns:
None
"""
print("Setting headers for the page request...")
with open("data/user_agents.txt", "r") as f:
user_agents = f.readlines()
user_agent = random.choice(user_agents).strip()
print(f"Using user-agent: {user_agent}")
headers = {
"user-agent": user_agent,
"referer": url,
"accept": "application/json, text/plain, */*"
}
page.set_extra_http_headers(headers)
The set_headers() function is designed to configure custom HTTP headers for web scraping requests made via Playwright. It improves the scraping process by simulating real browser requests, making it harder for websites to detect bots. The function reads a list of user agents from a file (data/user_agents.txt) and randomly selects one for each request, mimicking different users. It also sets the referer header to the provided URL and the accept header to specify the types of content the browser can accept. This makes the request appear more legitimate, helping to bypass basic anti-scraping measures. By using Playwright’s page instance, the function ensures that all requests made from the page carry these headers, reducing the chances of getting blocked.
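The data/user_agents.txt file is expected to hold one user-agent string per line. The entries below are illustrative examples of the kind of desktop browser strings you might include; any reasonably current set will do.
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0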
Visiting a Web Page with Random Delay
def visit_page(page, url):
"""
Visit the given URL with a random delay to prevent
detection and throttling.
This function:
- Introduces a delay between 3 to 5 seconds
before making the request.
- Navigates to the specified URL using Playwright.
- Uses an increased timeout to handle slow-loading pages.
Args:
page (Page): Playwright page instance.
url (str): The target URL to visit.
Returns:
None
"""
print(f"Visiting URL: {url}")
time.sleep(random.uniform(3, 5)) # Random delay between 3 and 5 seconds
page.goto(url, timeout=200000) # Increase timeout if needed
print(f"Page {url} loaded successfully.")
The visit_page() function is crafted to navigate to a specified URL using Playwright, while introducing a random delay between 3 to 5 seconds before making the request. This randomness helps to simulate human-like behavior, making it more difficult for the website to detect and block the scraping activity. Additionally, the function sets an increased timeout (200,000 ms) to handle slow-loading pages, ensuring that even if the page takes longer to load, the function will still wait and complete the task. This approach minimizes the chances of encountering throttling or detection, promoting smoother scraping.
Extracting Page Title
def extract_title(page):
"""
Extract and return the page title.
This function:
- Retrieves the title of the current page
using Playwright's `title()` method.
- Prints the extracted title for debugging.
Args:
page (Page): Playwright page instance.
Returns:
str: The extracted page title.
"""
title = page.title()
print(f"Extracted page title: {title}")
return title
The extract_title() function is designed to retrieve the title of the current web page using Playwright's title() method. Once the title is extracted, it is printed for debugging purposes, helping to confirm that the correct page has been loaded. This function returns the page title as a string, which can be useful for various purposes, such as verifying that the right page has been scraped or for storing the title along with other product details in the database.
Extracting Model Number
def get_model_number(soup):
"""
Extract the model number from the HTML content.
This function:
- Searches for a `<div>` element with class
'modelNumber' using BeautifulSoup.
- Extracts and cleans the text content.
- Splits the text by " : " to isolate the
model number.
- Returns the model number or a default
message if not found.
Args:
soup (BeautifulSoup): Parsed HTML content.
Returns:
str: Extracted model number or a
'not found' message.
"""
model_number = soup.find(
"div",
class_="modelNumber"
)
if model_number:
model_text = model_number.text.strip()
model_num = model_text.split(" : ")[-1]
else:
model_num = "Model number not found."
print(f"Extracted model number: {model_num}")
return model_num
The get_model_number() function is designed to extract the model number from the HTML content of a product page. It searches for a <div> element with the class modelNumber using BeautifulSoup. If the element is found, it extracts the text, cleans it by removing any surrounding whitespace, and splits the text by " : " to isolate the model number. If the model number is not found, the function returns a default message, "Model number not found." The function prints the extracted model number and returns it as a string. This function is helpful for scraping product-specific information from e-commerce websites where model numbers are critical for identifying products.
Extracting Product Rating and Review Count
def get_review_and_rating(soup):
"""
Extract the product rating and number of reviews
from the HTML content.
This function:
- Searches for a `<div>` with class 'sc-9cb63f72-2 dGLdNc'
to extract the rating.
- Searches for a `<span>` with class 'sc-9cb63f72-5 DkxLK'
to extract the review count.
- Cleans and returns the extracted values.
- If elements are missing, returns a default message.
Args:
soup (BeautifulSoup): Parsed HTML content.
Returns:
tuple: (rating_value, review_count)
- rating_value (str): Extracted rating or
'Rating not found.'
- review_count (str): Extracted number of
reviews or 'Review count not found.'
"""
rating = soup.find(
"div",
class_="sc-9cb63f72-2 dGLdNc"
)
reviews = soup.find(
"span",
class_="sc-9cb63f72-5 DkxLK"
)
if rating:
rating_value = rating.text.strip()
else:
rating_value = "Rating not found."
if reviews:
review_count = reviews.text.strip()
else:
review_count = "Review count not found."
print(
f"Extracted rating: {rating_value}, "
f"Review count: {review_count}"
)
return rating_value, review_count
The get_review_and_rating() function is designed to extract both the product's rating and the number of reviews from the HTML content. It searches for a <div> element with the class sc-9cb63f72-2 dGLdNc to extract the product rating, and a <span> element with the class sc-9cb63f72-5 DkxLK to extract the review count. After extracting these values, the function cleans the text to remove any unwanted spaces. If the elements are not found, the function returns default messages: "Rating not found" for the rating and "Review count not found" for the reviews. The function prints the extracted rating and review count and returns them as a tuple. This function is essential for scraping product feedback data from e-commerce websites.
Extracting Seller Information
def get_sold_by(soup):
"""
Extract the seller information from the HTML content.
This function:
- Searches for a `<span>` with class 'allOffers'
to extract the seller name.
- Cleans and returns the extracted text.
- Returns a default message if the element is missing.
Args:
soup (BeautifulSoup): Parsed HTML content.
Returns:
str: The seller name or 'Seller information not found.'
"""
sold_by = soup.find(
"span",
class_="allOffers"
)
if sold_by:
sold_by_text = sold_by.text.strip()
else:
sold_by_text = "Seller information not found."
print(f"Extracted sold by: {sold_by_text}")
return sold_by_text
The get_sold_by() function is responsible for extracting the seller's information from the HTML content. It searches for a <span> element with the class allOffers to retrieve the seller's name. If the element is found, the function cleans the extracted text by stripping any unwanted spaces. If the element is not present, the function returns a default message, "Seller information not found." The function prints the extracted seller name and returns it as a string, which helps in identifying the vendor or seller associated with the product.
Extracting Seller Rating and Positive Review Percentage
def scrape_seller_details(soup):
"""
Extract the seller's rating and positive review percentage.
This function:
- Locates the seller rating using CSS selectors.
- Extracts the percentage of positive reviews.
- Returns 'N/A' if the data is not found.
Args:
soup (BeautifulSoup): Parsed HTML content.
Returns:
tuple: (seller_rating, positive_review_percentage) as strings.
"""
seller_rating_tag = soup.select_one(
"div.sc-fb51bf29-0 span.sc-fb51bf29-1"
)
if seller_rating_tag:
seller_rating = seller_rating_tag.text.strip()
else:
seller_rating = "N/A"
positive_rating_tag = soup.select_one(
"div.sc-cf1d50e0-4 span"
)
if positive_rating_tag:
positive_rating = positive_rating_tag.text.strip()
else:
positive_rating = "N/A"
print(
f"Extracted seller rating: {seller_rating}, "
f"Positive review percentage: {positive_rating}"
)
return seller_rating, positive_rating
The scrape_seller_details() function is designed to extract the seller's rating and positive review percentage from the HTML content. It uses CSS selectors to locate the relevant elements. The function first searches for the seller rating in the div.sc-fb51bf29-0 span.sc-fb51bf29-1 element and extracts the text, returning "N/A" if not found. Similarly, it looks for the positive review percentage in div.sc-cf1d50e0-4 span, and if this is missing, it also returns "N/A". The function prints the extracted values and returns them as a tuple (seller_rating, positive_review_percentage). This data is useful for evaluating the seller's reputation on the platform.
Extracting Product Specifications
def scrape_specifications(soup):
"""
Extract product specifications from the table and return them as a dictionary.
This function:
- Searches for the first <table> element in the HTML.
- Iterates through table rows (<tr>) and extracts key-value pairs from <td> elements.
- Stores specifications in a dictionary.
- Returns an empty dictionary if no table is found.
Args:
soup (BeautifulSoup): Parsed HTML content.
Returns:
dict: A dictionary containing product specifications.
"""
specs = {}
table = soup.find("table")
if table:
rows = table.find_all("tr")
for row in rows:
cols = row.find_all("td")
if len(cols) == 2:
key = cols[0].text.strip()
value = cols[1].text.strip()
specs[key] = value
print(
f"Extracted specifications: {specs}"
)
return specs
The scrape_specifications() function is designed to extract product specifications from a product page's HTML content. It first searches for the first <table> element, which typically contains the specifications. Then, the function iterates through each row (<tr>) of the table, extracting key-value pairs from the <td> elements. These pairs are stored in a dictionary where the key is the specification name (e.g., "Model", "Color") and the value is the specification detail (e.g., "XYZ123", "Red"). If no table is found, the function returns an empty dictionary. The function prints the extracted specifications and returns them in the form of a dictionary, providing structured data for further processing.
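Later in the pipeline this dictionary is stored with str(specifications), which is awkward to parse back reliably. An optional refinement (an assumption on our part, not what the original code does) is to serialize it as JSON so it can be loaded again cleanly during analysis:
import json

# Example specification values, just to illustrate the round trip
specifications = {"Frame Material": "Metal", "Lens Colour": "Grey"}

# Store the dictionary as JSON text instead of str(dict)
specs_json = json.dumps(specifications, ensure_ascii=False)

# Reading it back from the database later is then a one-liner
restored = json.loads(specs_json)
print(restored["Frame Material"])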
Scraping Price Details
def scrape_price_details(soup):
"""
Extract price details from the product page.
This function:
- Searches for elements containing sale price, discount price, and savings.
- Extracts and cleans the text from the corresponding <div> tags.
- Returns default messages if any element is missing.
Args:
soup (BeautifulSoup): Parsed HTML content.
Returns:
tuple: (sale_price, discount_price, savings), where each is a string.
"""
sale_price_tag = soup.find(
"div",
class_="priceNow"
)
discount_price_tag = soup.find(
"div",
class_="priceWas"
)
saving_tag = soup.find(
"div",
class_="priceSaving"
)
if sale_price_tag:
sale_price = sale_price_tag.text.strip()
else:
sale_price = "Sale price not found."
if discount_price_tag:
discount_price = discount_price_tag.text.strip()
else:
discount_price = "Discount price not found."
if saving_tag:
saving = saving_tag.text.strip()
else:
saving = "Saving details not found."
print(
f"Extracted prices:\n"
f" Sale price: {sale_price}\n"
f" Discount price: {discount_price}\n"
f" Savings: {saving}"
)
return sale_price, discount_price, saving
The scrape_price_details() function extracts key pricing information from a product page. It looks for three primary elements on the page: the sale price (priceNow class), the discount price (priceWas class), and the savings (priceSaving class). The function attempts to locate these elements, cleans the extracted text, and returns it as a tuple. If any of the elements are missing, it returns a default message indicating that the price information was not found. This function prints the extracted details for debugging purposes and returns the data in a structured format (tuples of strings) for further processing or storage.
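The extracted prices are raw display strings, typically a currency label plus a formatted number. For analysis you will usually want them as floats; a small helper like the sketch below, assuming prices look roughly like "AED 1,299.00", can do the conversion and returns None for the "not found" placeholder messages:
import re

def parse_price(price_text):
    """Pull the first numeric value out of a price string such as 'AED 1,299.00'.
    Returns None when no number is present (e.g. 'Sale price not found.')."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
    if not match:
        return None
    return float(match.group(0).replace(",", ""))

print(parse_price("AED 1,299.00"))          # 1299.0
print(parse_price("Sale price not found."))  # None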
Closing the Browser
def close_browser(playwright, browser):
"""
Close the Playwright browser session.
This function:
- Closes the launched browser instance.
- Stops the Playwright session to release its resources.
Args:
playwright (Playwright): The running Playwright instance.
browser (Browser): The browser instance to close.
Returns:
None
"""
print("Closing the browser...")
browser.close()
playwright.stop()
The close_browser() function is responsible for safely closing the browser and stopping the Playwright session once the scraping process is complete. It ensures that the resources used by the browser and Playwright instance are properly released. This function accepts the playwright and browser instances as arguments, calls the close() method on the browser, and then stops the Playwright session with the stop() method. By calling this function at the end of the scraping task, it ensures a clean shutdown of the browser environment, preventing memory leaks and leaving the scraping environment ready for the next task.
Scraping and Saving Product Data
# Scraping and data insertion
def scrape_product_data(url):
"""
Scrapes product data from a given URL and inserts it into the database.
This function:
- Initializes a Playwright browser session and sets headers.
- Extracts product details such as title, model number, ratings,
reviews, seller details, specifications, and pricing information.
- Saves the extracted data into the `noon_product_data` table.
- Updates the `product_links` table to mark the URL as scraped.
- Handles errors by saving failed URLs and closing the browser session properly.
Args:
url (str): The product page URL to scrape.
Returns:
None
"""
try:
print(
f"Starting to scrape product data for URL: {url}"
)
playwright, browser, page = setup_browser()
set_headers(page, url)
visit_page(page, url)
title = extract_title(page)
# Get the page content and parse it using BeautifulSoup
html_content = page.content()
soup = BeautifulSoup(
html_content,
"html.parser"
)
model_number = get_model_number(soup)
rating, reviews = get_review_and_rating(soup)
seller = get_sold_by(soup)
seller_rating, positive_rating = scrape_seller_details(soup)
specifications = scrape_specifications(soup)
sale_price, discount_price, savings = scrape_price_details(soup)
# Save the data to noon_product_data table
conn = connect_db()
cursor = conn.cursor()
cursor.execute(
"""
INSERT INTO noon_product_data (
url,
page_title,
model_number,
rating,
review_count,
sold_by,
seller_rating,
positive_review_percentage,
specifications,
sale_price,
discount_price,
savings
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""",
(
url, title, model_number, rating,
reviews, seller, seller_rating,
positive_rating, str(specifications),
sale_price, discount_price, savings
)
)
conn.commit()
conn.close()
# Update the status of the URL in product_links
update_url_status(url, 1)
close_browser(playwright, browser)
except Exception as e:
print(
f"Error processing {url}: {e}"
)
save_failed_url(url, str(e))
close_browser(playwright, browser)
The scrape_product_data() function is responsible for scraping product details from a given URL and saving the extracted data into the database. It starts by initializing a Playwright browser session and setting up the necessary headers before visiting the product URL. The function then extracts key product information, including the page title, which is retrieved using the extract_title() function, as well as the model number, product rating, review count, seller information, seller rating, positive review percentage, product specifications, and pricing details such as the sale price, discount price, and savings. Once the data is extracted, it is inserted into the noon_product_data table in the database. After successful scraping, the status of the URL in the product_links table is updated to reflect that the URL has been processed. If any error occurs during the scraping process, the failed URL and the error message are saved in the failed_urls table, and the browser session is properly closed. This approach ensures a systematic handling of each product URL, with efficient error management and resource handling.
Scraping All URLs
def scrape_all_urls():
"""
Main function to scrape all product URLs.
This function:
- Retrieves all product URLs with a status of 0 from the `product_links` table.
- Iterates through each URL and calls `scrape_product_data(url)`.
- Ensures all URLs are processed sequentially.
Args:
None
Returns:
None
"""
print("Starting the scraping process for all URLs...")
urls = get_urls_to_scrape()
for url in urls:
scrape_product_data(url)
The scrape_all_urls() function is designed to scrape all product URLs that are marked with a status of 0 (indicating they need to be processed) from the product_links table in the database. The function retrieves these URLs and iterates through them, calling the scrape_product_data(url) function for each URL to extract and save the product data. This process ensures that all pending URLs are processed sequentially, one by one. The function manages the flow of the scraping process and ensures each URL is scraped and updated accordingly, providing a streamlined approach for handling multiple URLs.
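Each call to scrape_product_data() already waits a few seconds before loading the page, but if you want to be extra gentle with the site you can add a randomized pause between products. A minimal sketch of that variation, reusing the get_urls_to_scrape() and scrape_product_data() functions defined above:
import random
import time

def scrape_all_urls_politely(extra_delay=(2, 4)):
    """Same flow as scrape_all_urls(), with an additional random pause between products."""
    urls = get_urls_to_scrape()
    for index, url in enumerate(urls, start=1):
        print(f"[{index}/{len(urls)}] {url}")
        scrape_product_data(url)
        time.sleep(random.uniform(*extra_delay))  # brief pause before the next product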
Main Entry Point for the Scraping Script
if __name__ == "__main__":
"""
Entry point for the scraping script.
This script:
- Ensures necessary database tables (`product_links`, `noon_product_data`,
`failed_urls`) exist.
- Calls `scrape_all_urls()` to start the scraping process.
Execution:
Run this script to initiate web scraping for all available product URLs.
Args:
None
Returns:
None
"""
alter_product_links_table()
create_noon_product_data_table()
create_failed_urls_table()
scrape_all_urls()
The script's entry point is defined in the if __name__ == "__main__": block, which ensures that the necessary database tables (product_links, noon_product_data, and failed_urls) are created or altered if they don't already exist. It then triggers the scrape_all_urls() function, which starts the scraping process for all URLs that still need to be processed. Running this script initiates the full workflow: retrieving product data, inserting it into the database, and logging any failed scraping attempts. Everything from table setup to the execution of scraping tasks is handled in this single entry point.
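Once the run completes, the scraped rows can be pulled out of SQLite for analysis. A minimal sketch, assuming pandas is installed, reads the product table into a DataFrame and exports it to CSV:
import sqlite3
import pandas as pd

conn = sqlite3.connect("noon_eyeglasses.db")
df = pd.read_sql_query("SELECT * FROM noon_product_data", conn)
conn.close()

print(df.shape)  # number of scraped products and columns
df.to_csv("noon_eyeglasses_products.csv", index=False)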
Conclusion
The Noon eyewear web scraping project is a Python data-gathering pipeline that applies a two-stage strategy for complete product data extraction. The first stage uses an asynchronous workflow built on Playwright and BeautifulSoup to gather product URLs from category pages, storing them in an SQLite database while handling pagination and dynamically loaded content through scroll simulation. The second stage processes those URLs one by one to extract detailed product information, including specifications, prices, seller names, and customer reviews.
The system includes solid error handling, such as random user-agent rotation, generous timeouts, and logging of failed URLs. Data integrity is maintained through a well-organized SQLite database, with separate tables tracking scraping status and failures so an interrupted run can be resumed without losing progress. The codebase also shows attention to responsible request handling, using randomized waits to avoid detection, and a modular function design that keeps each step easy to maintain and extend.
The approach does demand careful resource management because of the browser automation involved, and request rates may need tuning to stay within the site's policies. Even so, it offers a sound basis for reliably collecting detailed eyewear product information from Noon.com while preserving data quality and process dependability.
Connect with Datahut for top-notch web scraping services that bring you the valuable insights you need hassle-free.