How to Scrape Blinkit’s Fruits and Vegetables Data?
- Shahana farvin
Have you ever wondered how businesses or analysts find out what groceries cost on websites like Blinkit? The trick is something called web scraping, a smart and automated way to gather information from websites.
Think of web scraping like a helpful robot assistant. It goes through web pages, picks out the bits we care about, like product names, prices, or weights, and saves them neatly for us to use. It’s quick, reliable, and way faster than doing it by hand.
Why Blinkit?
Blinkit, which you might remember as Grofers, is one of India’s most popular apps for quick grocery deliveries. It has a wide selection of fresh fruits and vegetables. If we can collect this data regularly, we can learn a lot — like how prices vary, which items are in stock, or how things change with the seasons. That kind of information is super helpful for both businesses and researchers.
How We Scrape Blinkit’s Fruits and Vegetables Data
We break the task down into two easy steps:
First, we gather all the links to individual products listed under the fruits and vegetables section.
Then, we visit each of those links one by one and collect details like the product name, price, weight, and a short description.
This step-by-step approach helps us stay organized and makes it easier to spot and fix any problems if something goes wrong.
Links Collection
In today’s world of e-commerce, data plays a huge role. Whether it's tracking prices or studying customer trends, good data is at the heart of smart decision-making. In our case, we’re in the early stage of the web scraping process — and this step focuses on collecting product links from Blinkit's fruits and vegetables section.
These product links are important because they act like doorways. Once we have them, we can step through each one to gather more useful details like pricing, weight, and descriptions. So, collecting the links is our foundation — the first layer of data we need before diving deeper.
Here’s how the process works behind the scenes: we point our scraper to Blinkit's website, scroll down the page so that all the fruits and veggies are loaded, then collect the links to each product. After that, we save these links in a database. This way, we can come back later and pull more detailed information from each link.
Now that we understand the purpose, let’s take a closer look at how this actually works step by step.
Setting Up the Environment
import sqlite3
import logging
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import time
from datetime import datetime
# Configure logging
logging.basicConfig(filename="scraper.log", level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s")
BASE_URL = "https://blinkit.com"
Let’s now look at the tools our scraper uses to do its job. Just like how a chef needs the right ingredients before cooking, our web scraping script needs a few key packages to run smoothly.
We start by importing everything we need:
SQLite3 helps us connect to a small, lightweight database where we’ll store the product data we collect. Think of it like a notebook for saving all our results.
Logging keeps track of what our scraper is doing. If something goes wrong, these logs will help us figure out what happened.
Playwright is what lets us control a web browser automatically — like telling it where to go, what to click, or when to scroll.
BeautifulSoup is the tool that reads the webpage and pulls out the information we’re interested in.
We also bring in time and datetime to manage waiting periods and to keep track of when our scraper runs.
Before we dive into scraping, we also set up our log file, called scraper.log. This file saves messages like when the scraper starts, what it's doing, or if any errors show up. Each message is time-stamped and labeled so we can understand the flow of the process easily.
Lastly, we define a small but useful variable called BASE_URL, which is just the Blinkit homepage link — https://blinkit.com. We’ll use this later to build full product links as we collect them.
Fetching Page Content
def fetch_page_content(url):
"""
Fetches the full HTML content of a webpage using Playwright with incremental scrolling.
This function launches a Chromium browser, navigates to the specified URL, and
simulates scrolling through the entire page to ensure dynamically loaded content
is captured. It uses a bottom-to-top scrolling technique to trigger lazy loading.
Parameters:
url (str): The URL of the webpage to fetch.
Returns:
str or None: The complete HTML content of the page if successful, None otherwise.
Raises:
Exception: Any exceptions that occur during the browsing session are caught,
logged, and None is returned.
Notes:
- Uses non-headless browser mode (visible) which may be changed to headless=True for production.
- Includes a timeout of 60 seconds (60000ms) for page loading.
- Logs success or failure information to the configured logger.
"""
try:
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page()
page.goto(url, timeout=60000)
scroll_position = 0 # Start from the top
while True:
# Scroll down in increments
scroll_position += 600
page.evaluate(f"window.scrollTo(0, {scroll_position})")
time.sleep(2) # Wait for new content to load
# Get new page height after scrolling
new_height = page.evaluate("document.body.scrollHeight")
# Stop if we can't scroll further
if scroll_position >= new_height:
break
content = page.content()
browser.close()
logging.info(f"Successfully fetched content from {url}")
return content
except Exception as e:
logging.error(f"Error fetching page content from {url}: {e}")
return None
One of the key parts of our scraper is the fetch_page_content function. This function is in charge of opening a web page, scrolling through it, and collecting the full content — kind of like someone visiting a webpage and slowly scrolling down to see everything.
Here’s how it works, step by step:
When we call this function, it opens the link in a visible Chromium browser using Playwright. This visible mode helps us make sure everything loads properly, especially on pages that load content as you scroll.
Starting from the top of the page, it scrolls down 600 pixels at a time. After each scroll, it pauses for 2 seconds. This small wait gives the page time to load more products — since many websites (like Blinkit) load items gradually as you move down.
As it keeps scrolling, it checks whether it’s reached the bottom of the page. This is important because we want to make sure all products — even the ones that show up only at the end — are included.
Once the full page has loaded, the function saves the entire HTML content, closes the browser, and sends the content back so we can extract the data we need.
If something doesn’t work — maybe the page didn’t load or the internet dropped — the function catches the error, logs it in our scraper.log file, and safely returns None so the script doesn’t crash.
Parsing Product Links
def parse_links(html_content):
"""
Parses HTML content and extracts all product links using BeautifulSoup.
This function takes HTML content, particularly from Blinkit product listing pages,
and extracts links to individual product pages by targeting specific CSS selectors
that identify product elements.
Parameters:
html_content (str): The HTML content to parse.
Returns:
list: A list of complete product URLs (with BASE_URL prefixed).
Raises:
Exception: Any exceptions during parsing are caught, logged, and an empty list is returned.
Notes:
- Uses BeautifulSoup's CSS selector to target elements with 'plp-product' data-test-id.
- Prefixes all relative links with BASE_URL to create absolute URLs.
- Logs the number of extracted links for monitoring.
"""
try:
soup = BeautifulSoup(html_content, 'html.parser')
links = [BASE_URL + a['href'] for a in soup.select('.ProductsContainer__ProductListContainer-sc-1k8vkvc-0 > a[data-test-id="plp-product"]')]
logging.info(f"Extracted {len(links)} links.")
return links
except Exception as e:
logging.error(f"Error parsing links: {e}")
return []
Once we have the full content of the page, the next step is to pull out the product links — and that’s exactly what the parse_links function does.
This function uses BeautifulSoup to read through the HTML content, just like skimming through a web page’s behind-the-scenes code. It looks for a specific section of the page that holds all the product listings. This section is marked with a special class name: 'ProductsContainer__ProductListContainer-sc-1k8vkvc-0'.
Inside this section, it searches for all the anchor (<a>) tags that have an attribute called data-test-id="plp-product". These tags hold the actual links to each product.
Each link it finds is just a part of the full address. So, to make it a complete and usable web link, the function adds the BASE_URL (which we defined earlier) at the beginning. This makes sure every link points to the right page on the Blinkit site.
Even if other parts of the layout move around, this function will keep finding the product links as long as the main product container and its data-test-id="plp-product" anchors stay the same. The weak point is the auto-generated container class name, which can change whenever Blinkit updates its front end, so that selector is the first thing to re-check if the scraper suddenly returns zero links.
Once all the links are collected, the function logs how many were found and returns them as a list. And just like earlier steps, if something goes wrong, it writes an error to the log file and returns an empty list instead of crashing.
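If that generated class name does break one day, a possible fallback is to key off the data-test-id attribute alone, since test ids tend to survive restyling better than styled-components classes. The helper below is a minimal sketch of that idea; the function name is ours, and it assumes the attribute is still present on each product anchor.
def parse_links_fallback(html_content):
    """Sketch of a fallback parser that ignores the generated container class
    and matches any anchor tagged as a product card."""
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        # Keep only anchors that actually carry an href.
        return [BASE_URL + a['href'] for a in soup.select('a[data-test-id="plp-product"]') if a.get('href')]
    except Exception as e:
        logging.error(f"Error parsing links with fallback selector: {e}")
        return []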
Saving Links to Database
def save_links_to_db(links, category, db_name="scraped_links.db"):
"""
Saves extracted product links to an SQLite database with metadata.
This function connects to an SQLite database (creates it if it doesn't exist),
creates a table for storing links if needed, and inserts the links along with
their category, current date, and a flag indicating they haven't been scraped yet.
Parameters:
links (list): List of URLs to save to the database.
category (str): Category identifier for the links (e.g., "vegetables", "fruits").
db_name (str, optional): Name of the SQLite database file. Defaults to "scraped_links.db".
Returns:
None
Raises:
Exception: Any database-related exceptions are caught and logged.
Notes:
- Uses SQLite's UNIQUE constraint to prevent duplicate links.
- Sets a default 'scraped' value of 0 to indicate the link hasn't been processed yet.
- Logs warnings for duplicate links instead of failing.
- The table schema includes:
* id: Autoincrementing primary key
* url: The product URL (must be unique)
* category: Product category identifier
* scraped_date: Date when the link was added to the database
* scraped: Flag indicating whether detailed information has been scraped (0=no, 1=yes)
"""
try:
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
# Create table if not exists
cursor.execute("""
CREATE TABLE IF NOT EXISTS links (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE,
category TEXT,
scraped_date TEXT,
scraped INTEGER DEFAULT 0
)
""")
current_date = datetime.now().strftime("%Y-%m-%d")
for link in links:
try:
cursor.execute("INSERT INTO links (url, category, scraped_date, scraped) VALUES (?, ?, ?, ?)",
(link, category, current_date, 0))
except sqlite3.IntegrityError:
logging.warning(f"Duplicate link ignored: {link}")
conn.commit()
conn.close()
logging.info("Links saved successfully to the database.")
except Exception as e:
logging.error(f"Error saving links to database: {e}")
After collecting the product links, the next step is to save them somewhere safe — and that’s what the save_links_to_db function takes care of.
This function connects to a SQLite database, which is like a small local storage system. Once connected, it checks if there’s already a table named links. If the table doesn’t exist, it creates one.
The links table is set up with a few useful columns:
An ID that’s created automatically for each entry
The product URL, which is marked as unique so the same link isn’t saved more than once
The product category, like "fruits and vegetables"
The date the link was saved
A scraped flag, which starts at 0 to show that the product hasn’t been scraped yet
The function then goes through each link in the list. For each one, it tries to insert the product URL, category, current date, and the scraped flag (set to 0 for now).
If the link already exists in the database (a duplicate), the function simply skips it and logs a warning — instead of stopping the whole process. This way, it avoids errors and keeps things running smoothly.
Finally, it saves all changes, closes the connection to the database, and logs a message to confirm that everything was completed successfully.
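Before moving on, it is worth peeking into scraped_links.db to confirm the links actually landed. The snippet below is just an inspection aid, not part of the scraper itself:
import sqlite3

conn = sqlite3.connect("scraped_links.db")
cursor = conn.cursor()
# Count how many links are still waiting to be scraped, per category.
cursor.execute("SELECT category, COUNT(*) FROM links WHERE scraped = 0 GROUP BY category")
for category, pending in cursor.fetchall():
    print(f"{category}: {pending} links pending")
conn.close()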
Orchestrating the Scraping Process
def main():
"""
Main function that orchestrates the web scraping process.
This function defines a dictionary of URLs to scrape, each with its associated category,
and then processes each URL by:
1. Fetching the full page content
2. Parsing the content to extract product links
3. Saving those links to the database
Returns:
None
Notes:
- Currently configured to scrape vegetable and fruit categories from Blinkit.
- Can be extended by adding more URL-category pairs to the urls dictionary.
- Prints a success message when all operations are complete.
"""
urls = {
"https://blinkit.com/cn/fresh-vegetables/cid/1487/1489": "vegetables",
"https://blinkit.com/cn/fresh-fruits/cid/1487/1503": "fruits"
}
for url, category in urls.items():
html_content = fetch_page_content(url)
if html_content:
links = parse_links(html_content)
if links:
save_links_to_db(links, category)
print("Links saved successfully.")
if __name__ == "__main__":
main()
The main function is like the manager of the whole scraping process. It decides which Blinkit pages we want to scrape — for example, the fruits section or the vegetables section — and keeps track of the category each URL belongs to.
For every category and URL, it follows a clear three-step process:
It loads the full page
It extracts the product links
It saves those links into the database
This setup is clean and flexible. So, if you ever want to add more categories later — like dairy or snacks — you can easily plug them in without changing much of the code.
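For example, adding a third category is just one more entry in the urls dictionary. The extra URL below is a placeholder rather than a verified Blinkit link, so swap in the real listing URL from the site:
    urls = {
        "https://blinkit.com/cn/fresh-vegetables/cid/1487/1489": "vegetables",
        "https://blinkit.com/cn/fresh-fruits/cid/1487/1503": "fruits",
        # Hypothetical extra entry -- replace the URL with the real category listing page.
        "https://blinkit.com/cn/your-new-category/cid/XXXX/YYYY": "dairy"
    }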
Now, when you run the script directly (without importing it into another program), the main function runs automatically. After everything is done, it prints a message to let you know that scraping was successful.
Data Collection
Now that we’ve collected all the product links from Blinkit’s fruits and vegetables section, it’s time to take things a step further. This part of the scraper focuses on gathering detailed information about each product — not just the links anymore.
Think of this as the second phase of the project. Earlier, we built a list of doors (product links), and now we’re opening each door to see what’s inside. For every product link, the script visits the page and pulls out useful details like the product’s name, price, weight, and more. All of this information is then stored in our database, ready for analysis later.
In this phase, we’re expanding our scraper’s reach — from just collecting URLs to capturing actual product data. Let’s walk through how this part works and look at the key steps that help keep the scraping process both reliable and efficient.
Imports and Configuration
import sqlite3
import logging
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import time
import json
import random
import re
from datetime import datetime
# Configure logging
logging.basicConfig(filename="blinkit_scraper.log", level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s")
DB_NAME = "scraped_links.db"
At the start of this script, we bring in all the important tools we’ll need to make everything work smoothly, just as we did in the first script.
We begin by importing SQLite3, which lets us connect to and interact with our database — where we’ll store all the product details we scrape.
Next, we use Playwright to open a browser automatically and visit each product page. This makes the process hands-free and much faster than doing it manually.
To read the contents of each webpage and pull out the information we want, we use BeautifulSoup. It helps us navigate through the HTML structure and find exactly what we’re looking for.
We also include a few helpful modules: json to store structured product details, random to vary the user-agent and the delay between requests, and re (short for regular expressions) to search for patterns when cleaning up text.
Finally, we set up logging so everything the scraper does is recorded in a file called blinkit_scraper.log. This is useful for spotting issues or understanding what happened if something goes wrong during scraping.
User-Agent Rotation
# Load user-agents from file
def load_user_agents():
"""
Reads user-agents from a file and returns them as a list.
This function opens and reads the user_agents.txt file, strips each line,
and returns only non-empty lines as a list of user agent strings.
Returns:
list: A list of user agent strings to be used for request rotation.
Raises:
Exception: Any file-related exceptions are caught, logged, and an empty list is returned.
Notes:
- The file should contain one user agent string per line.
- Empty lines are ignored.
- If the file cannot be read, an error is logged and an empty list is returned.
"""
try:
with open("user_agents.txt", "r") as file:
return [line.strip() for line in file.readlines() if line.strip()]
except Exception as e:
logging.error(f"Error reading user_agents.txt: {e}")
return []
USER_AGENTS = load_user_agents()
One smart feature in this script is something called user-agent rotation. This is handled by a function named load_user_agents, which reads from a text file filled with different user-agent strings.
A user-agent is basically a small piece of information sent to a website that says, “Hey, I’m a browser on this device.” It helps the website understand what kind of user is visiting — whether it’s someone on a phone, laptop, or using a specific browser like Chrome or Firefox.
Now, instead of using the same user-agent every time (which might make it obvious that we’re a bot), we randomly change the user-agent for each request. This makes it look like the visits are coming from different people using different devices.
This simple trick helps reduce the chances of getting blocked by the website. It also supports responsible scraping — we want to collect data without putting too much strain on the site or drawing unnecessary attention.
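The user_agents.txt file itself is nothing fancy: one user-agent string per line, with blank lines ignored. The entries below are just examples of the expected format:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0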
Retrieving Unscraped Links
def get_unscraped_links():
"""
Fetch all unprocessed links from the database where scraped = 0.
This function connects to the SQLite database and retrieves all records from the links table
where the scraped flag is set to 0, indicating they haven't been processed yet.
Returns:
list: A list of tuples containing (id, url, category) for unscraped links.
Each tuple contains:
- id (int): The primary key of the link record
- url (str): The product URL to be scraped
- category (str): The category the product belongs to
Raises:
Exception: Any database-related exceptions are caught, logged, and an empty list is returned.
Notes:
- The number of fetched links is logged for monitoring.
- If an error occurs, it is logged and an empty list is returned.
"""
try:
conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()
cursor.execute("SELECT id, url, category FROM links WHERE scraped = 0")
links = cursor.fetchall()
conn.close()
logging.info(f"Fetched {len(links)} unscraped links from the database.")
return links
except Exception as e:
logging.error(f"Error fetching unscraped links: {e}")
return []
The get_unscraped_links function helps us keep everything organized by acting like a to-do list for our scraper.
Here’s how it works: it connects to our SQLite database and looks for all the product links where the scraped value is still set to 0. This tells us that the product details haven’t been collected yet.
By keeping track of which links are done and which ones are pending, this function helps us manage our progress. So, if the script stops in the middle — maybe due to a network issue or system crash — we can easily pick up right where we left off, without starting over.
Each link returned by this function includes three things: the ID (from the database), the URL (where the product is), and the category (like fruits or vegetables). We’ll need all of this information in the next steps to scrape and save the product details properly.
Content Fetching
def fetch_page_content(url):
"""
Fetches the full product page content using Playwright with a rotating User-Agent.
This function launches a Chromium browser with a randomly selected user agent,
navigates to the specified URL, waits to ensure content is loaded,
and then captures the complete HTML content of the page.
Parameters:
url (str): The URL of the product page to fetch.
Returns:
str or None: The complete HTML content of the page if successful, None otherwise.
Raises:
Exception: Any exceptions during browsing are caught, logged, and None is returned.
Notes:
- Uses non-headless browser mode (visible) which may be changed for production.
- Randomly selects a user agent from USER_AGENTS list for request fingerprint randomization.
- Waits 10 seconds after page load to ensure dynamic content is rendered.
- Includes a timeout of 60 seconds (60000ms) for page loading.
- Contains commented code for saving HTML to file for debugging purposes.
- Logs the user agent used for each request for tracking.
"""
try:
user_agent = random.choice(USER_AGENTS) if USER_AGENTS else None # Pick a random User-Agent
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
context = browser.new_context(user_agent=user_agent)
page = context.new_page()
page.goto(url, timeout=60000)
time.sleep(10)
content = page.content()
browser.close()
logging.info(f"Successfully fetched page content for {url} using User-Agent: {user_agent}")
return content
except Exception as e:
logging.error(f"Error fetching page content from {url}: {e}")
return None
The job of actually visiting each product page is handled by the fetch_page_content function. This function uses Playwright to open a Chromium browser, go to the product link, and wait until the page is fully loaded.
We run the browser in non-headless mode, which means it opens up visibly on the screen. This helps make sure that all parts of the page — especially the content loaded through JavaScript — have a chance to appear properly.
To be extra safe, the function also waits 10 more seconds after the page says it’s done loading. This gives any slow-loading elements, like pop-ups or late-appearing product details, time to show up.
This extra wait ensures we capture all the product information, even the parts that don’t appear immediately when the page first opens.
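If the fixed 10-second pause feels wasteful, Playwright can instead wait for a specific element to render. The variant below is a sketch of that idea rather than what the script above does; the function name, the default h1 selector, and the timeout are our own choices:
from playwright.sync_api import sync_playwright

def fetch_page_content_waiting(url, selector="h1", timeout_ms=15000):
    """Sketch of an alternative fetcher that waits for a real element
    instead of sleeping for a fixed 10 seconds."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(url, timeout=60000)
        # Block until the chosen element is attached to the DOM, up to timeout_ms.
        page.wait_for_selector(selector, timeout=timeout_ms)
        content = page.content()
        browser.close()
        return content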
Parsing Product Details
The parsing functions are the most important part of our scraper. Each one is designed to extract a specific detail about the product. These functions work like tools, each picking out different information from the page.
def parse_product_name(soup):
"""
Extracts the product name from a Blinkit product page.
This function uses a CSS selector to locate and extract the product name
from the BeautifulSoup object representing the product page.
Parameters:
soup (BeautifulSoup): The BeautifulSoup object of the product page.
Returns:
str: The product name if found, "N/A" otherwise.
Notes:
- Uses a specific CSS selector targeting the product name element on Blinkit pages.
- If the element is not found, logs a warning and returns "N/A".
- The selector may need updating if Blinkit changes their page structure.
"""
try:
return soup.select_one("#app > div > div > div:nth-child(3) > div > div.Product__ProductWrapper-sc-18z701o-3.eJdNpg > div.Product__ProductWrapperRightSection-sc-18z701o-5.hbMSwc > div.ProductInfoCard__ProductInfoWrapper-sc-113r60q-3.iwxjxo > h1").text.strip()
except AttributeError:
logging.warning("Product name not found.")
return "N/A"
The parse_product_name function finds and collects the product name from the webpage. It uses a CSS selector to locate the title section of the product quickly and accurately.
def parse_net_quantity(soup):
"""
Extracts the net quantity information from a Blinkit product page.
This function locates and extracts the net quantity (weight, volume, or units)
from the BeautifulSoup object of the product page.
Parameters:
soup (BeautifulSoup): The BeautifulSoup object of the product page.
Returns:
str: The net quantity information if found, "N/A" otherwise.
Notes:
- Uses a specific CSS selector targeting the quantity element on Blinkit pages.
- If the element is not found, logs a warning and returns "N/A".
- The selector may need updating if Blinkit changes their page structure.
"""
try:
return soup.select_one("p.ProductVariants__VariantUnitText-sc-1unev4j-6.dhCxof").text.strip()
except AttributeError:
logging.warning("Net quantity not found.")
return "N/A"
The parse_net_quantity function gets the product’s weight or size (like grams or kilograms). This helps us understand how much of the fruit or vegetable is being sold.
def parse_price_details(soup):
"""
Extracts the sale price and MRP (Maximum Retail Price) from a Blinkit product page.
This function locates and extracts both the current sale price and the original MRP
from the BeautifulSoup object of the product page.
Parameters:
soup (BeautifulSoup): The BeautifulSoup object of the product page.
Returns:
tuple or None: A tuple containing (sale_price, mrp) if successful, None otherwise.
- sale_price (str): The current selling price
- mrp (str): The Maximum Retail Price (original price)
Raises:
Exception: Any exceptions during parsing are caught, logged, and None is returned.
Notes:
- Uses specific CSS selectors targeting the price elements on Blinkit pages.
- If elements are not found, returns 'N/A' for the respective values.
- The selectors may need updating if Blinkit changes their page structure.
"""
try:
# Extract sale price
sale_price_element = soup.select_one("div.ProductVariants__PriceContainer-sc-1unev4j-7.gGENtH")
sale_price = sale_price_element.get_text(strip=True) if sale_price_element else 'N/A'
# Extract MRP
mrp_element = soup.select_one("span.ProductVariants__MRPText-sc-1unev4j-8.gNKjjk")
mrp = mrp_element.get_text(strip=True) if mrp_element else 'N/A'
# Return individual price details
return sale_price, mrp
except Exception as e:
logging.error(f"Error parsing price details: {e}")
return None
The parse_price_details function collects two key prices from the product page:
Sale Price – the current price the customer pays.
MRP (Maximum Retail Price) – the original price before any discounts.
This helps us compare prices, track discounts, and analyze pricing trends over time.
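Both values come back as display strings (something like "₹35"), which is fine for storage but awkward for analysis. Since re is already imported, a small helper along these lines can pull out the numeric part later on; the helper is ours, not part of the original script:
import re

def price_to_number(price_text):
    """Extract the numeric part of a scraped price string, e.g. '₹35' -> 35.0."""
    match = re.search(r"\d+(?:\.\d+)?", price_text or "")
    return float(match.group()) if match else None

print(price_to_number("₹35"))   # 35.0
print(price_to_number("N/A"))   # None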
def parse_details(soup):
"""
Extracts detailed product information from a Blinkit product page.
This function locates and extracts all available product details such as
description, ingredients, nutritional information, etc. from the product
details section of the page.
Parameters:
soup (BeautifulSoup): The BeautifulSoup object of the product page.
Returns:
str: A JSON string containing key-value pairs of product details.
Returns an empty JSON object string "{}" if no details are found.
Raises:
Exception: Any exceptions during parsing are caught, logged, and an empty
JSON object string is returned.
Notes:
- Uses specific CSS selectors targeting the product details section.
- Each key-value pair in the details section is extracted and added to a dictionary.
- The dictionary is then converted to a formatted JSON string with indentation.
- If an error occurs processing a specific detail, that detail is skipped.
- The selector may need updating if Blinkit changes their page structure.
"""
try:
details_section = soup.select("#app > div > div > div:nth-child(3) > div > div.Product__ProductWrapper-sc-18z701o-3.eJdNpg > div.Product__ProductWrapperLeftSection-sc-18z701o-4.fkQLTf > div:nth-child(3) > div > div.ProductDetails__RemoveMaxHeight-sc-z5f4ag-3.fOPLcr > div")
if not details_section:
logging.warning("No product details found in the given HTML.")
return json.dumps({})
details = {}
for div in details_section:
try:
key_element = div.find("p")
value_element = div.find("div")
if key_element and value_element:
key = key_element.get_text(strip=True)
value = value_element.get_text(strip=True)
details[key] = value
else:
logging.warning("Missing key-value pair in a product highlight div.")
except Exception as e:
logging.error(f"Error processing a highlight div: {e}")
continue
return json.dumps(details, indent=4)
except Exception as e:
logging.error(f"Error parsing product highlights: {e}")
return json.dumps({})
The parse_details function extracts extra product information that might differ from one item to another. It looks through the product details section, collects key-value pairs (like "Shelf Life: 3 days" or "Country of Origin: India"), and organizes them into a clear JSON format. This makes the data easy to store and analyze later.
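To give a feel for the output, the JSON string returned for a typical produce item might look something like this (the values are illustrative, built from the kinds of key-value pairs mentioned above, not real scraped data):
{
    "Shelf Life": "3 days",
    "Country Of Origin": "India"
}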
def parse_product_details(html_content):
"""
Parses all product details from the HTML content of a Blinkit product page.
This function serves as an orchestrator that calls individual parsing functions
to extract different components of product information and combines them into
a comprehensive product data dictionary.
Parameters:
html_content (str): The raw HTML content of the product page.
Returns:
dict or None: A dictionary containing all parsed product details if successful,
None otherwise. The dictionary includes:
- name (str): Product name
- net_quantity (str): Product quantity/weight
- sale_price (str): Current selling price
- price (str): Original MRP (Maximum Retail Price)
- product_details (str): JSON string of additional product details
Raises:
Exception: Any exceptions during parsing are caught, logged, and None is returned.
Notes:
- Creates a BeautifulSoup object from the HTML content for parsing.
- Calls specialized parsing functions for each data point.
- Logs the extracted product data for monitoring and debugging.
"""
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        # Parse the price pair once and fall back to "N/A" values if the price block could not be read.
        sale_price, mrp = parse_price_details(soup) or ("N/A", "N/A")
        product_data = {
            "name": parse_product_name(soup),
            "net_quantity": parse_net_quantity(soup),
            "sale_price": sale_price,
            "price": mrp,
            "product_details": parse_details(soup)
        }
logging.info(f"Extracted product details: {product_data}")
return product_data
except Exception as e:
logging.error(f"Error parsing product details: {e}")
return None
The parse_product_details function acts like the master coordinator. It brings together all the smaller parsing functions to collect complete product information in one go.
It calls each of the specialized functions — the ones that extract the name, quantity, price, and other details — and then neatly combines everything into a single dictionary. This structured format makes it easy to work with the data later.
This modular approach keeps the code clean and easy to update. If you ever want to extract more details from the page, you can simply add another parsing function and plug it into this one.
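For instance, if you later wanted the product image URL as well, you could add one more small parser and a single extra line in the dictionary. The selector below is a hypothetical placeholder, not a verified Blinkit selector:
def parse_product_image(soup):
    """Hypothetical extra parser -- adjust the selector to the real image element on the page."""
    try:
        return soup.select_one("img")["src"]
    except (AttributeError, TypeError, KeyError):
        logging.warning("Product image not found.")
        return "N/A"

# Then, inside parse_product_details(), add one more entry to the dictionary:
#     "image_url": parse_product_image(soup),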
Database Storage
def save_product_data(product_data, category, scraped_date, url):
"""
Saves product details into the database and returns success status.
This function connects to the SQLite database, creates the products table
if it doesn't exist, and inserts the product data along with category,
scraping date, and URL information.
Parameters:
product_data (dict): Dictionary containing parsed product details.
category (str): Category of the product (e.g., "vegetables", "fruits").
scraped_date (str): Date when the product was scraped (YYYY-MM-DD format).
url (str): URL of the product page.
Returns:
bool: True if data was successfully saved, False otherwise.
Raises:
Exception: Any database-related exceptions are caught, logged, and False is returned.
Notes:
- Creates the products table if it doesn't exist.
- The table schema includes:
* url: URL of the product page
* name: Product name
* net_quantity: Weight/quantity of the product
* sale_price: Current selling price
* price: Original MRP (Maximum Retail Price)
* category: Product category
* product_details: JSON string of additional product details
* scraped_date: Date when the product was scraped
- Logs success or failure for debugging and monitoring.
"""
try:
conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
url TEXT,
name TEXT,
net_quantity TEXT,
sale_price TEXT,
price TEXT,
category TEXT,
product_details TEXT,
scraped_date TEXT
)
"""
)
cursor.execute("""
INSERT INTO products (url, name, net_quantity, sale_price, price, category, product_details, scraped_date)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", (url, product_data["name"], product_data["net_quantity"], product_data["sale_price"],
product_data["price"], category, product_data["product_details"], scraped_date))
conn.commit()
conn.close()
logging.info(f"Product data saved successfully for {url}")
return True
except Exception as e:
logging.error(f"Error saving product data to database: {e}")
return False
The save_product_data function plays a key role in saving the scraped product details to our database.
First, it checks if there’s a table called products in our SQLite database. If it doesn’t exist yet, the function creates one with all the necessary columns — like the product name, quantity, price, other specifications (saved as JSON), URL, category, scraping date, and more.
Once the table is ready, the function takes the product information we’ve collected and inserts it into the database.
This organized structure makes it easy to access and work with the data later — whether for analysis, reporting, or even visualizations using tools like Excel or Power BI.
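As a quick illustration of that last point, the products table can be pulled into pandas once a run has finished. This assumes pandas is installed separately; the scraper itself does not use it:
import sqlite3
import pandas as pd

conn = sqlite3.connect("scraped_links.db")
# Load the scraped products into a DataFrame and export them for analysis.
df = pd.read_sql_query(
    "SELECT name, net_quantity, sale_price, price, category, scraped_date FROM products", conn)
conn.close()
df.to_csv("blinkit_products.csv", index=False)
print(df.head())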
def mark_link_as_scraped(link_id):
"""
Updates the `scraped` column in the `links` table to mark the link as processed.
This function connects to the SQLite database and updates the scraped flag
to 1 for the specified link ID, indicating that the link has been processed.
Parameters:
link_id (int): The ID of the link record to mark as scraped.
Returns:
None
Raises:
Exception: Any database-related exceptions are caught and logged.
Notes:
- Sets the scraped flag to 1 to prevent re-processing in future runs.
- Logs the ID of the link being marked for tracking and debugging.
"""
try:
conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()
cursor.execute("UPDATE links SET scraped = 1 WHERE id = ?", (link_id,))
conn.commit()
conn.close()
logging.info(f"Marked link ID {link_id} as scraped.")
except Exception as e:
logging.error(f"Error updating scraped status for link ID {link_id}: {e}")
The mark_link_as_scraped function is a crucial checkpoint in the scraping process. After a product link has been successfully processed, this function updates its scraped value to 1 in the database.
Why is this so important?
No Extra Work: Once a link is marked as scraped, the scraper knows not to visit it again. This avoids unnecessary repetition and saves time.
Smooth Recovery: If the script stops or crashes partway through, it can pick up exactly where it left off — starting from the next unprocessed link.
Clean Data: It helps keep the products table free of duplicates. Since each product is only scraped once, you won’t end up with the same product saved multiple times.
This small function becomes especially valuable in larger or longer scraping jobs, where reliability and clean, accurate data really matter.
The Scraping Workflow
def scrape_products():
"""
Main function to scrape product details from all unscraped links in the database.
This function orchestrates the complete scraping workflow:
1. Retrieves all unscraped links from the database
2. For each link, fetches the HTML content
3. Parses product details from the HTML
4. Saves the details to the database
5. Marks the link as scraped
6. Introduces a random delay between requests to avoid detection
Returns:
None
Notes:
- If no unscraped links are found, logs an informational message and exits.
- Implements a random delay between 8 and 15 seconds between requests to avoid
overwhelming the server and to mimic human browsing behavior.
- Only marks a link as scraped if the product data was successfully saved.
- Logs progress and completion for monitoring and debugging.
"""
unscraped_links = get_unscraped_links()
if not unscraped_links:
logging.info("No unscraped links found. Exiting scraper.")
return
for link_id, url, category in unscraped_links:
logging.info(f"Processing {url} (Category: {category})")
html_content = fetch_page_content(url)
if html_content:
product_data = parse_product_details(html_content)
if product_data:
scraped_date = datetime.now().strftime("%Y-%m-%d")
if save_product_data(product_data, category, scraped_date, url):
mark_link_as_scraped(link_id)
logging.info(f"Successfully saved and marked {url} as scraped.")
else:
logging.warning(f"Skipping marking {url} as scraped due to save failure.")
time.sleep(random.uniform(8, 15)) # Random delay between processing requests
logging.info("Scraping completed for all available links.")
if __name__ == "__main__":
scrape_products()
The scrape_products function is the heart of the script. It goes through all the product links that haven’t been scraped yet and processes them one by one.
To act more like a real user and avoid putting too much pressure on Blinkit’s servers, the scraper waits for a random amount of time — between 8 and 15 seconds — before moving on to the next link. This small delay helps reduce the chances of being blocked and shows respect for the website’s limits.
The script is designed to run this function only when called directly, so if you ever want to import this code into another project, it won’t start scraping automatically — which gives you more control.
Overall, this scraper follows good scraping practices. It rotates user agents to avoid detection, adds delays between requests, collects detailed product information, and stores everything neatly in a database. The code is modular, meaning it’s easy to update or expand if Blinkit’s site changes or if you want to scrape other sections later.
With clear logs showing each step, you can easily track how the scraper is doing. All in all, it’s a solid tool for gathering and analyzing data from Blinkit’s fruits and vegetables section.
Conclusion
In this project, we successfully built a browser automation and scraping workflow using Blinkit’s fruits and vegetables catalog as an example. By breaking the process into two clear steps — first collecting product links, then visiting each link to gather detailed information — we created a reliable system that collects useful data without overwhelming the website.
To make sure our scraping was both ethical and stable, we followed standard best practices. We rotated user agents, added random wait times between requests, and kept detailed logs of everything the scraper did. Using SQLite as a lightweight database also gave us a smart way to track progress and recover easily if something interrupted the scraping.
The code is structured in a way that’s easy to update or expand, whether you want to scrape more product details or add new categories later.
AUTHOR
I’m Shahana, a Data Engineer at Datahut, where I design robust, scalable data pipelines that turn messy web content into clean, usable datasets—especially in fast-moving industries like e-commerce, grocery delivery, and retail insights.
At Datahut, we work with clients to automate data collection from modern, JavaScript-heavy websites using tools like Playwright and BeautifulSoup. In this blog, I shared a hands-on scraping project built around Blinkit’s Fruits and Vegetables catalog, demonstrating a two-step strategy for gathering product links and detailed data while keeping the process efficient, reliable, and ethical.
If your team is looking to automate product data collection in the grocery and quick-commerce space or beyond, reach out to us through the chat widget on the right. We’d love to help you build a solution that fits your goals.
FAQs
1. What is web scraping and how does it work for Blinkit?
Web scraping is the process of using automated tools to extract data from websites. In the case of Blinkit, we use a scraper to visit product listing pages, collect links to fruits and vegetables, and then gather details like name, price, and weight from each product page.
2. Is it legal to scrape data from Blinkit?
Web scraping legality depends on the site's terms of service and local laws. Always review Blinkit’s terms, avoid overloading their servers, and use the data responsibly for research or analysis.
3. Why collect fruits and vegetables data from Blinkit?
Blinkit offers a wide variety of fresh produce. By scraping this data regularly, businesses and researchers can track price changes, monitor stock availability, and identify seasonal trends.
4. What tools are used for scraping Blinkit data?
We use tools like Playwright to automate the browser, BeautifulSoup to parse HTML, and SQLite to store the scraped data. Logging helps track the scraper’s performance, and user-agent rotation reduces the risk of being blocked.
5. Can this scraping method be used for other categories on Blinkit?
Yes. By adjusting the target URLs and category labels in the script, you can scrape data for other Blinkit categories like dairy, snacks, or packaged goods.