Carrefour UAE Web Scraping Case Study: Tracking Vegetable Prices, Discounts & Availability Over Time
- Anusha P O

Ever wondered what grocery data scraping can reveal about real-time pricing and product availability?
In this case study, we scrape the Carrefour UAE vegetables category over five days to analyze:
Price fluctuations
Discount patterns
Product availability trends
This guide walks through a complete Carrefour UAE Web Scraping pipeline, from collecting product URLs to extracting structured data and preparing it for analysis using Python.
Step-by-Step Web Scraping Workflow for Carrefour UAE
Step 1: Scraping Product URLs from Carrefour UAE Vegetables Category
Before any meaningful analysis can happen, the first step is collecting the right product page links, so this stage focuses on gathering URLs from the vegetables section of the Carrefour UAE website in a clean, structured way. Think of it as creating a shopping list before entering a store: without knowing which aisles to visit, nothing else can move forward smoothly. The vegetables category page acts as the entry point, where dozens of product cards appear as the page loads and expands, each hiding a link to a detailed product page. Instead of copying these links manually, an automated script opens the category page, waits for the content to appear, and scans each product block to extract its URL while ignoring duplicates. As more items load through scrolling or the "Load More" action, the script keeps gathering links until the list is complete, saving them in a structured database plus a backup file for safety. This turns a repetitive manual task into a reliable process: every vegetable product page is captured once and stored neatly, laying the foundation for deeper data extraction and time-based analysis in later stages.
Step 2: Extracting Product Data from Carrefour UAE Product Pages
Once the list of vegetable product URLs was safely stored, the next step was to visit each link and collect the details inside those pages, turning plain addresses into meaningful data. This stage is like walking through each aisle after noting its location, taking time to observe what is actually on the shelf. Using the URLs saved in the database, the script opened every product page one by one, let the page load, and handled small interruptions such as cookie popups so the content could be read cleanly. Each page was then parsed into a structured format, making it easy to extract the vegetable name, price, discount, and availability. As data was collected, it was written back to the database and marked as processed, ensuring the same page would not be visited again in future runs, while a copy was saved to a JSON file for easy review or sharing. Missing details and slow-loading pages were handled gracefully, with issues logged rather than stopping the process, so the scraper could continue across many product pages. By the end of this step, the project had moved from a simple list of links to a well-organized dataset describing each vegetable product on the Carrefour website, setting the stage for deeper, time-based analysis.
Step 3: Data Cleaning and Preparation Using OpenRefine
After collecting raw vegetable product data from the Carrefour website, the next step was cleaning it so the information could be trusted and analyzed, and this is where OpenRefine was especially useful. The process began by loading the scraped file into OpenRefine, which displays the data in a spreadsheet-like view that makes small issues easy to spot. Price values often contained currency symbols, stray dots, or commas that would cause problems during analysis, so these were removed or standardized until every price followed the same clean format. Some columns were out of order or held mixed values, so they were rearranged and refined to keep related information grouped logically. Duplicate product URLs, which can quietly accumulate across repeated scraping runs, were identified and removed so the same vegetable would not be counted twice. OpenRefine lets each adjustment be previewed before it is applied, which helps maintain accuracy throughout. By the end of this stage, the raw dataset had become a consistent, analysis-ready table, making it far easier to explore trends in the Carrefour UAE vegetables data with confidence.
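The same cleaning steps can also be scripted. Below is a minimal, illustrative Python sketch (not part of the original OpenRefine workflow) that normalizes price strings and drops duplicate URLs from a list of scraped records; the field names and sample values are assumptions chosen to match the scraper's output shape.

```python
def clean_records(records):
    """Normalize price strings and drop duplicate URLs, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for rec in records:
        url = rec.get("url")
        if url in seen:
            continue  # duplicate URL from a repeated scraping run
        seen.add(url)
        # Strip the currency label and treat a comma as a decimal separator.
        raw = str(rec.get("price", "")).replace("AED", "").replace(",", ".").strip()
        try:
            price = float(raw)
        except ValueError:
            price = None  # leave unparseable prices for manual review
        cleaned.append({**rec, "price": price})
    return cleaned

rows = [
    {"url": "https://example.com/p1", "price": "AED 5,25"},
    {"url": "https://example.com/p1", "price": "AED 5,25"},  # duplicate
    {"url": "https://example.com/p2", "price": "3.50"},
]
print(clean_records(rows))
```

The comma-to-dot conversion mirrors the `replace(',', '.')` step used later in the blog's own price parsing, so both paths produce the same numeric format.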
All-in-One Toolset for Smarter Data Extraction
A smooth scraping workflow depends heavily on choosing the right tools, and this project brings together a set of Python libraries that make the journey from opening a webpage to saving clean product details manageable even for someone new to data extraction. Browser automation plays a central role: Playwright and Selenium load each vegetable product page the way a real user would, allowing the script to scroll, wait, and interact with different elements. Playwright Stealth adjusts small browser behaviors so the site treats the scraper like a regular visitor rather than a bot. Once a page loads, BeautifulSoup turns the raw HTML into a navigable structure, making it easy to extract names, prices, discounts, and other details without wrestling with messy markup. Behind the scenes, sqlite3 stores product links in a lightweight database and tracks what has already been processed, while JSON provides a simple backup format that can be opened or shared with ease. Supporting libraries like logging and datetime document each step and record when data is collected, and modules such as random and time add natural delays so the scraper behaves more realistically. Firefox browser options and Selenium's explicit waits add stability by giving each product page enough time to load fully before extraction begins. Together these tools form a pipeline that handles dynamic pages, organizes the collected information, and keeps the whole process easy to follow.
Step 1: Scraping Product URLs from the Vegetables Section
Importing Libraries
IMPORTS
import sqlite3
import json
import logging
from datetime import datetime
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
from bs4 import BeautifulSoup
import time

The code uses a few essential Python tools, such as Playwright for loading the webpage, BeautifulSoup for reading the HTML, and SQLite for storing the scraped results, to collect basic product details from the site.
Understanding the Logging Setup
LOGGING SETUP
logging.basicConfig(
    filename='/home/anusha/Desktop/DATAHUT/carrefouruae/Log/carrefour.log',
    filemode='a',
    format='%(asctime)s - %(levelname)s - %(message)s',
    level=logging.DEBUG
)
"""This block configures Python's built-in logging module to track the scraper's runtime activity, errors, and debug information"""

The logging setup acts like a small diary for the scraper, quietly recording what happens during the run. Its parameters decide where the log file is saved, how new entries are appended, what details each message includes, and the minimum level of information to capture, creating a clear trail that helps trace errors or unusual behavior while working with data from the Carrefour UAE vegetables section.
SQLite Table for Storing Scraped Vegetable Product URLs
DATABASE SETUP
conn = sqlite3.connect('/home/anusha/Desktop/DATAHUT/carrefouruae/Data/carrefour.db')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS product_urls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    scraped_date TEXT,
    processed INTEGER DEFAULT 0
)''')
conn.commit()
"""This block connects to a SQLite database and ensures that a table for storing scraped product URLs exists"""

The database setup acts as a small storage room where each scraped vegetable product link can be kept for later use. The code connects to a SQLite file, opens a cursor to communicate with it, and creates a table only if it does not already exist. The table includes an automatically increasing ID, a column for each product URL, a date field recording when the link was collected, and a processed flag that shows whether the data from that URL has already been extracted, allowing the workflow to resume cleanly across runs.
JSON Backup System for Product URLs
JSON SAVE SETUP
json_file = '/home/anusha/Desktop/DATAHUT/carrefouruae/Data/carrefour_urls.json'
try:
    with open(json_file, 'r') as f:
        json_data = json.load(f)
except FileNotFoundError:
    json_data = []
"""This block manages backup storage of scraped product URLs in a JSON file. It ensures that previously scraped data is preserved between script runs"""

The JSON save setup works like a small notebook keeping a backup copy of every vegetable product URL collected. The code points to a file where the data should be stored, then attempts to open it so previously saved entries can be loaded back into memory; if the file does not exist yet, an empty list is created instead, letting the script start fresh without errors.
Basic HTTP Headers for Scraping
HEADERS
headers = {
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:136.0) Gecko/20100101 Firefox/136.0",
    "Connection": "keep-alive",
}
"""This dictionary stores the HTTP headers that will be sent along with requests made by the browser (Playwright)"""

The headers dictionary acts like a small introduction the scraper gives when requesting a page. Each entry plays a simple role: Accept tells the server what kinds of content the client can receive, User-Agent identifies the browser being used, Connection keeps the link open for smoother communication, and Accept-Encoding lists the compression formats that can be handled. Note that as written the dictionary is only defined; for Playwright to actually send these headers, they would typically be passed to the browser context, for example via browser.new_context(extra_http_headers=headers).
Core Scraper Function for Collecting Product Links
SCRAPER FUNCTION
def scrape_carrefour():
    """Scrape product URLs from Carrefour UAE (Vegetables category)"""
    base_url = "https://www.carrefouruae.com"
    target_url = "https://www.carrefouruae.com/mafuae/en/c/F11660500"

    with sync_playwright() as p:
        browser = p.firefox.launch(headless=False)
        context = browser.new_context()
        stealth_sync(context)
        page = context.new_page()

        try:
            page.goto(target_url, timeout=60000)
            logging.info("Navigated to Carrefour Vegetables page.")

The first part of the scraper prepares everything needed to read the vegetables section, starting with two URL references that guide the browser to the right category page. The function launches a Playwright Firefox browser, applies stealth settings to reduce detection, and opens a new page where the content will load. The script then navigates to the vegetables section and logs the action as a record of what is happening behind the scenes. This setup is like opening the front door of a store before exploring the shelves, giving a clear starting point before any real extraction begins.
            # Accept cookies if popup appears
            try:
                page.wait_for_selector('#onetrust-accept-btn-handler', timeout=10000)
                page.click('#onetrust-accept-btn-handler')
                logging.info("Accepted cookies.")
            except Exception:
                logging.info("No cookie popup detected.")

            product_urls = []
            last_height = 0
            scraped_urls_in_run = set()  # track URLs scraped in *this run* only

            while True:
                soup = BeautifulSoup(page.content(), 'html.parser')
                product_tags = soup.find_all('div', class_='max-w-[134px]')
                for tag in product_tags:
                    a_tag = tag.find('a', href=True)
                    if a_tag:
                        product_url = base_url + a_tag['href'].split('?')[0]
                        if product_url in scraped_urls_in_run:
                            continue  # skip already scraped in this run
                        scraped_urls_in_run.add(product_url)
                        scraped_date = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

In the second part, the scraper handles small website interactions and starts gathering the product links on screen. It first checks whether a cookie popup appears and, if so, dismisses it so the page can load fully without interruption. The script then creates a few helpers to store the links found during the run and to avoid duplicates. As the page grows, the HTML is parsed with BeautifulSoup, and each product box is scanned for the anchor tag carrying the product's path. The URL is stripped of query parameters, a timestamp is recorded, and the link is added to a set tracking what has already been captured in this run, keeping the process organized.
                        # Insert into DB
                        c.execute('INSERT INTO product_urls (url, scraped_date, processed) VALUES (?, ?, ?)',
                                  (product_url, scraped_date, 0))
                        conn.commit()

                        # Append to JSON
                        json_data.append({
                            "url": product_url,
                            "scraped_date": scraped_date,
                            "processed": 0
                        })
                        product_urls.append(product_url)

                logging.info(f"Scraped {len(product_urls)} URLs in this iteration.")

The third part stores every product link safely so it can be used later without re-scraping. Whenever a new URL is found, the script inserts it into the SQLite table along with the collection time and a flag showing the link has not yet been processed. At the same time, the same record is appended to the JSON backup list, creating a secondary copy that protects the data even if the workflow is restarted. This dual-saving approach is like writing a note in two places, one in a structured database and one in an easy-to-read file, making the process more reliable.
                # Attempt to click "Load More"
                try:
                    load_more_selector = 'button:has-text("Load More")'
                    page.wait_for_selector(load_more_selector, timeout=10000)
                    page.click(load_more_selector)
                    time.sleep(3)
                    logging.info("Clicked Load More.")
                except Exception:
                    logging.info("No more Load More button, ending scroll.")
                    break

            logging.info(f"Total URLs scraped: {len(product_urls)}")

        except Exception as e:
            logging.error(f"Error while scraping: {e}")
        finally:
            browser.close()

This part controls pagination by checking whether the page offers a "Load More" button that reveals additional vegetable products. The script waits for the button, clicks it when available, pauses briefly to let new items render, and logs the action. If the button does not appear, the page has reached the end of the list, so the loop breaks and the scraper finishes. Like turning catalog pages until the last one, this ensures no product link is missed while keeping the process predictable. Any unexpected error is logged, and the finally block guarantees the browser is closed.
Saving the Final Data into a Clean JSON File
    # Save JSON
    with open(json_file, 'w') as f:
        json.dump(json_data, f, indent=4)
    logging.info("Scraping completed and JSON saved.")

At the end of the scraping run, the script opens the JSON file in write mode, stores the updated list of collected product records in a neatly indented structure, and logs a short confirmation that the data has been saved for future use.
Triggering the Vegetable Scraper Only When the Script Runs Directly
MAIN
if __name__ == "__main__":
scrape_carrefour()
""" Main Execution Block: - When the script is run directly, execute the scraping process"""The main block simply checks whether the file is being run on its own and, if so, starts the vegetable-scraping function, ensuring the process activates only during direct execution and not when the script is imported into another program.
Step 2: Collecting Detailed Information from Individual Product Pages
Importing Libraries
IMPORTS
import sqlite3
import json
import os
import random
import time
import logging
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

This import block gathers the tools needed for the second stage of the workflow: modules for database storage, JSON backups, timing, logging, and HTML parsing, while Selenium and its supporting classes handle browser automation by opening pages, waiting for elements to load, and managing timeouts so the script can reliably collect vegetable product data from each page.
Setting Up Basic Configuration Paths for Storing Scraped Data
CONFIGURATIONS
DB_PATH = "/home/anusha/Desktop/DATAHUT/carrefouruae/Data/carrefour.db"
LOG_DIR = "/home/anusha/Desktop/DATAHUT/carrefouruae/Log/data_scraper"
JSON_OUTPUT = "/home/anusha/Desktop/DATAHUT/carrefouruae/Data/product_data.json"
"""
- DB_PATH: Path to the SQLite database where product data will be stored.
- JSON_OUTPUT: Path to the JSON file where data will be exported as backup.
- LOG_DIR: Directory where log files will be saved.
"""

The configuration section defines three paths that control where the scraped information goes: one points to the SQLite database for structured records, another to the JSON file used as a backup copy, and the last to the folder where log files are saved, giving the scraper an organized foundation before it begins collecting data from the Carrefour UAE vegetables section.
Logging Setup
LOGGING
os.makedirs(LOG_DIR, exist_ok=True)
log_file = os.path.join(LOG_DIR, "product_scraper.log")
logging.basicConfig(
    filename=log_file,
    filemode='a',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

The logging setup creates the log folder if it does not already exist, chooses the file where activity will be recorded, and configures the file path, append mode, detail level, and entry format showing the time, message type, and description, making it easier to trace what happens during the vegetable-scraping process.
Initializing a Simple Selenium Browser to Collect Vegetable Product Data
SELENIUM SETUP
options = Options()
options.headless = False
driver = webdriver.Firefox(options=options)
driver.set_page_load_timeout(60)
"""This block is responsible for starting and configuring the web browser that will be controlled by our script"""

The Selenium setup creates a controlled browser window the script can operate: the options object decides whether the browser stays visible, the driver launches Firefox to perform the automated actions, and the page-load timeout gives each page up to 60 seconds to load before the scraper gives up.
SQLite Table to Store Product Details
DATABASE SETUP
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS product_data (
id INTEGER,
url TEXT,
name TEXT,
price REAL,
discount TEXT,
scraped_date TEXT
)
""")
conn.commit()
"""This block creates and prepares the SQLite database where all scraped Carrefour product details will be stored"""
The database setup opens a connection to the chosen storage file, creates a cursor to send instructions, and sets up a table only if it does not already exist, with columns for the product link, name, price, discount, and the time it was scraped, forming a clean and friendly structure for saving item details collected from the vegetables section.
Safe Extraction Helper to Cleanly Collect Data
HELPERS
def safe_extract(soup, selector, attr=None, multiple=False, default=None):
    """Safely extract text or attribute values from an HTML element using BeautifulSoup"""
    try:
        if multiple:
            return [e.get(attr) if attr else e.get_text(strip=True) for e in soup.select(selector)]
        else:
            element = soup.select_one(selector)
            return element.get(attr) if attr else element.get_text(strip=True)
    except Exception:
        return default

The helper function handles the small but important task of pulling information from HTML elements without breaking the scraping run. It checks whether one item or many should be gathered, selects the matching element or elements from the page, and returns either the text or a specific attribute as requested. If anything unexpected happens, such as the element not existing or the structure differing from usual, the function quietly returns a default value instead of raising an error, letting the scraper skip a missing field and continue.
Handling the Cookie Consent Popup for a Smooth Scraping Flow
HELPER TO HANDLE COOKIE POP-UP
def handle_cookie_popup():
    """Handles the cookie consent pop-up on the Carrefour website"""
    try:
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))
        ).click()
        logging.info("Accepted cookies.")
    except Exception:
        pass

The cookie-handling helper clears the consent popup that can block the page from loading properly. It waits up to ten seconds for the accept button to become clickable, presses it when available, and quietly moves on if no popup is present. This simple step keeps the scraper from getting stuck at the very beginning so the rest of the data collection can continue smoothly.
Processing Unfinished Vegetable Links and Extracting Complete Product Details
MAIN SCRAPER
cursor.execute("SELECT id, url, scraped_date FROM product_urls WHERE processed=0")
rows = cursor.fetchall()
logging.info(f"Total URLs to scrape: {len(rows)}")
"""It takes product URLs that have not yet been processed from the `product_urls` table, visits each page, extracts product details, and saves them into the `product_data` table"""
all_data = []
for row in rows:
product_id, url, scraped_date = row
try:
"""
Step 1: Open the product page and prepare soup object
- Load the URL in Selenium
- Handle cookie popup if it appears
- Wait for a short random delay to mimic human browsing
- Convert the loaded page into a BeautifulSoup object
"""
driver.get(url)
handle_cookie_popup()
time.sleep(random.uniform(3, 6))
soup = BeautifulSoup(driver.page_source, 'html.parser')
The main scraper begins by pulling all vegetable product links from the database that are still marked as unprocessed, creating a clear list of pages that still need attention, and for each of these links, the script carefully opens the product page in the browser, waits for it to load naturally, handles any cookie popups that may appear, and then allows a small random delay to make the browsing pattern look more human; once the page settles, the HTML is passed into BeautifulSoup, which turns it into a structured format the script can read, and from there, the scraper gradually extracts details such as the product name, price, image, and other information, storing everything in a separate data table so nothing gets missed or overwritten, making the overall scraping journey from raw links to meaningful product data smooth and organized.
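The random pause used above, time.sleep(random.uniform(3, 6)), can be wrapped into a small reusable helper. The sketch below is illustrative only; the polite_delay name and bounds are assumptions, and the demo uses tiny bounds so it runs quickly.

```python
import random
import time

def polite_delay(min_s=3.0, max_s=6.0):
    """Sleep for a random duration between min_s and max_s seconds and return it."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

d = polite_delay(0.01, 0.02)  # tiny bounds here so the demo finishes instantly
print(round(d, 3))
```

Returning the chosen delay makes the pause observable, which helps when logging how aggressively the scraper is hitting the site.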
Extracting the Main Product Details Like Name, Price, and Discount
EXTRACT PRIMARY DATA
"""Step 2: Extract product name, price and discount"""
name = safe_extract(soup, "h1 span")
price_main = safe_extract(soup, "div.flex.items-baseline div.text-xl")
price_dec = safe_extract(soup, "div.flex.items-baseline div.text-sm.ml-2xs")
try:
price_str = f"{price_main}{price_dec}".replace('..', '.').replace(' ', '').replace('AED', '').replace(',', '.')
price = float(price_str)
except:
price = 0.0
discount = safe_extract(soup, "div.text-md.leading-5.font-bold.ml-2xs.text-c4red-500")
The section responsible for extracting the primary details focuses on gathering the most important information a shopper would notice first, and it starts by pulling the product name from the page so the item can be clearly identified; next, it reads the price, which is often split into two small parts on the page, and these pieces are gently combined, cleaned, and converted into a proper number so the value can be stored accurately, with a fallback of zero in case the format is unusual; finally, the scraper checks whether a discount label is displayed and captures it when present, allowing the data to reflect both regular and offer prices, and this careful step-by-step approach helps create a complete and reliable record of each vegetable product.
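The inline cleanup above works for the usual "AED 5.25" layout, but swallowing every exception hides surprises. Below is a hedged sketch of a more defensive parser, an illustrative alternative rather than the blog's exact code, that pulls the first numeric token out of whatever string the page provides and returns None when nothing numeric is found.

```python
import re

def parse_price(main_part, dec_part=None):
    """Combine split price fragments and return a float, or None if no number is found."""
    text = f"{main_part or ''}{dec_part or ''}"
    # Same normalization idea as the scraper: drop the currency label,
    # treat commas as decimal points, remove stray spaces and doubled dots.
    text = text.replace("AED", "").replace(",", ".").replace(" ", "").replace("..", ".")
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

print(parse_price("AED 5", ".25"))  # 5.25
print(parse_price("7,50"))          # 7.5
print(parse_price(None))            # None
```

Returning None instead of 0.0 also keeps genuinely missing prices distinguishable from free items during later analysis.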
Storing Clean Product Details Safely in the Database
INSERT INTO DB
""" Step 3: Save extracted data into the database """
cursor.execute("""
INSERT INTO product_data (id, url, name, price, discount, scraped_date)
VALUES (?, ?, ?, ?, ?, ?)
""", (
product_id, url, name, price, discount, scraped_date
))
conn.commit()
cursor.execute("UPDATE product_urls SET processed=1 WHERE id=?", (product_id,))
conn.commit()Once the scraper finishes collecting the main details of each vegetable product, this section ensures that everything is stored neatly in the database so nothing gets lost, and it does this by inserting the product’s unique ID, page link, name, cleaned price, discount value, and the original scraped date into a structured table designed to hold the final information; after saving the entry, the script immediately updates the status of that specific link in the product URL table to show that it has already been processed, which prevents the same page from being scraped again in future runs, and this simple two-step cycle—first storing the data, then marking it as complete—helps the entire workflow stay organized, and consistent.
JSON Export with a Simple In-Memory Backup
APPEND TO all_data FOR JSON OUTPUT
"""Step 4 : Save a copy into memory (all_data list)
- This allows exporting later into JSON, CSV, or other formats"""
all_data.append({
"url": url,
"name": name,
"price": price,
"discount": discount,
"scraped_date": scraped_date
})
logging.info(f"Scraped product: {product_id}")
except Exception as e:
logging.error(f"Error scraping URL {url}: {e}")
continue
After each vegetable product is successfully scraped and stored in the database, this part of the script creates a simple in-memory copy by adding the cleaned details to a list called all_data, which acts like a temporary collection tray that gathers every item during the run; keeping this extra copy makes it much easier to export all products at once into formats like JSON or CSV later on, without needing to re-query the database or repeat the scraping process. Each entry in the list includes the product link, name, final price, discount value, and the timestamp marking when it was collected, giving the entire dataset a well-structured and easy-to-use format. Along the way, the script logs each product ID to keep a clear record of progress, and if an error occurs for a particular link, the scraper simply notes the issue and moves on, ensuring the full extraction from the Carrefour UAE vegetables section continues smoothly without stopping midway.
Save To JSON
SAVE TO JSON
"""Save all collected product data into a JSON file"""
with open(JSON_OUTPUT, "w") as f:
    json.dump(all_data, f, indent=4)
logging.info(f"Scraping completed. Data saved to {JSON_OUTPUT}")

The final step turns all the collected vegetable product details into a neatly formatted JSON file, making the data easy to reuse for analysis, reporting, or further automation. Opening the output file in write mode and passing the complete all_data list to json.dump creates a structured snapshot of every product scraped in this session, with indentation that keeps the file readable. A log message then confirms the successful save, closing the loop from scraping to storage.
Closing the Scraper Safely by Releasing All Used Resources
CLEANUP
driver.quit()
conn.close()
"""This block ensures that all resources used during the scraping process are properly released after the script finishes execution"""

The cleanup step brings the run to a responsible finish: driver.quit() closes the browser window and conn.close() shuts down the database connection, making sure every resource opened during the run is released so the system stays stable and ready for the next collection.
Conclusion
Bringing this study to a close, the five-day snapshot of the Carrefour UAE vegetables section shows how even everyday grocery items carry quiet patterns when observed over time. Tracking the same set of vegetables day after day highlighted that prices and discounts are not as static as they appear during a single visit, with small shifts reflecting stock movement, promotional cycles, and short-term demand. Some vegetables remained steady throughout the period, offering a sense of pricing consistency, while others showed brief changes that hinted at ongoing offers or supply adjustments. Beyond the numbers, the process itself demonstrated how time-based scraping turns simple product data into a story, revealing how online shelves evolve rather than staying frozen in one moment. This time-series approach transforms routine scraping into a meaningful learning exercise, showing that even a short window of data collection can uncover valuable insights and build a strong foundation for deeper analysis of real-world e-commerce behavior.
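As a concrete illustration of the kind of analysis the five-day dataset supports, the sketch below groups records by product and reports each item's price spread across days. The numbers are made up for the example, not real Carrefour prices, and the record shape mirrors the scraper's JSON output.

```python
from collections import defaultdict

# Hypothetical daily observations in the scraper's output shape.
records = [
    {"name": "Tomato", "price": 4.50, "scraped_date": "2025-01-01"},
    {"name": "Tomato", "price": 3.95, "scraped_date": "2025-01-02"},
    {"name": "Cucumber", "price": 2.75, "scraped_date": "2025-01-01"},
    {"name": "Cucumber", "price": 2.75, "scraped_date": "2025-01-02"},
]

# Group each product's observed prices across the collection window.
by_product = defaultdict(list)
for rec in records:
    by_product[rec["name"]].append(rec["price"])

# Report whether each product's price held steady or moved.
for name, prices in by_product.items():
    spread = max(prices) - min(prices)
    label = "stable" if spread == 0 else f"varied by {spread:.2f}"
    print(f"{name}: {label}")
```

The same grouping extends naturally to discount labels or availability flags, turning the day-by-day snapshots into the trend story described above.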
Libraries and Versions
playwright==1.48.0
playwright-stealth==1.0.6
selenium==4.18.1
beautifulsoup4==4.13.3
AUTHOR
I’m Anusha P O, Data Science Intern at Datahut, with a strong focus on building automated data collection workflows that convert raw web content into clean, structured datasets ready for analysis.
In this blog, the focus shifts to practical data extraction from large e-commerce platforms, using the Carrefour UAE website as a real-world example. Modern online retailers display thousands of products across dynamic pages, and understanding how to responsibly collect and organize this information is an essential skill for anyone working with data today.
At Datahut, the goal is to design scalable and reliable web data solutions that support informed business decisions. If there is interest in using public web data for market analysis, pricing intelligence, or category insights, feel free to connect through the chat widget and explore how raw online data can be turned into actionable intelligence.
FAQ SECTION
1. What data can you scrape from Carrefour UAE?
Product name and category
Price and discount details
Stock availability
Product URLs and metadata
2. Which tools are best for scraping Carrefour UAE data?
Playwright for dynamic page loading
Selenium for browser automation
BeautifulSoup for HTML parsing
SQLite and JSON for data storage
3. Is it legal to scrape Carrefour UAE website data?
Scraping publicly available data is generally allowed
Must follow website terms of service
Avoid overloading servers with frequent requests
Use data responsibly for analysis
4. Why track vegetable prices over time?
Identify pricing trends and fluctuations
Monitor discount cycles
Analyze demand and supply patterns
Improve competitive pricing strategies
5. How do you clean scraped ecommerce data?
Remove duplicates and missing values
Standardize price formats
Convert data types (string to numeric)
Structure data for analysis tools


