How to Scrape Ethos Product Data Efficiently Using Python and Async Tools
- Anusha P O
- 2 days ago
- 38 min read

When we think about luxury watches, they are rarely just about telling time—they carry stories of craftsmanship, heritage, and personal style. In India, one name that consistently stands out in this space is Ethos Watches, the country’s largest luxury and premium watch retailer. With a market share of 13% in the premium and luxury segment and a market cap of over ₹7,400 crore, Ethos has built a reputation for trust and authenticity. Every timepiece sold here goes through a series of checks to ensure it is 100% genuine, giving customers complete confidence in their purchase. From iconic names like Rolex, Omega, and TAG Heuer to niche Swiss watchmakers, Ethos curates a wide collection that caters to different tastes—whether it’s a minimal everyday piece, a sporty companion, or an intricate horological masterpiece. With more than 60 boutiques across 20 cities, along with a strong online presence through ethoswatches.com, Ethos makes the world of fine watchmaking accessible to Indian buyers.
Ever wondered what stories lie hidden behind luxury watch collections? By looking closely at data from Ethos Watches, we can uncover patterns that go beyond just brand names or price tags. From understanding which watch types are most popular to seeing how design choices influence buyer preferences, data helps us see the bigger picture of the luxury watch market in India. In the sections ahead, we’ll walk through the process of gathering and analyzing this data to bring those insights to light.
Automated and Insightful Data Extraction
We worked with Ethos Watches’ data in two stages: first, collecting product page links from their online store, and then preparing the information for deeper analysis. Our focus was on exploring different aspects of luxury timepieces—ranging from pricing patterns and brand positioning to design details like straps, glass types, and movement styles.
Step 1: Gathering Product Page Links
Before we can dive into analyzing luxury watches, the very first task is to gather the raw material: product page links. Think of it like preparing the foundation of a house—without strong building blocks, nothing else can stand firmly. In the same way, if our dataset doesn’t start with accurate product links, the later stages of analysis will fall apart.
For Ethos Watches, this meant starting with their brands listing page at ethoswatches.com. This page acts like the front door to hundreds of premium timepieces, and our goal was to carefully step through each section, collect the product URLs, and save them for later. But since doing this manually would take days, we turned to automation.
Using Playwright, a tool that lets us control a web browser with code, we wrote a script that could act like a diligent assistant: open the Ethos website, close any unexpected popups, and systematically record every watch link it found. Each product block on the page contains a clickable link, and by telling Playwright where to look (div.product_sortDesc), we could extract these links one by one.
Once the links were collected, we didn’t just leave them floating around in memory. To keep everything neat and reusable, the script saved the links in two places:
A database (SQLite) – This worked like a permanent storage cabinet, ensuring every product link was stored securely without duplication.
A JSON file – This provided an easy-to-read snapshot of the links from each page, which could be shared or checked later.
Because Ethos Watches has multiple pages of products, the script also needed to handle pagination—that little “Next” button at the bottom of the page. Instead of us clicking it endlessly, the script kept moving forward until there were no more pages left, quietly recording every watch along the way.
In short, Step 1 was all about building a strong dataset of product URLs. With over sixty boutiques and hundreds of timepieces online, Ethos offers a massive catalog, and this process gave us a structured way to capture it all. Having these links in hand is like having a detailed map before setting out on a journey—we now know exactly where each watch lives on the site, and we’re ready to explore deeper insights in the next steps.
Step 2: Extracting Detailed Data from Each Product URL
Once we had a solid collection of product links, the next logical step was to open each one and dig deeper. If Step 1 was about drawing the map, Step 2 was about walking into every boutique and carefully noting down the details of each watch.
Every product page on Ethos Watches tells a story—brand, collection, price, movement type, water resistance, and more. But since no two pages are exactly alike, we needed to design our scraper to adapt.
Here’s how the process worked:
Using the product URLs we had stored earlier, our script revisited each page one by one and carefully extracted the details. Along the way, it was designed to handle unexpected interruptions like promotional or subscription popups by detecting and closing them automatically. Once inside, the scraper located key product information such as the title, brand, price, and technical specifications by targeting the right HTML elements to ensure accuracy. Finally, all captured details were stored securely in the same SQLite database alongside the product links, with a processed flag added to prevent re-scraping the same page. At the same time, the results were exported into a JSON file, making the dataset easy to inspect, share, or use for further analysis.
The most important part of this step was reliability. Websites can be unpredictable—sometimes a page loads slower, sometimes a detail is missing. To tackle this, we included error handling in our code: if a page failed to load or a field wasn’t found, the script logged it instead of breaking down. This ensured that the scraping process could run smoothly across hundreds of pages without constant supervision.
By the end of Step 2, we had transformed a simple list of product links into a rich dataset of Ethos Watches. Instead of just knowing where the watches were, we now had their full profiles: what they were, how much they cost, and what features they carried.
Step 3: Cleaning the Extracted Ethos Watch Data
When you first scrape data from a website like ethoswatches.com, it feels exciting—you’ve just collected hundreds of rows about luxury watches, complete with brand names, models, prices, and technical details. But very quickly, you’ll notice that raw data is rarely “ready to use.” Instead, it often looks messy, inconsistent, and full of small details that can confuse your analysis.
Think of it like bringing home fresh vegetables from the market. They look great at first glance, but before cooking, you need to wash, peel, and cut them. Data cleaning is exactly that process for datasets—it takes the raw, collected information and prepares it so that your analysis becomes smooth and reliable.
Take the price column as an example. The raw data usually includes the ₹ sign and sometimes even commas, like ₹2,50,000. While this looks fine to the human eye, computers prefer a simple number such as 250000. So, one of the first steps in cleaning is removing those extra symbols to make prices easier to calculate and compare.
Units also need attention. In Ethos data, details like 21,600 bph or Approx. 60 hours appear frequently. While these phrases are helpful for customers, they add unnecessary complexity to a dataset. By cleaning, we can strip away the word bph or remove Approx. from the power reserve, keeping only clean numeric values like 21600 or 60. This way, the data becomes uniform and ready for deeper analysis.
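The article does not show the cleaning code itself, but a minimal sketch of the two transformations just described might look like this (the function names and exact rules are assumptions for illustration, not the author's original script):
# CLEANING HELPERS (ILLUSTRATIVE SKETCH)
import re
def clean_price(raw_price):
    """Turn a scraped price string such as '₹2,50,000' into a plain integer (250000)."""
    if not raw_price:
        return None
    digits = re.sub(r"[^\d]", "", raw_price)  # drop the rupee sign, commas, and spaces
    return int(digits) if digits else None
def clean_numeric_spec(raw_value):
    """Strip unit words like 'bph' or prefixes like 'Approx.', keeping only the number."""
    if not raw_value:
        return None
    match = re.search(r"\d[\d,.]*", raw_value)  # first numeric chunk, e.g. '21,600' or '60'
    if not match:
        return None
    return float(match.group().replace(",", ""))
# clean_price("₹2,50,000") -> 250000; clean_numeric_spec("21,600 bph") -> 21600.0; clean_numeric_spec("Approx. 60 hours") -> 60.0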
The process may not sound as glamorous as scraping or visualizing trends, but it’s the bridge that connects raw collection to meaningful insights. Once the Ethos dataset is cleaned, we can confidently explore patterns—like how price relates to movement type, or whether certain watch features are linked to higher demand. Clean data doesn’t just make analysis possible; it makes the insights trustworthy.
Essential Building Blocks Behind the Scenes: Python Libraries That Power the Workflow
Behind every smooth and reliable data scraping project lies a carefully selected set of Python libraries working quietly in the background. Think of them as a team of skilled helpers—each with a specific role—making sure everything runs efficiently, stays organized, and doesn’t crash when things get tricky. In this project, we’ve brought together a handful of powerful tools that cover everything from web browsing to data saving, all while keeping the process fast and error-free.
To start, we have asyncio, which acts like a traffic controller for the script. It allows different parts of the program to run at the same time without waiting in line. For example, while one product page is still loading, another can already begin processing—saving time and keeping the workflow smooth. Working closely with asyncio is playwright.async_api, which opens and interacts with websites as if a person were browsing them. It clicks, scrolls, and fetches content—even from pages that load using JavaScript—making it perfect for modern websites.
But fast scraping can also attract unwanted attention from websites with anti-bot systems. That’s where Playwright Stealth (not shown in this block but often used alongside) can be handy—it helps the browser behave more naturally, reducing the chances of being blocked. Meanwhile, TimeoutError from Playwright helps us gracefully handle situations where a page takes too long to load, so the program doesn’t crash but moves on smartly.
Once data is captured, we need to store it in a reliable and organized way. That’s where sqlite3 comes in—a lightweight, file-based database that doesn’t need a server. It’s simple to use and perfect for saving structured information like URLs and product details. Think of it as a neat digital notebook that you can query anytime.
Supporting all of this are a few unsung heroes. The random module helps introduce natural pauses between actions, mimicking how a real person might browse, which also helps avoid detection. The logging library quietly keeps track of what’s happening during the run—successes, failures, and everything in between—so you can go back and understand any issues. Finally, Path from the pathlib module makes it easier to manage folders and file locations in a clean and reliable way, no matter which operating system you’re using.
Altogether, this thoughtful combination of Python libraries creates a strong foundation for scalable, efficient, and resilient data scraping. By letting each tool do what it does best, we ensure that the system remains easy to manage, beginner-friendly, and ready to handle real-world challenges.
Step 1: Extracting Product URLs from the Brands Listing Page on Ethos
Imports and Initial Setup
# IMPORT REQUIRED LIBRARIES
import asyncio
import json
import random
import sqlite3
import logging
from pathlib import Path
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
Before diving into the actual scraping process, the very first thing we need to do is gather our tools—and in Python, that means importing the right libraries. Think of this like laying out everything you’ll need on your workbench before starting a DIY project. These imports are not just random names; each one plays a specific and important role in helping our scraper run smoothly and smartly.
We begin with asyncio. This library is like a multitasking expert for our Python code. Normally, a program does one thing at a time—wait for a page to load, then process it, then move to the next one. But with asyncio, we can juggle several tasks at once. It’s like having multiple tabs open in your browser, each doing something useful in the background. This makes our scraper much faster and more efficient, especially when dealing with multiple pages.
Next, we pull in json, which helps us handle structured data. If we ever want to save the information we collect in a format that other programs or people can read easily, JSON is the way to go. It’s like packing our data into neat little boxes, with labels on everything.
Then comes random. This one might sound a bit odd in a scraper, but it actually serves a clever purpose. Websites are smart these days, and if they notice a bot clicking through their pages too quickly or in a predictable pattern, they might block it. So we use random to slow things down a little and add variation—maybe a 2-second pause here, a 3.1-second pause there—just to make our bot feel more human.
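In practice, the kind of human-like pause described here usually comes down to a single line; a small sketch (not taken from the original script) could look like this:
# HUMAN-LIKE RANDOM PAUSE (ILLUSTRATIVE SKETCH)
import asyncio
import random
async def human_pause(min_seconds=2.0, max_seconds=4.0):
    """Sleep for a random interval so page visits are not perfectly evenly spaced."""
    await asyncio.sleep(random.uniform(min_seconds, max_seconds))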
Now let’s talk about sqlite3. This is our way of storing the treasure we dig up. Think of it like a mini spreadsheet or a pocket-sized database that lives right on your computer. It doesn’t need any setup or internet access—just quietly saves everything in an organized file. This makes it perfect for projects where we’re collecting a lot of data, like product links, specifications, and prices.
We also bring in logging. Imagine you’re keeping a diary while running your scraper. If something goes wrong—maybe a page didn’t load, or a piece of data was missing—you’ll want a record of that. That’s what logging does. It keeps a behind-the-scenes record of everything our code does, so we can look back and understand what worked, what didn’t, and why.
Then we have Path from Python’s pathlib module. This is just a nice, readable way to handle file and folder paths on your computer. Instead of writing long, clunky strings to manage where files go, Path makes it all cleaner and more intuitive. And finally, we import async_playwright from Playwright, along with TimeoutError. Playwright is the heart of our operation—it’s what allows our code to control a web browser. It can open a website, click buttons, scroll through listings, and wait for elements to appear—just like a human browsing manually. That’s especially important for modern sites that load content dynamically or hide it behind buttons. The TimeoutError part helps us deal with situations where a page takes too long to load—kind of like setting an alarm so we don’t sit around waiting forever.
In short, this small block of imports is quietly doing a lot of heavy lifting. It gives our scraper the brain to think, the hands to interact, and the memory to store everything it finds. With these tools in place, we’re ready to start exploring the web—systematically, efficiently, and smartly.
Logging Configuration
# SETUP LOGGING FOR DEBUGGING AND TRACKING
log_dir = Path("logs")
log_dir.mkdir(exist_ok=True)
log_file = log_dir / f"scrape_log_{Path().cwd().name}.log"
logging.basicConfig(filename=log_file, level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
"""
This block sets up logging to track the script's behavior and any errors.
It creates a `logs/` directory (if not already present) and stores log messages
in a file with the current folder's name in it.
"""Once we’ve gathered our tools by importing all the necessary libraries, the next important step in building a scraper is to set up a way to keep track of what’s happening behind the scenes. Imagine you’re baking a cake for the first time and you decide to jot down everything as you go—what worked, what didn’t, and where you had to make changes. This is exactly what logging does for our code. It quietly records the journey of the script so we can look back later and understand how everything unfolded.
In this part of the code, we’re creating a log system using Python’s built-in logging module. We start by making a folder named logs. This will be our storage room—it's where all the log files will live. The line log_dir.mkdir(exist_ok=True) ensures that if this folder doesn’t exist already, Python will create it for us. And if it’s already there, that’s fine too—it won’t throw any errors or complaints.
Then, we build the name for our log file using the current working directory’s name. That’s what this line is doing: log_file = log_dir / f"scrape_log_{Path().cwd().name}.log".
Think of it like naming your notebook based on the kitchen you’re baking in. This helps keep our log files organized, especially if we’re running the scraper in different folders or for different projects.
Next, we tell Python how to write in this log file. Using logging.basicConfig(...), we define a few things: where to save the log (filename=log_file), how detailed the messages should be (level=logging.INFO), and the format of each log entry. This format includes the date and time something happened, the type of message (like INFO, WARNING, or ERROR), and a short message describing what occurred.
Why is all of this important? Well, imagine your scraper is running for hours and suddenly stops. Without logging, you’d be left guessing what went wrong. But with logging, you can open the log file and see exactly what the script was doing before it crashed. Maybe a page didn’t load in time, or the internet disconnected briefly. Logging keeps a reliable diary of every action, which is incredibly helpful for both beginners and experienced developers alike.
So before we even send our scraper out into the wild, we’re already building a system to help us understand and debug it later. It’s like packing a travel journal before a trip—you might not need it immediately, but when something unexpected happens, you’ll be glad it’s there.
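Once this configuration is in place, any logging call in the script goes straight to that file. For example (the messages and timestamp below are purely illustrative, not real output from the scraper):
# EXAMPLE LOG CALLS AND RESULTING FORMAT (ILLUSTRATIVE)
logging.info("Navigating to page 3 of the listing")
logging.warning("Timeout on page 7, retrying...")
# A line in the log file then looks roughly like:
# 2025-01-15 10:32:41,203 - INFO - Navigating to page 3 of the listing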
Defining Global Configuration and Constants
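The code in this step also refers to a few module-level constants (DB_PATH, START_URL, and HEADERS) that are not shown in the excerpts. A minimal sketch of what they might look like follows; the database filename, listing URL, and header values here are assumptions for illustration, so substitute your own.
# GLOBAL CONFIGURATION (ASSUMED VALUES FOR ILLUSTRATION)
DB_PATH = "ethos_products.db"  # SQLite file used by init_db() and save_to_db()
START_URL = "https://www.ethoswatches.com/..."  # placeholder for the listing page being paginated
HEADERS = {
    # Browser-like headers so the session does not look like a bare script
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
    "Accept-Language": "en-IN,en;q=0.9",
}
Database Initialization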
# DATABASE INITIALIZATION
async def init_db():
"""
Initializes the SQLite database by creating a table for storing product URLs.
"""
conn = sqlite3.connect(DB_PATH)
c = conn.cursor()
c.execute("""
CREATE TABLE IF NOT EXISTS product_urls (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE
)
""")
conn.commit()
conn.close()
Now that we’ve told our scraper where to go and how to behave, it’s time to prepare a place to store the information it collects. Think of this step as setting up a filing cabinet before you start sorting documents. In programming, that filing cabinet is often a database, and for small projects like ours, SQLite is a perfect choice. It’s lightweight, doesn’t require any server setup, and everything is saved in a simple file on your computer.
The function init_db() takes care of this setup. As the name suggests, it initializes the database—meaning it creates the structure we need to start storing data. But don’t worry, it’s not as complicated as it sounds. This function gently checks if our database already exists, and if it doesn’t, it builds the table we need. That table is called product_urls, and it’s where all the product page links we scrape will be saved.
Inside this table, we define two columns. The first is id, which is like a serial number that automatically increases each time we add a new row. You don’t need to think about this one too much—it’s just there to help keep everything uniquely identified. The second column is url, which will store the actual link to a product page. This column is marked as UNIQUE, which means no two entries can be the same. This is really helpful in web scraping because it prevents us from accidentally saving the same link more than once.
The function works step-by-step, just like how you’d open a notebook, draw a table, and label the columns. First, it connects to the database file at the location we defined earlier (DB_PATH). If that file doesn’t exist yet, SQLite will quietly create it. Then, we create a cursor, which is like a pen used to write commands inside the database. We use that cursor to write a CREATE TABLE command, which basically says, “If this table isn’t already here, make one now.” Once that’s done, we save the changes with commit() and close the connection—just like closing the notebook when you're done writing.
This setup is a one-time task. You only need to run init_db() once before scraping begins, to make sure everything is in place. After that, your scraper will know exactly where to put the URLs it collects, and it won’t bother saving the same one twice. It’s a quiet but crucial part of the process that ensures our data stays organized, tidy, and ready for analysis later.
Saving Scraped URLs to the Database
# SAVE URL TO DATABASE
async def save_to_db(url):
"""
Saves a single product URL to the SQLite database.
"""
conn = sqlite3.connect(DB_PATH)
c = conn.cursor()
try:
c.execute("INSERT OR IGNORE INTO product_urls (url) VALUES (?)", (url,))
conn.commit()
finally:
conn.close()
Once our database is set up and ready to receive data, the next step is teaching our scraper how to actually store something in it. That’s where the save_to_db() function comes in. Imagine this function as a responsible assistant that carefully writes each new product link into a notebook—making sure it doesn’t write the same thing twice.
This function is quite simple in its purpose: it takes in one URL—the link to a product page—and saves it to our database. That’s it. But it does it with care. The input to this function is a single string, the URL we’ve scraped from the website. We pass that URL into our database using a command called INSERT OR IGNORE. This phrase is key because it keeps things clean: if we accidentally try to insert the same URL more than once, SQLite will politely ignore it instead of throwing an error or cluttering our table with duplicates.
Behind the scenes, the function starts by connecting to the database file we defined earlier with DB_PATH. Once connected, we create a cursor—this is what allows us to send instructions to the database. Then, using that cursor, we run the SQL command to insert the URL. The ? in the SQL line is a placeholder for the actual value, and it’s filled in with the url we pass as a one-element tuple, (url,). This method not only prevents errors but also helps protect against SQL injection—a common security issue.
Notice how we wrap the database operations in a try block with a finally clause. Even though this function is small, we still want to be safe and responsible. The finally block ensures that the database connection is always closed, no matter what. Think of it like turning off the lights and locking the door when you leave a room—even if something unexpected happens.
In short, save_to_db() is a quiet worker. It doesn’t try to do too much. It takes one piece of data, checks if it’s new, and files it away if it hasn’t seen it before. This kind of function becomes incredibly useful when scraping hundreds or thousands of links, helping us avoid duplication and making sure every piece of data has its proper place.
Store URLs Safely with JSON Backup
# SAVE URL TO JSON
async def save_to_json(url_list, page_num):
"""
Saves a list of URLs from a single page into a JSON file.
"""
json_path = Path(f"Data/data1/ethos_page_{page_num}.json")
json_path.parent.mkdir(parents=True, exist_ok=True)  # make sure the Data/data1 folder exists before writing
with json_path.open("w") as f:
json.dump(url_list, f, indent=2)
After collecting product links from a webpage, we often want to keep a backup—not just for safety, but also for reviewing, sharing, or using the data outside of the database. That’s where our save_to_json() function comes into play. Think of this step as saving your progress in a game, or keeping a soft copy of your handwritten notes. While our database stores everything in a structured, query-friendly format, the JSON file gives us a more portable and human-readable version of the same information.
This function takes two inputs: a list of URLs (url_list) and the page number (page_num) they were scraped from. The page number is especially useful because it helps us organize the files. Instead of dumping everything into one huge document, we save each page’s data separately. This keeps things neat and makes it easier to debug or resume scraping later if something goes wrong.
Inside the function, we first define the path where this JSON file should be saved. Using Python’s Path from the pathlib module, we construct a clean, consistent file location—something like Data/data1/ethos_page_3.json, where “3” is the current page number. It’s a simple but powerful way to make our files self-explanatory just by their names.
Then we use Python’s built-in json module to handle the actual saving. The with open() block ensures the file is opened safely and closed properly when we’re done writing. The json.dump() function takes our list of URLs and writes them into the file in JSON format. The indent=2 part just makes the file prettier and easier to read if we open it later—sort of like adding clean line breaks and indentation in your notebook.
This small function may not seem flashy, but it’s incredibly practical. By saving each batch of URLs page by page, we’re giving ourselves a safety net. If the scraper stops midway or we need to inspect specific data later, we don’t have to rerun everything from scratch. Each JSON file becomes a snapshot of that moment in the scraping process—organized, readable, and ready to use.
Smart Pop-up Handler
# HANDLE UNEXPECTED POPUPS
async def close_popups(page):
"""
Attempts to close any pop-up elements that may appear and block access
to the main content of the web page.
"""
selectors = [
'div[role="button"][aria-label="Close"]',
'a.mdl-cls-btn.ctClickNew'
]
for selector in selectors:
try:
await page.locator(selector).first.click(timeout=2000)
logging.info(f"Closed popup: {selector}")
except:
pass
While scraping a website, things don’t always go exactly as planned. Sometimes, just as your script is trying to read or click something on the page, an unexpected pop-up appears—like a newsletter sign-up, a promotional offer, or a cookie consent banner. If you’ve ever visited a shopping website and been greeted by a sudden overlay blocking the content, you already know how frustrating these pop-ups can be. For a human, it’s easy to click the close button and move on. But for a scraper, unless we teach it what to do, it gets stuck—unable to move forward. That’s exactly why we have the close_popups() function.
This function is a simple but thoughtful piece of our scraper that plays the role of a quiet troubleshooter. It receives the current webpage (or "tab") being controlled by Playwright and begins scanning for pop-ups that might be hiding parts of the site we want to scrape. To do that, it goes through a small list of CSS selectors—these are patterns that help it find specific elements on the page, such as the close button on a pop-up window.
For each selector in the list, the function tries to locate the first matching element. If it finds it, it clicks the button to close the pop-up. If the element isn’t found—maybe the pop-up didn’t show up this time, or it was already dismissed—the function simply moves on without raising an error or stopping the script. This approach keeps the scraper flexible and resilient, ready to handle unpredictable behaviors on the site without breaking.
What makes this function so valuable is its subtlety. It doesn’t scrape data or save anything, but it quietly clears the way so that everything else can work properly. Pop-ups often sit right on top of product listings or navigation buttons. If we don’t close them, our script might fail to click on the “Next” button or miss a set of links. By handling these interruptions in advance, close_popups() helps ensure that the scraping flow remains smooth and uninterrupted, no matter what the website throws at us.
The Core Scraping Function
# MAIN ASYNCHRONOUS SCRAPING FUNCTION
async def scrape_ethos():
"""
Main asynchronous scraping function for Ethos Watches website.
"""
await init_db()
async with async_playwright() as p:
browser = await p.firefox.launch(headless=False)
context = await browser.new_context(extra_http_headers=HEADERS)
page = await context.new_page()
current_page = 1
total_scraped = 0
while True:
url = f"{START_URL}?p={current_page}" if current_page > 1 else START_URL
logging.info(f"Navigating to: {url}")
retry = 0
success = False
while retry < 3 and not success:
try:
await page.goto(url, timeout=60000)
await close_popups(page)
success = True
except PlaywrightTimeoutError:
retry += 1
wait = 2 ** retry + random.uniform(0, 2)
logging.warning(f"Timeout on page {current_page}. Retrying in {wait:.1f}s...")
await asyncio.sleep(wait)
if not success:
logging.error(f"Failed to load page {url} after retries.")
break
product_desc_blocks = await page.query_selector_all("div.product_sortDesc")
product_urls = []
for block in product_desc_blocks:
try:
a_tag = await block.query_selector("a")
href = await a_tag.get_attribute("href") if a_tag else None
if href and href.startswith("https://www.ethoswatches.com/"):
await save_to_db(href)
product_urls.append(href)
except Exception as e:
logging.warning(f"Error parsing product block: {e}")
await save_to_json(product_urls, current_page)
logging.info(f"Page {current_page}: Scraped {len(product_urls)} product URLs.")
total_scraped += len(product_urls)
next_button = await page.query_selector('a.next.page-link')
if next_button:
current_page += 1
delay = random.uniform(3, 6)
logging.info(f"Waiting {delay:.2f}s before next page...")
await asyncio.sleep(delay)
else:
logging.info("No more pages.")
break
await browser.close()
logging.info(f"Total products scraped: {total_scraped}")
After setting up all the building blocks—defining global variables, preparing the database, handling pop-ups, and saving data to files—we’re finally ready to bring everything together in one place. That’s the job of our main function, scrape_ethos(). This function is the heart of the scraping process. You can think of it as the conductor of an orchestra, calling on each instrument—our earlier functions—to play its part at just the right time. And because the web is dynamic and involves waiting for pages to load, we use Python’s asynchronous tools to manage everything efficiently and smoothly.
We begin by calling init_db(), which makes sure our database is set up and ready to store product URLs. Then we launch a web browser using Playwright—a tool that lets us control a browser as if a human were using it. Instead of hiding the browser (as scrapers often do to stay fast), we use it in visible mode (headless=False). This is useful while testing, so we can see what’s happening on each page.
Next, the function starts visiting each page of the website one by one. Ethos Watches uses pagination, meaning that product listings are spread across multiple pages. To move through them, we update the page number in the URL using a loop. For each page, we use a while loop to handle retries. Websites don’t always load smoothly—sometimes they’re slow or have temporary hiccups. So if the page doesn’t load the first time, we wait a bit and try again. This retry logic uses exponential backoff, a fancy way of saying, “wait a little longer each time you fail, but don’t give up too quickly.”
Once the page loads, we immediately call our close_popups() function. This checks if any unwanted pop-ups are covering the content and closes them so we can continue scraping without interruptions. Then we search for all the blocks that contain product information. These are usually HTML elements that follow a certain pattern, in this case, div.product_sortDesc.
From each block, we try to pull the link (<a> tag) that leads to the individual product’s page. We check if the link is valid and starts with the correct base URL to ensure we’re only saving proper product URLs. Each valid link is saved in two places—first into the SQLite database using save_to_db(), and then into a JSON file using save_to_json(). This two-pronged saving approach gives us both a structured database for analysis and a human-readable file for quick inspection or backups.
After collecting all the links from a page, the scraper checks if there is a “Next” button. If so, it waits a few seconds (randomly between 3 and 6) to mimic human browsing, then moves to the next page. If the “Next” button isn’t there, the scraper understands that it has reached the end and stops gracefully.
Finally, when all pages have been processed, we close the browser and print the total number of products scraped. This final message is a small but satisfying checkpoint—it lets us know everything has completed as expected. In the end, this function not only automates the entire scraping process from start to finish, but it also does so in a way that’s thoughtful, reliable, and respectful of the website being visited.
Script Entry Point
# SCRIPT ENTRY POINT
if __name__ == "__main__":
"""
Entry point of the script.
"""
asyncio.run(scrape_ethos())
When you're writing a Python script, it's important to have a clear starting point—something that tells the computer, "Begin here." In this script, the block that starts with if __name__ == "__main__": is exactly that. Think of it like the front door to your program. If someone runs this script directly, this block is what gets executed. But if someone just wants to borrow part of your script (maybe they want to use your functions in their own code), then this section will stay quiet and not run automatically. That’s what makes this line so useful—it helps control when your code should actually do something.
Inside this block, there's a single line: asyncio.run(scrape_ethos()). This is where the real action begins. The scrape_ethos() function is the core of your scraping process—it’s where the browser launches, pages are visited, and product links are collected. But because it’s written as an asynchronous function (which is useful when you're doing tasks like waiting for websites to load), you can’t just call it like a regular function. That’s where asyncio.run() comes in. It sets up the necessary environment and tells Python, “Okay, this is an async function, so let’s run it properly.” It’s a bit like turning on the engine before you can drive a car.
Using this entry-point pattern might feel like a small detail, but it’s actually a good habit for anyone writing Python scripts—especially as your code grows or you work in teams. It keeps things clean, flexible, and prevents surprises when your code is used in new ways. So, even though it's just a few lines, it plays a key role in making sure your scraping project starts at the right time, in the right way.
Step 2: Turning URLs into Complete Product Profiles
Imports and Initial Setup
# IMPORT REQUIRED LIBRARIES
import asyncio
import json
import logging
import sqlite3
from pathlib import Path
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
from playwright_stealth import stealth_async
import re
import random
The script begins by importing a set of essential Python libraries—both built-in and third-party—that provide the backbone for everything that follows. Modules like sqlite3, json, and logging help manage data storage, structure, and track the script’s progress. asyncio supports asynchronous operations, which allow the script to perform tasks like web browsing without freezing or waiting unnecessarily. The powerful playwright and playwright_stealth libraries handle browser automation and help avoid detection while scraping. Additional helpers like Path, re, and random make tasks like handling file paths, working with patterns, and introducing randomness smooth and reliable. These imports set the stage for a well-coordinated, efficient web scraping process.
Essential Paths for Database, JSON, and Logs
# CONSTANTS FOR FILE AND DATABASE PATH
DB_PATH = "ethos_products.db"
JSON_PATH = "ethos_product_data.json"
LOG_PATH = Path("logs/scrape_details.log")
LOG_PATH.parent.mkdir(exist_ok=True)
"""
This section defines key file paths used throughout the scraper:
"""Before diving into the actual scraping work, it's important for our script to set up a few things behind the scenes—much like laying out your tools before starting a project. In this case, we begin by defining a few constants that tell the script where to save different types of information as it runs.
First, we specify where the scraped data will be stored using DB_PATH. This points to a local SQLite database file named ethos_products.db. Think of this database like a digital notebook with two main sections—one for just keeping track of product page links (product_urls), and another for saving the full details scraped from each of those pages (product_data1). Having this separation helps the script stay organized and know what has already been done versus what’s still pending.
Next is JSON_PATH, which refers to a file named ethos_product_data.json. This file acts like a backup copy of the product data, but in a format that’s easy to read and share. JSON files are especially useful if you want to later open the data in tools like Excel or convert it into other formats.
Then we have LOG_PATH, which leads to a log file stored inside a folder called logs/. Logs are like a running diary for the script—they record everything from successful actions to unexpected problems. This helps us trace issues later or just understand how the script performed over time.
Lastly, we make sure that the logs/ folder actually exists before the script tries to save anything there. The line LOG_PATH.parent.mkdir(exist_ok=True) takes care of this by quietly creating the folder if it’s missing. And if it’s already there, the script simply moves on without complaint.
Altogether, this section is about setting up smart, reusable paths so that our scraper knows exactly where to place its output, how to track its progress, and where to look if something goes wrong. It’s a small but important step in making the script stable, organized, and beginner-friendly.
Logging Configuration for Debugging and Monitoring
# SETUP LOGGING FOR DEBUGGING AND TRACKING
logging.basicConfig(
filename=LOG_PATH,
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s"
)
"""
This section sets up logging to help monitor the scraping process.
"""To keep track of what the script is doing at each step, we set up something called logging. Think of logging like keeping a diary for the scraper—it notes down what’s happening, when it happens, and if anything goes wrong. This is especially useful when scraping large websites or running long scripts, where things might fail quietly if we’re not paying attention. In this setup, every log message is saved to a file named scrape_details.log, stored in a folder called logs.
Organizing Storage for Product Data
# DATABASE SETUP
def setup_database():
"""
Create and update the required database tables.
"""
conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()
# Add processed column to product_urls table
cur.execute("PRAGMA table_info(product_urls)")
columns = [row[1] for row in cur.fetchall()]
# Add 'processed' column if missing (default = 0 → not scraped)
if "processed" not in columns:
cur.execute("ALTER TABLE product_urls ADD COLUMN processed INTEGER DEFAULT 0")
# Create structured table for scraped product data
cur.execute("""
CREATE TABLE IF NOT EXISTS product_data1 (
url TEXT PRIMARY KEY,
name TEXT,
brand TEXT,
mrp TEXT
)
""")
conn.commit()
conn.close()
Before we start collecting any product data, we need to prepare a place to store everything—and that’s where the database setup comes in. Think of it like setting up a filing cabinet before you begin sorting and storing documents. If the cabinet isn’t ready or doesn’t have the right folders, things can easily get lost or mixed up. This part of the script handles that preparation for us.
In this function, we connect to an SQLite database—basically, a small file-based system where we’ll keep track of the URLs we want to scrape and the product details we extract. First, we check if a column called processed exists in the product_urls table. This column acts like a checkbox that helps us remember whether we’ve already scraped a product URL. If it’s not there, we add it. Any URL with processed = 0 means we still need to visit it, while processed = 1 means we’ve already handled it.
Next, we set up a new table called product_data1. This is where the detailed product information will be stored. The snippet above shows only the core columns (URL, name, brand, and MRP), but the same approach extends to richer details like case shape or power reserve. Each row in this table represents one product, and the URL is used as a unique identifier to avoid saving duplicates.
By running this setup function before we begin scraping, we make sure everything is neatly organized. The database will be ready to record and track every product we scrape without confusion or duplication. It also helps us safely pause and resume scraping without starting over, since we already know which URLs have been processed. This makes the whole process much smoother and far less error-prone.
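Because the processed flag lives in the same database, checking how far a run has progressed takes only one query. A small helper like the one below (a sketch, not part of the original script) can be handy between runs:
# COUNT URLS STILL WAITING TO BE SCRAPED (ILLUSTRATIVE SKETCH)
import sqlite3
def count_pending_urls(db_path="ethos_products.db"):
    """Return how many product URLs still have processed = 0."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM product_urls WHERE processed = 0")
        return cur.fetchone()[0]
    finally:
        conn.close()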
Mark URLs as Processed
# UPDATE PROCESSED STATUS IN DATABASE
def update_processed_status(url):
"""
Marks a product URL as "processed" in the database after it has been successfully scraped.
"""
conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()
cur.execute("UPDATE product_urls SET processed = 1 WHERE url = ?", (url,))
conn.commit()
conn.close()
After scraping data from a product page, it’s important for our script to remember that the job is done for that specific page. Otherwise, the next time the script runs, it might go back and repeat the same work, which wastes time and can lead to duplicate entries in our database. That’s where this small but essential function, update_processed_status, comes into play.
Think of it like checking off a task on a to-do list. Once we’ve visited a product page and saved its details, we want to mark it as “complete.” This function connects to our SQLite database and updates the corresponding product URL’s status to show that it has already been handled. It does this by setting the processed column to 1, which simply means “done.”
The function works quietly in the background. When you pass in a URL, it opens the database, finds the matching entry in the product_urls table, and marks it as processed. It then saves the change and closes the connection, making sure everything is neat and tidy before moving on.
Why is this so important? Imagine scraping thousands of product URLs—without a reliable way to track which ones have already been scraped, the script would either reprocess everything from scratch or get confused. By using this approach, we give our scraper a memory. It knows exactly where it left off and can pick up from there, especially helpful if the script is interrupted or needs to be restarted later. In short, this function keeps our data collection process organized and efficient, helping us avoid repetitive work and maintain clean, accurate results.
Store Scraped Entries in JSON
# SAVE SCRAPED DATA TO JSON FILE
def save_to_json(json_path, entry):
"""
This function appends a single scraped product entry (in dictionary format) to a JSON file.
"""
if Path(json_path).exists():
with open(json_path, "r", encoding="utf-8") as f:
data = json.load(f)
else:
data = []
data.append(entry)
with open(json_path, "w", encoding="utf-8") as f:
json.dump(data, f, indent=4)
After collecting product information from a webpage, we need a safe and accessible place to store that data. While databases are excellent for structured storage, sometimes it's also helpful to have a simple file you can open and read directly—something you can easily share, move around, or check with your own eyes. That’s exactly what this save_to_json function is designed to do.
Think of a JSON file like a notebook where each page holds details about one product. This function takes a single product entry (structured as a Python dictionary), and adds it to that notebook. If the file doesn’t already exist, it creates one from scratch. If it does exist, the function opens it up, reads all the previous entries, adds the new one to the list, and then saves everything back neatly in the same file.
Here’s how it works step by step. First, the function checks if the file (given by json_path) already exists on your computer. If it does, it loads all the existing entries into a list. If not, it just starts fresh with an empty list. Then, it appends the new product information to that list—like adding one more page to our notebook. Finally, it writes the entire updated list back to the file, formatting it with indentations to keep things clean and easy to read.
This approach is especially helpful if you want a backup of your scraped data or if you're not ready to work with a database just yet. JSON files are flexible, human-readable, and compatible with many tools. Later, you can even convert them to CSV or Excel if needed. So in a nutshell, this function helps keep your data safe, organized, and always within reach—even outside your code.
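As a quick example of that last point, turning the JSON backup into a CSV takes only a couple of lines with pandas (assuming pandas is installed; the input file name matches the JSON_PATH constant defined earlier):
# CONVERT THE JSON BACKUP TO CSV (ILLUSTRATIVE SKETCH)
import pandas as pd
df = pd.read_json("ethos_product_data.json")  # the list of product dictionaries written by save_to_json
df.to_csv("ethos_product_data.csv", index=False)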
Popup Management for Smooth Scraping
# HANDLE UNEXPECTED POPUPS
async def close_popups(page):
"""
Attempts to close any pop-up windows that may appear on Ethos product pages.
"""
# First popup
try:
await page.locator("div#close[role='button']").click(timeout=2000)
logging.info("First popup closed")
except:
pass
# Second popup
try:
await page.locator("a[onclick='cnscbEthClose()']").click(timeout=2000)
logging.info("Second popup closed")
except:
pass
When you're building a scraper to collect product details from a website, it's not always a smooth ride. Sometimes, pages throw up unexpected pop-ups—like cookie notices, promotional banners, or welcome messages—that cover the content you're trying to grab. These popups can block important product details and interrupt your scraper's flow, causing it to either miss key information or crash entirely. That’s where this function, close_popups, steps in to help.
Imagine you’re trying to read a book, but every few pages, someone waves an ad in front of your face. You’d have to gently push it aside before continuing. Similarly, this function quietly checks for two known types of popups that appear on the Ethos product pages. It does this by looking for specific HTML patterns—kind of like scanning a page for a "close" button and clicking it automatically if it's there.
It works asynchronously, meaning it runs alongside other tasks without blocking them. That makes it suitable for fast-paced web scraping, where every second counts. The function tries to click each popup’s close button using a short timeout of two seconds. If the popup isn’t found or doesn’t close in time, it doesn’t throw a tantrum—it just moves on silently. This helps keep your scraping process steady and uninterrupted.
Why is this important? Because when a popup covers the screen, even if your code is perfect, it might still fail to fetch the data. For example, if a cookie banner is sitting on top of the "Add to Cart" button or product name, your scraper might think the element doesn't exist or isn't clickable. By handling these interruptions upfront, close_popups ensures your scraper has a clear view of the page—just like cleaning your glasses before reading.
You typically call this function right before you start extracting data from a page. Just write await close_popups(page), and it will do its job quietly in the background. It’s one of those small touches that makes your scraping setup much more reliable and professional, especially when dealing with unpredictable websites.
Retrieve Label-Based Data from Pages
# EXTRACT A SPECIFIC PRODUCT SPECIFICATION VALUE
async def extract_spec_value(page, label):
"""
Extracts the value of a specific product specification from a product detail page.
"""
try:
# Find by specRow or li.calibre_sepcColumn
element = await page.query_selector(f"xpath=//div[@class='specRow'][span[@class='specName' and contains(text(), '{label}')]]/span[@class='specValue']")
if element:
return (await element.inner_text()).strip()
# Try li fallback
element = await page.query_selector(f"xpath=//li[@class='calibre_sepcColumn specRow'][span[@class='specName' and contains(text(), '{label}')]]/span[@class='specValue']")
if element:
return (await element.inner_text()).strip()
except Exception as e:
logging.warning(f"Failed to extract {label}: {e}")
return None
In web scraping, especially when dealing with product pages, it’s often necessary to pull out specific details—like the size of a watch, its strap color, or the type of movement inside. These pieces of information usually sit in a structured part of the webpage called a specification section. Think of it like a table where each row shows a label and its value—for example, “Case Size” on the left and “42 mm” on the right. Now, the challenge here is that not all web pages structure these rows the same way. Some use one kind of HTML layout, while others might use a different one altogether. So, to handle this smartly, we use a flexible function called extract_spec_value.
This function is designed to take in a page (which is the product page we’re looking at) and a label (like “Strap Color”), then go and find the value that matches that label. It does this by first looking for a block of HTML with a class name called specRow. Inside that, it checks if there’s a span with the class specName that includes the label we’re searching for. If it finds that, it pulls the value from the corresponding specValue span. But the page might be using a different structure, so the function is smart enough to try a second layout as a fallback—specifically, a list item (li) with a slightly different class name. This two-step search ensures we don’t miss the data just because the structure changes a little.
The beauty of this function is that it hides all the technical details behind a clean, readable interface. You simply ask, “What’s the Case Size?” and it brings you back the answer, if it exists. If it can’t find the label or something goes wrong during the search, it quietly logs a warning and returns None. This way, the scraping process doesn’t crash—it just skips that bit and keeps moving. As a result, the rest of the data collection can continue uninterrupted, and we still have a note in the logs about what went wrong, in case we want to fix or revisit it later. Functions like this one help keep our scraping code clean, reusable, and resilient—even when websites throw us curve balls with changing layouts.
Fetching and Organizing Product Info
# EXTRACT ALL PRODUCT DETAILS FROM THE PAGE
async def extract_data(page):
"""
Extracts detailed product information from an Ethos Watches product detail page.
"""
data = {}
try:
# Name
title = await page.query_selector("h1.ethos_title span.fWeight_regular")
data["name"] = await title.inner_text() if title else None
# Brand
brand = await page.query_selector("div.specCol span.specName:text('Brand') + span.specValue a")
if not brand:
brand = await page.query_selector("h1.ethos_title a")
data["brand"] = await brand.inner_text() if brand else None
# MRP
price = await page.query_selector("span.price")
data["mrp"] = await price.inner_text() if price else None
except Exception as e:
logging.error(f"Error extracting data: {e}")
return data
At the heart of our scraping project, there’s a function called extract_data(). This is the part of the script responsible for visiting a single product page on the Ethos Watches website and pulling out all the important details about the watch listed there. Think of it as the person who walks into a store, reads every tag on a product, notes down everything neatly, and leaves. That’s what this function does—but on the web.
Now, since websites are built using HTML, we can target specific parts of the page using tools like Playwright, which lets us control a browser with code. And since websites sometimes load information slowly or use a lot of background scripts, we use asynchronous programming here. That just means we let our code “wait patiently” for things to load, without freezing the whole program. Inside the function, we start with a blank dictionary called data. This is like an empty notepad where we’ll write down each detail we find about the product.
We begin by trying to get the name of the product. Usually, it’s found in the main heading at the top of the page. So, we use Playwright to find that element using something called a CSS selector—basically, a way to point to specific parts of a webpage. If we find the name, we grab the text and store it in our dictionary.
Next, we look for the brand of the watch. Sometimes it's right there under the "Brand" label; other times, it might be part of the title. So, we try both possibilities to make sure we don’t miss it. This “try multiple ways” approach is common in scraping because websites aren’t always perfectly consistent. Then we move on to the model number. This could be sitting inside a tag with the ID #psku, or if it's not there, we ask our helper function extract_spec_value() to look it up for us in the specifications section. That helper is smart—it knows how to look for labels and get their values, which is great for structured data.
We do something similar for the price, or MRP. We look for a span with the class “price”, grab the number if it exists, and store it. Now comes the bulk of the work: all the technical specifications of the watch. This is where the function shines. We’ve prepared a list of fields we care about—things like “Strap Color”, “Dial Colour”, “Movement”, “Power Reserve”, and many others. For each of these, we loop through and call our extract_spec_value() helper. This helper searches the page for that label and grabs whatever value is next to it.
To keep our keys in the dictionary consistent and clean, we lowercase each label and replace spaces with underscores. So, for example, “Strap Material” becomes strap_material—easier to work with in code later.
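The extract_data() excerpt above shows only the name, brand, and MRP lookups, so the specification loop described here is not part of the snippet. A hedged sketch of how it could be built on top of extract_spec_value() might look like this (the label list and the helper name extract_specs are assumptions based on the description, not the author's code):
# LOOP OVER LABELLED SPECIFICATIONS (ILLUSTRATIVE SKETCH)
SPEC_LABELS = [
    "Case Size", "Case Shape", "Strap Color", "Strap Material",
    "Dial Colour", "Movement", "Water Resistance", "Power Reserve",
]
async def extract_specs(page):
    """Collect each labelled spec row into a dict with clean, lowercase keys."""
    specs = {}
    for label in SPEC_LABELS:
        value = await extract_spec_value(page, label)  # helper defined earlier in this step
        specs[label.lower().replace(" ", "_")] = value  # "Strap Material" -> "strap_material"
    return specs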
Throughout this whole process, we wrap everything in a try block. That’s a safety net. If something goes wrong—maybe the website structure changes, or an element is missing—we don’t want the whole program to crash. Instead, we log the error, skip that one detail, and continue collecting the rest. That way, we don’t lose everything just because of one hiccup.
Finally, once all the available information is collected, the function returns the data dictionary. At this point, it's like a neatly filled-out form with all the details of the watch, ready to be saved into a database or analyzed further.
In simple terms, this function behaves like a reliable assistant: it opens a product page, reads every label carefully, grabs the matching information, and organizes it all into a clean, consistent format. It handles surprises gracefully and keeps working even when some data is missing. This kind of organized, flexible approach is what makes a scraper both powerful and dependable—even when the website it's working on isn't perfect.
Main Asynchronous Scraper
# MAIN ASYNCHRONOUS SCRAPING FUNCTION
async def scrape():
"""
Main asynchronous function that controls the entire scraping workflow.
"""
setup_database()
conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()
cur.execute("SELECT url FROM product_urls WHERE processed = 0")
urls = [row[0] for row in cur.fetchall()]
conn.close()
if not urls:
logging.info("No unprocessed URLs found.")
return
logging.info(f"Total URLs to process: {len(urls)}")
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context()
page = await context.new_page()
await stealth_async(page)
for idx, url in enumerate(urls, 1):
try:
logging.info(f"[{idx}] Navigating to {url}")
await page.goto(url, timeout=30000)
await asyncio.sleep(random.uniform(1, 3))
await close_popups(page)
data = await extract_data(page)
data["url"] = url
# Save to DB
conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()
columns = [
"url", "name", "brand", "mrp"
]
values = [data.get(col, None) for col in columns]
cur.execute(f"""
INSERT OR REPLACE INTO product_data1 ({', '.join(columns)})
VALUES ({','.join('?' * len(columns))})
""", values)
conn.commit()
conn.close()
# Save to JSON
save_to_json(JSON_PATH, data)
# Mark as processed
update_processed_status(url)
logging.info(f"[{idx}] Scraped and saved: {url}")
except PlaywrightTimeoutError:
logging.error(f"[{idx}] Timeout while loading {url}")
except Exception as e:
logging.error(f"[{idx}] Unexpected error: {e}")
await browser.close()
Let’s imagine you’re tasked with collecting detailed product information from a website—like a digital assistant that visits a page, reads every detail carefully, writes it down, and moves on to the next page without repeating itself. That’s exactly what the scrape() function is designed to do. It acts like the brain behind our scraping operation, coordinating every step from start to finish.
To begin with, this function is asynchronous, which simply means it can perform multiple tasks at once without waiting for one to finish before starting another. This is especially helpful when you're dealing with the internet, where delays are common. Think of it like reading a book while waiting for water to boil—you're making good use of your time. This kind of efficiency is what async def scrape() brings to the table.
Now, before diving into any scraping, the function prepares the environment by calling another function named setup_database(). This ensures all necessary tables in the database are ready. Then it connects to our SQLite database and fetches all product URLs that haven’t been visited yet—these are marked with processed = 0. If there are no URLs left, it logs that there’s nothing to do and gently exits. This is a thoughtful checkpoint to avoid unnecessary effort.
When there are URLs to process, the real journey begins. We launch a Chromium browser using Playwright, which is a tool that can control web browsers just like a human would—with clicks, scrolls, and even typing. But here’s the twist—we also apply a stealth mode using stealth_async(page). This makes our scraper look and behave like a real user, helping us avoid getting blocked by the website’s defenses.
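The stealth_async call typically comes from the third-party playwright-stealth package; its import lives earlier in the full script, so the following is only a minimal sketch under that assumption:

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async  # pip install playwright-stealth

async def open_stealth_page():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        # Patches tell-tale automation signals (navigator.webdriver, plugins, etc.)
        # so the browser looks closer to a regular visitor's.
        await stealth_async(page)
        # ... navigate and scrape inside this block ...
        await browser.close()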
Once the browser is ready, we loop through each URL one by one. For every product page, we navigate to it and wait for it to load. Since many websites now have pop-ups (like newsletter signups or cookie warnings), we call close_popups(page) to get them out of the way, just like you would click the 'X' before browsing a page.
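The real close_popups() helper is defined earlier in the full script; purely to illustrate the idea, a popup-closing helper might look like this (the selectors are hypothetical placeholders):

async def close_popups(page):
    # Try a few likely close-button selectors; pages without popups are simply skipped.
    for selector in ("button.close", "[aria-label='Close']", ".newsletter-popup .dismiss"):
        try:
            button = page.locator(selector).first
            if await button.is_visible():
                await button.click()
        except Exception:
            pass  # nothing matched this selector; move on to the next one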
Next comes the heart of the scraping—extract_data(page). This function collects everything we care about, such as the product’s name, brand, price, model number, strap material, and much more. After gathering this information, we also tag it with the URL it came from, so we always know the source.
Once the data is in hand, we save it in two places. First, we insert it into a structured SQLite database, which acts like a local spreadsheet where each row is a product and each column holds specific details. Then, we also save the same data into a JSON file. This acts like a backup or an easy-to-share export that other programs or people can use later. After saving the data, we make a note that the URL has been processed by calling update_processed_status(url). This is like checking off a task on your to-do list—it ensures we don’t visit the same page again in future runs.
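Both save_to_json() and update_processed_status() are small helpers defined earlier in the full script. As a rough sketch of what they typically do (the exact implementations may differ):

import json
import sqlite3
from pathlib import Path

def save_to_json(json_path, data):
    # Append one product record to a JSON array on disk, creating the file if needed.
    path = Path(json_path)
    records = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    records.append(data)
    path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")

def update_processed_status(url):
    # Flag the URL as processed so the next run skips it.
    conn = sqlite3.connect(DB_PATH)  # DB_PATH is defined near the top of the script
    conn.execute("UPDATE product_urls SET processed = 1 WHERE url = ?", (url,))
    conn.commit()
    conn.close()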
Throughout the entire process, we also maintain detailed logs. Whether the scraping is successful, a timeout happens, or an unexpected error pops up, we log it all. This is extremely helpful when debugging or resuming a scrape that was interrupted.
Finally, after all URLs are processed, we close the browser gracefully. Just like shutting down your computer at the end of the day, this step ensures all resources are freed up and everything ends cleanly.
In summary, the scrape() function ties together many moving parts into one well-orchestrated routine. It ensures we only visit unprocessed URLs, collect rich product data carefully, save it reliably, and avoid repeating our work. For anyone starting out in web scraping, this function offers a complete, real-world example of how to build a smart, efficient, and polite scraper—one that’s both effective and respectful to the website it visits.
Starting the Scraper (Main Entry Point)
# SCRIPT ENTRY POINT
if __name__ == "__main__":
    """
    Program entry point. Runs the async scrape function and logs fatal errors if any.
    """
    try:
        asyncio.run(scrape())
    except Exception as e:
        logging.critical(f"Fatal error: {e}")
At the very end of the script, we have a simple block that acts like a "start button." When you run this script directly, it kicks off the main scraping process. In our case, this is handled by the if __name__ == "__main__": block. If anything goes seriously wrong, it logs the error so you know what happened.
Conclusion
In the world of online retail, understanding detailed product information can make a real difference—whether you're tracking trends, comparing brands, or building your own catalog. This blog walked through how we can extract rich product details from Ethos Watches using Playwright and asynchronous Python code. By carefully navigating each product page, reading labels, and collecting specifications like price, strap material, or movement type, we’ve created a reliable system to turn unstructured website content into clean, organized data. For beginners or interns just starting out, this blog shows that with the right tools and a step-by-step approach, even a dynamic website can be broken down into meaningful insights that power smarter analysis and applications.
Libraries and Versions Used
Name: asyncio
Version: Built-in Python module (no separate installation required)
Name: json
Version: Built-in Python module (no separate installation required)
Name: random
Version: Built-in Python module (no separate installation required)
Name: sqlite3
Version: Built-in Python module (no separate installation required)
Name: logging
Version: Built-in Python module (no separate installation required)
Name: pathlib
Version: Built-in Python module (no separate installation required)
Name: playwright
Version: 1.48.0
AUTHOR
I’m Anusha P O, Data Science Intern at Datahut. I specialize in building smart scraping systems that automate large-scale data collection from complex e-commerce websites. In this blog, I walk you through how we extracted and structured detailed product information from Ethos Watches using Playwright, SQLite, JSON, and asynchronous Python workflows—turning intricate product pages into clean, analysis-ready datasets.
At Datahut, we help businesses unlock the full potential of web data by designing robust, scalable scraping solutions for competitive intelligence, product research, and market analysis. If you’re exploring data-driven strategies for e-commerce or want to organize large product datasets efficiently, reach out via the chat widget on the right. Let’s transform your raw web data into actionable insights.
FAQ SECTION
1. Is it legal to scrape Ethos product data using Python?
Scraping Ethos product data can be legal if it is done responsibly and in compliance with Ethos’ terms of service, robots.txt rules, and applicable data protection laws. Publicly available product information such as prices, brands, and availability is generally safer to scrape for research or analysis purposes.
2. Why should async tools be used for scraping Ethos product data?
Async tools allow multiple requests to be processed simultaneously, making scraping faster and more efficient. When scraping Ethos, which may have multiple product pages and categories, async frameworks significantly reduce scraping time while maintaining performance.
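The script in this post visits pages one at a time inside a single browser; a common way to add real concurrency is to bound it with a semaphore and asyncio.gather. A minimal sketch using httpx (one of the libraries mentioned in the next answer) might look like this:

import asyncio
import httpx  # pip install httpx

async def fetch(client, url, semaphore):
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        response = await client.get(url, timeout=30)
        await asyncio.sleep(1)  # polite pause between requests
        return response.text

async def fetch_all(urls, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with httpx.AsyncClient(follow_redirects=True) as client:
        return await asyncio.gather(*(fetch(client, u, semaphore) for u in urls))

# pages = asyncio.run(fetch_all(list_of_product_urls))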
3. Which Python libraries are best for async web scraping?
Popular Python libraries for async web scraping include aiohttp, asyncio, Playwright, and httpx. These tools help manage concurrent requests efficiently and handle dynamic content commonly found on eCommerce websites like Ethos.
4. How can I avoid getting blocked while scraping Ethos?
To avoid blocks, use techniques such as rotating user agents, setting proper request delays, limiting request frequency, and handling headers correctly. Async scraping should be carefully throttled to mimic human-like browsing behavior.
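As a hedged sketch of those ideas with Playwright (the user-agent strings and delay ranges below are placeholders to tune for your own runs):

import asyncio
import random
from playwright.async_api import async_playwright

USER_AGENTS = [
    # Shortened placeholders; use full, current user-agent strings in practice.
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

async def polite_visit(urls):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # A fresh context with a randomly chosen user agent for this run.
        context = await browser.new_context(user_agent=random.choice(USER_AGENTS))
        page = await context.new_page()
        for url in urls:
            await page.goto(url, timeout=30000)
            # Random, human-like pauses keep request frequency low.
            await asyncio.sleep(random.uniform(2, 5))
        await browser.close()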
5. What type of product data can be scraped from Ethos?
You can scrape product-related data such as product names, brand details, prices, discounts, ratings, availability, product descriptions, and category information. This data can be used for competitive analysis, price monitoring, and market research.