How to Scrape E-Commerce Websites with Python in 2026
- Tony Paul

- Jul 7, 2022
- 10 min read
Updated: 3 days ago

A few years back, you could build a working e-commerce scraper in about 20 lines of Python. Just requests, BeautifulSoup, done. We wrote a guide on exactly that, and people loved it. Run that same code against a real store today and you will probably get blocked before you read a single product. The fundamentals still hold up fine. It is the world around them that changed. So let's fix that together.
(Quick definition if you are just getting started. Web scraping is pulling data off web pages and turning it into something you can use, like a CSV, some JSON, or a row in a database. That's it.)
Which Python library should you use for web scraping?
Use requests with BeautifulSoup for plain HTML pages, curl_cffi when you are being fingerprint-blocked, Playwright when the content is rendered by JavaScript, and Scrapy when you are building a full crawler with thousands of URLs.
I will be honest with you. Most scrapers do not fail because of clever anti-bot tech. They fail because somebody grabbed the wrong library for the page in front of them and then spent two hours confused. So here is the map before we touch any code.
A quick word on httpx, since people always ask. It is genuinely good. If you are building an API client, I would reach for it over requests most days. But for scraping it does not buy you much. It rides the same standard TLS stack, so the moment a site checks fingerprints, httpx gets bounced right alongside requests. Knowing it exists is enough. We won't build on it here.
The rough path most scraping jobs follow looks like this: start with requests plus BeautifulSoup. Get blocked? Swap in curl_cffi. Data's nowhere in the HTML? Bring in Playwright. Scraping thousands of pages? Wrap the whole thing in Scrapy.
That is the trail we are walking, with PUMA as our willing test subject.
What are we scraping, and what data do we want?
Same target as the original guide: PUMA's Manchester City FC collection. It is a real storefront, listing pages and product pages and all, which keeps this honest.
Listing URL: https://in.puma.com/in/en/collections/collections-football/collections-football-manchester-city-fc
We want four fields per product:
Product URL. The unique link to each page. You will want this for pretty much any e-commerce dataset, because it is how you tell one product from another down the line.
Product name. Something like "Manchester City 25/26 Men's Home Replica Jersey".
Price. What it is going for right now. PUMA shows both the sale price and the original when something's discounted, which matters later.
Description. The copy about materials, fit, the usual.
The libraries doing the work: requests is the dependable old HTTP library. Point it at a URL, get a response. BeautifulSoup takes that HTML and lets you pick through it with CSS selectors. csv ships with Python, so there is nothing to install, and it writes everything to a file. A little later we add curl_cffi to slip past anti-bot systems and Playwright for the JavaScript-heavy pages.
Let's get set up:
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install requests beautifulsoup4 curl_cffi playwright
playwright install chromium
Everything here is written for Python 3.11+.
How do you scrape an e-commerce site with requests and BeautifulSoup?
Fetch the listing page, collect every product link, visit each product page, pull the fields out with CSS selectors, and write everything to a CSV. The complete working scraper follows the plan below, and then we will walk through the parts that matter.
Let's start right where the old guide did. The plan has not moved an inch:
Hit the listing page with the jerseys.
Grab every product link, stash them in a list.
Visit each product page, one at a time.
Pull the fields out with CSS selectors.
Drop it all into a CSV.
Now here is a small gift. Since 2022, PUMA quietly made our job easier. The product name and description now live in clean <meta> tags on each page (og:title and og:description). Reading those beats chasing deep CSS classes that get renamed every redesign. It is a good habit to carry everywhere, actually. Here is the whole scraper:
import csv
import requests
from bs4 import BeautifulSoup

LISTING_URL = (
    "https://in.puma.com/in/en/collections/"
    "collections-football/collections-football-manchester-city-fc"
)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/146.0.0.0 Safari/537.36"
    )
}

def get_soup(url: str) -> BeautifulSoup:
    response = requests.get(url, headers=HEADERS, timeout=20)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")

def collect_product_links(listing_url: str) -> list[str]:
    soup = get_soup(listing_url)
    links = set()
    # PUMA product pages follow the pattern /in/en/pd/{slug}/{id}
    for a in soup.select('a[href*="/pd/"]'):
        href = a["href"]
        if href.startswith("/"):
            href = "https://in.puma.com" + href
        links.add(href.split("?")[0])  # drop the ?swatch= query string
    return sorted(links)

def parse_product(url: str) -> dict[str, str]:
    soup = get_soup(url)
    name_tag = soup.select_one('meta[property="og:title"]')
    desc_tag = soup.select_one('meta[property="og:description"]')
    price_tag = soup.select_one('meta[property="product:price:amount"]')
    return {
        "name": name_tag["content"].strip() if name_tag else "",
        "price": price_tag["content"].strip() if price_tag else "",
        "description": desc_tag["content"].strip() if desc_tag else "",
        "url": url,
    }

def main() -> None:
    links = collect_product_links(LISTING_URL)
    print(f"Found {len(links)} products")
    with open("puma_manchester_city.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["name", "price", "description", "url"]
        )
        writer.writeheader()
        for link in links:
            writer.writerow(parse_product(link))

if __name__ == "__main__":
    main()
A handful of things changed from the old version, and they are worth pausing on.
We hand BeautifulSoup response.content, the raw bytes, rather than .text. That lets the parser sort out the encoding itself instead of guessing wrong and mangling a stray character.
Every selector lookup is wrapped in if tag else "". There is always one product with a missing field somewhere in the catalog, and the classic crash is calling .text on a None that a selector quietly handed back. Ask me how I know.
The CSV opens once, and the header plus every row get written inside that same with block, so writer is always in scope. The old guide split this across two blocks, which is exactly how you end up staring at a "writer is not defined" error wondering what you did wrong.
And we always send a believable User-Agent. In 2022 that usually got you through the door. In 2026, often not even close, which is the whole next section.
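If the response.content point feels abstract, here is a tiny standard-library sketch of what a wrong encoding guess does. The product string is made up for illustration, not taken from PUMA's pages. The ISO-8859-1 guess is the fallback requests uses for text responses whose headers declare no charset.

```python
# Hypothetical product text, encoded as UTF-8 the way most modern sites serve it.
raw_bytes = "Men's Jersey ₹4,999".encode("utf-8")

# A wrong guess: decoding the same bytes as ISO-8859-1 never raises,
# it just quietly turns the rupee sign into three junk characters.
mangled = raw_bytes.decode("iso-8859-1")

# The right codec gives the original text back.
clean = raw_bytes.decode("utf-8")

print(repr(mangled))
print(repr(clean))
```

The nasty part is that nothing crashes. The mangled version only shows up later, in your CSV, which is why handing the parser raw bytes is the safer default.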
Why is your Python scraper getting blocked even with a real User-Agent?
Because the site is reading your TLS fingerprint, not your headers. Python's requests library runs on OpenSSL, and its TLS handshake looks nothing like a real browser's. Cloudflare and Akamai check that handshake before they ever look at your User-Agent, so your carefully faked headers never even get a turn.
Run our classic scraper against PUMA from a cloud or datacenter IP and there is a real chance you get a 200 OK... that is actually a bot-detection page wearing a disguise. Not your products. A CAPTCHA in a trench coat. And you are sitting there thinking, my User-Agent literally says Chrome, what more do you want?
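One cheap way to catch that disguise is to stop trusting the status code and check for something only a real product page would contain. The sketch below is an assumption-heavy heuristic, not an official detection method: the helper name looks_blocked is hypothetical, and the default marker is simply the og:title meta tag our scraper already depends on.

```python
def looks_blocked(
    status_code: int,
    html: str,
    expected_marker: str = 'property="og:title"',
) -> bool:
    """Return True when a response is probably a block page.

    Heuristic only: 403 and 429 are treated as blocks outright, and a 200
    is trusted only if the HTML contains the marker we came for.
    """
    if status_code in (403, 429):
        return True
    # A "200 OK" without the content we expect is treated as a disguised block.
    return expected_marker not in html


product_page = '<html><head><meta property="og:title" content="Jersey"></head></html>'
challenge_page = "<html><body>Please verify you are human</body></html>"

print(looks_blocked(200, product_page))    # real page, marker present
print(looks_blocked(200, challenge_page))  # 200 status, but no product markup
```

In practice you would call this right after the request and log or retry instead of parsing an empty page into an empty CSV row.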
Here is the part that took me ages to understand. The moment your client opens an HTTPS connection, it sends a TLS handshake (the ClientHello), and the exact set of cipher suites, extensions, and the order they arrive in forms a kind of signature. The JA3/JA4 hash. requests has one signature. Chrome has another. The site compares them and turns you away at the door.
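To make the "signature" idea less mysterious, here is a rough sketch of how a JA3 hash is put together: five ClientHello fields written as decimal numbers, dashes inside a field, commas between fields, then MD5 over the result. The numeric values below are invented for the example and are not the real fingerprints of requests or Chrome. Real tools also strip GREASE values first, which this sketch skips, and JA4 uses a different format that is not covered here.

```python
import hashlib


def ja3_hash(
    tls_version: int,
    ciphers: list[int],
    extensions: list[int],
    curves: list[int],
    point_formats: list[int],
) -> str:
    """Build the JA3 string from ClientHello fields and return its MD5 hex digest."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    # MD5 is used only as a fingerprint here, not for security.
    return hashlib.md5(ja3_string.encode("ascii"), usedforsecurity=False).hexdigest()


# Two hypothetical clients offering the same ciphers in a different order.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
client_b = ja3_hash(771, [4867, 4866, 4865], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(client_a)
print(client_b)
print(client_a == client_b)
```

Same ciphers, different order, different hash. That is why changing headers does nothing: none of these fields come from your headers.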
How does curl_cffi fix TLS fingerprint blocking?
curl_cffi is a Python binding to a special build of curl that copies a real browser's TLS and HTTP/2 fingerprint, and it mimics the requests API almost line for line. Patching our scraper is nearly a one-liner:
from curl_cffi import requests

def get_soup(url: str) -> BeautifulSoup:
    # impersonate a real Chrome TLS + HTTP/2 fingerprint
    response = requests.get(url, impersonate="chrome", timeout=20)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
That is the change. The rest of the scraper does not move, because curl_cffi was deliberately built to slot in where requests was. A few things worth knowing for 2026:
Just use impersonate="chrome", the bare alias. It tracks the newest supported fingerprint for you.
The latest pinned targets right now are around chrome146 and safari260, but here is the sneaky bit. Pinning some old profile is itself a giveaway. An outdated Chrome fingerprint in 2026 is its own red flag. Let the alias follow the current one and stop thinking about it.
Keep your fingerprints current. Since v0.15.1 you can just run curl-cffi update in the terminal to pull the latest fingerprint database. No reinstall.
Reuse a Session for cookies and connection pooling, same as you would with requests:
from curl_cffi import requests
session = requests.Session(impersonate="chrome")
soup_listing = BeautifulSoup(session.get(LISTING_URL).content, "html.parser")
For the big pile of sites that lean mostly on TLS fingerprinting (most e-commerce catalogs, pricing pages, listing data) curl_cffi on its own carries you through. The other half of the fight is your IP's reputation, which is a separate problem.
Rotating residential proxies keep one address from getting rate-limited or banned. We get into how to wire those up in our guide to using proxies for web scraping, and if you want the full Cloudflare-specific playbook, that is the whole point of our web scraping without getting blocked with curl_cffi post.
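The rotation part itself is small enough to sketch here. This is a minimal round-robin picker using only the standard library. The proxy URLs are placeholders and the names PROXY_POOL and proxy_rotation are hypothetical, so treat it as the shape of the idea rather than a production setup.

```python
from itertools import cycle

# Placeholder endpoints, not real proxies. Swap in the ones your provider gives you.
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
    "http://user:pass@proxy-3.example.com:8000",
]


def proxy_rotation(pool: list[str]):
    """Yield a requests-style proxies mapping for each proxy, round-robin, forever."""
    for proxy in cycle(pool):
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXY_POOL)
picked = [next(rotation)["https"] for _ in range(4)]
print(picked)  # the fourth pick wraps back around to the first proxy
```

With curl_cffi you would hand each request the next mapping, along the lines of session.get(url, proxies=next(rotation)), since it accepts the same kind of proxies mapping that requests does.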
When do you need Playwright instead?
You need Playwright when the data is not in the HTML at all, because the page builds itself with JavaScript after loading. curl_cffi beats fingerprint blocks, but it cannot run JavaScript. React and Vue storefronts hand you a nearly empty HTML shell, then fetch all the real content with background calls once a browser actually loads. Infinite-scroll category pages do the same.
The frustrating version of this: you write a perfect scraper, run it, and get back... nothing. No products. No error. Just empty. You did not do anything wrong. The data was not in the HTML yet, because no browser ran the JavaScript that fetches it.
That is your cue to bring in a real browser, and Playwright is the one most people reach for now.
Here we keep it short and show the shape of it. If you want the full walkthrough, scraping a real dynamic, JavaScript-heavy store end to end with lazy-loading and pagination and all, we have a whole separate guide on scraping a dynamic e-commerce website with Python. This section is just the "here's when and why" version.
from playwright.sync_api import sync_playwright

def scrape_listing_with_playwright(url: str) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # wait until the JS-rendered product links actually exist
        page.wait_for_selector('a[href*="/pd/"]')
        links = set()
        for a in page.query_selector_all('a[href*="/pd/"]'):
            href = a.get_attribute("href")
            if href:
                links.add(href.split("?")[0])
        browser.close()
    return sorted(links)
Two habits will save you real grief. Use wait_for_selector (or wait_until="networkidle") so you are reading the page after the JavaScript has filled it in, not mid-load when half the products do not exist yet. And run headless in production. For infinite scroll, loop page.mouse.wheel(0, 3000) with little pauses until nothing new turns up.
One caution, though. A real browser is heavy and slow next to a plain HTTP client. So only reach for Playwright when you actually need it. If curl_cffi already hands you the data, just use curl_cffi and walk away happy.
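The infinite-scroll loop mentioned above can be sketched as a small helper. The function name scroll_until_stable and its defaults are my own assumptions. It expects a Playwright sync Page and only uses page.mouse.wheel, page.wait_for_timeout, and page.locator(...).count().

```python
def scroll_until_stable(
    page,
    selector: str,
    max_rounds: int = 20,
    pause_ms: int = 800,
) -> int:
    """Scroll down until the number of elements matching `selector` stops growing.

    `page` is expected to be a Playwright sync Page. Returns the final count.
    """
    seen = page.locator(selector).count()
    for _ in range(max_rounds):
        page.mouse.wheel(0, 3000)        # scroll down by 3000 pixels
        page.wait_for_timeout(pause_ms)  # give lazy-loaded items time to arrive
        now = page.locator(selector).count()
        if now == seen:
            break  # nothing new turned up, assume we hit the end
        seen = now
    return seen
```

You would call it between page.goto(...) and the link-collecting loop, for example scroll_until_stable(page, 'a[href*="/pd/"]'). One stalled round is treated as the end of the list, so on a slow connection you may want a longer pause or a couple of extra confirmation rounds.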
How do you scrape hundreds of pages without waiting forever?
Use asyncio with curl_cffi's AsyncSession, and cap your concurrency with a semaphore. You get the browser fingerprint and the concurrency in one shot.
Our PUMA category has 60 products, so going one at a time is fine. You will not even notice. But pull hundreds or thousands of URLs that way and it gets painful fast, because you are spending almost all your time just waiting on the network to answer.
import asyncio
from curl_cffi.requests import AsyncSession
from bs4 import BeautifulSoup

async def parse_product(session: AsyncSession, url: str) -> dict[str, str]:
    response = await session.get(url, impersonate="chrome", timeout=20)
    soup = BeautifulSoup(response.content, "html.parser")
    name = soup.select_one('meta[property="og:title"]')
    return {
        "name": name["content"].strip() if name else "",
        "url": url,
    }

async def scrape_all(urls: list[str]) -> list[dict[str, str]]:
    semaphore = asyncio.Semaphore(10)  # cap concurrency, be polite
    async with AsyncSession() as session:
        async def bounded(url: str):
            async with semaphore:
                return await parse_product(session, url)
        return await asyncio.gather(*(bounded(u) for u in urls))

if __name__ == "__main__":
    product_urls = [...]  # the list from collect_product_links()
    rows = asyncio.run(scrape_all(product_urls))
    print(f"Scraped {len(rows)} products concurrently")
That Semaphore is the quiet hero here. It caps how many requests fire at once, which keeps you polite and keeps you out of rate-limit jail. Five to ten at a time is a sane starting point. Nudge it up or down depending on how the site reacts. (And if you need browser-based scraping at this scale, Playwright has its own async version, async_playwright, that works the same way.)
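If you want to see the cap at work without hitting a real site, here is a self-contained sketch that swaps the network call for a fake. Everything in it is hypothetical: run_bounded, fake_fetch, and the example.com URLs stand in for scrape_all, parse_product, and real product links. It also passes return_exceptions=True to gather, which is my addition, so one failed page does not throw away the results of the others.

```python
import asyncio

stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(url: str) -> dict[str, str]:
    """Stand-in for a real request: sleeps briefly and tracks concurrency."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    try:
        await asyncio.sleep(0.01)
        if url.endswith("/bad"):
            raise RuntimeError(f"failed: {url}")
        return {"url": url}
    finally:
        stats["in_flight"] -= 1


async def run_bounded(urls, fetch, limit: int = 10):
    """Run fetch(url) for every URL with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # Exceptions come back as values instead of aborting the whole batch.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)


urls = [f"https://example.com/p/{i}" for i in range(25)] + ["https://example.com/bad"]
results = asyncio.run(run_bounded(urls, fake_fetch, limit=5))
rows = [r for r in results if not isinstance(r, Exception)]
errors = [r for r in results if isinstance(r, Exception)]
print(f"{len(rows)} ok, {len(errors)} failed, peak concurrency {stats['peak']}")
```

The peak never climbs past the limit you pass in, no matter how many URLs you queue up. That number is the dial you turn when a site starts pushing back.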
Is the scraped data clean and ready to use?
Not yet. Same honest answer as the original scraper gave. The raw output always needs a little work. Prices come through with currency symbols and commas. Names sometimes carry trailing whitespace, or a colour stuck on the end. Descriptions get padded with marketing fluff.
In a real pipeline you would add a cleaning pass, usually with pandas, to turn prices into actual numbers and collapse the duplicate variants of one product into a single row.
Pulling it together
Step back and the modern Python scraping stack is not really one tool. It is a decision tree:
requests + BeautifulSoup for simple, unguarded pages.
curl_cffi the second you hit TLS-fingerprint blocks. Nearly a drop-in swap.
Playwright when the data only shows up after JavaScript runs, or you need to interact with the page.
asyncio + curl_cffi once you are scaling to big URL lists.
Rotating proxies on top of any of those, once your IP reputation becomes the thing holding you back.
E-commerce scraping is what quietly powers competitor price monitoring, stock tracking, and product research. Brands all over the world run on it. But doing it at scale is a real, grinding job. The anti-bot cat-and-mouse never stops. Site redesigns break your selectors at 2am. Proxies need feeding. The cleaning never quite ends. A 20-line script is a brilliant way to learn the shape of it all. A pipeline you can actually depend on is a different animal entirely.
If you would rather have clean, reliable web data without babysitting any of that, come talk to us at Datahut. Managed web scraping is the thing we do all day.
Related reading:
Web Scraping vs. Web Crawling: Which One Do You Need?
Frequently Asked Questions
What is the best Python library for web scraping in 2026?
It depends on the page. requests with BeautifulSoup handles simple HTML pages. curl_cffi is the best choice for sites protected by Cloudflare or Akamai because it copies a real browser's TLS fingerprint. Playwright is needed when content is rendered by JavaScript. Scrapy is best for large crawls with thousands of URLs.
Why does my Python scraper get blocked even with a browser User-Agent?
Because anti-bot systems check your TLS fingerprint (the JA3/JA4 hash of your HTTPS handshake) before reading your headers. Python's requests library has an OpenSSL fingerprint that looks nothing like Chrome's, so the block happens before your User-Agent is ever seen. curl_cffi solves this by impersonating a real browser's handshake.
What is the difference between curl_cffi and Playwright?
curl_cffi is a fast HTTP client that mimics a browser's network fingerprint but cannot run JavaScript. Playwright drives a real browser, so it can render JavaScript-heavy pages, click, and scroll, but it is much slower and heavier. Use curl_cffi first, and Playwright only when the data is not in the raw HTML.
Is web scraping e-commerce sites legal?
Scraping publicly available product data is generally legal when done responsibly: respecting robots.txt, avoiding server overload, and not collecting personal data. Compliance requirements vary by region (GDPR, CCPA), so businesses scraping at scale should follow ethical scraping practices or use a managed provider that handles compliance.
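The robots.txt part of that answer is easy to automate with the standard library. The rules, the user-agent name, and the shop.example.com URLs below are invented for the example. A real run would load the site's own file with set_url(...) followed by read() instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
Disallow: /account/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# "my-scraper" is a hypothetical user-agent name.
allowed = parser.can_fetch("my-scraper", "https://shop.example.com/pd/some-jersey/123")
blocked = parser.can_fetch("my-scraper", "https://shop.example.com/checkout/cart")

print(allowed, blocked)
```

Running a check like this before each crawl is a small habit, and it only covers the robots.txt piece. Rate limits and personal data still need their own care.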

