How Can Web Scraping Help Extract Refrigerator Data from Best Buy?
- Shahana Farvin

When it comes to electronics in the U.S., Best Buy is one of the biggest and most familiar names. Their online store has a huge range of home appliances, and refrigerators make up a big chunk of that—everything from compact, space-saving models to smart fridges that connect to Wi-Fi and even have screens on the front.
At first glance, collecting all this product information might seem simple. You might think, “Just write a script, grab the data, and you're done.” But in reality, it’s not that straightforward. Best Buy’s website uses JavaScript to load a lot of its content, which means the information doesn’t appear right away when you look at the raw page. If you try to scrape it using basic tools, you’ll likely get nothing useful.
So, to collect the data properly, we need smarter tools that behave more like a real person browsing the site. That means waiting for things to load, scrolling through pages, and clicking when needed. And here’s an interesting twist—Best Buy has some strong security measures to spot bots. We found that if we let our scraper run with the browser visible—so the browser actually opens up and shows what it’s doing—it had a much better chance of getting through without being blocked.
To make things easier, we broke the project into two clear steps. First, we focused on collecting all the links to the individual refrigerator products by scrolling through the listing pages automatically. Then in the second step, we visited each of those product pages one by one to gather more detailed information—like prices, features, and customer reviews.
This project is a great example of how web scraping today often means doing more than just grabbing data. It’s about figuring out how a website really works, handling content that changes as you browse, and making your scraper act as naturally as possible. And just as important—it’s about doing all of this in a respectful, thoughtful way that works with the website, not against it.
URL COLLECTION
Imports and Initial Setup
import asyncio
import random
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import sqlite3
from datetime import datetime
import logging
Before we jump into scraping anything, we need to gather all the tools our script will use—just like packing your toolkit before starting a project. This happens in the import section of our code, where we bring in different Python modules. Each one has a job, and together they help us build a smart, smooth, and reliable scraper.
Let’s start with Playwright. Think of it like a remote control for your web browser. It can open pages, click buttons, scroll down, and wait for things to load—just like a human would. This is super useful for websites like Best Buy, where much of the content only appears when you interact with the page. Without Playwright, we’d miss a lot of the data we need.
Once a page is fully loaded, we pass it to BeautifulSoup. If the web page were a messy room full of random papers, BeautifulSoup would be the person who calmly walks in and picks out the exact note you’re looking for. It helps us dig through all the HTML code and pull out just the useful stuff—like product names, prices, and links—without getting lost in everything else.
Then there’s SQLite, our simple way to store data. You can think of it as a built-in spreadsheet that quietly lives on your computer. It doesn’t need a fancy setup or internet connection, but it gives us an easy way to save everything we collect and organize it nicely. It also helps us track which product links we’ve already visited and which ones are still waiting.
To speed things up, we use asyncio, a tool that lets our program multitask. It’s like having several tabs open in your browser, all doing different things at once. This way, our scraper can handle more pages in less time.
We also include a few helpful sidekicks. The random module lets us add small delays between actions, like waiting a few seconds before visiting the next page. This makes our scraper behave more like a real person, which helps us avoid getting blocked by the website.
The datetime module is there to add time stamps—useful for keeping track of when we scraped something or organizing our data by date.
And finally, there’s logging. This tool keeps a quiet log of everything that happens while the scraper runs. Whether something was scraped successfully, failed, or skipped, logging writes it all down. That way, if something goes wrong, we can look back and figure out what happened.
With all these tools packed and ready, we’re set to start building the heart of our scraper—ready to explore dynamic websites, stay organized, and move smartly like a real user.
Logging Configuration
# Set up logging with timestamp and log level for better debugging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
The logging.basicConfig() function sets up the basic configuration for logging in your script.
level=logging.INFO means the logger will capture all messages that are INFO level or higher (like WARNING, ERROR, etc.).
format='%(asctime)s - %(levelname)s - %(message)s' customizes the log message format to include the timestamp, the log level (INFO, WARNING, etc.), and the actual log message.
This setup helps you track the flow of your scraper by recording when events happen and how serious they are — making it much easier to troubleshoot issues if something breaks during scraping.
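To make this concrete, here is a tiny, self-contained sketch of what the configured logger produces. The messages and timestamps are illustrative examples, not output from a real run:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Each call prints one timestamped line, for example:
# 2025-01-15 10:42:03,118 - INFO - Scraping page 3...
# 2025-01-15 10:42:41,902 - ERROR - Error on page 3: Timeout 120000ms exceeded.
logging.info("Scraping page 3...")
logging.error("Error on page 3: Timeout 120000ms exceeded.")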
User Agent Management
def load_user_agents(file_path):
    """
    Load user agent strings from a file to rotate during scraping.

    Args:
        file_path (str): Path to the text file containing user agent strings

    Returns:
        list: List of user agent strings with empty lines removed

    Note:
        Each user agent should be on a separate line in the file
    """
    with open(file_path, "r") as f:
        return [line.strip() for line in f if line.strip()]
Every time you open a website, your browser quietly introduces itself by saying, “Hi, I’m Chrome,” or “I’m Firefox on Windows,” or something similar. This little introduction is known as a user agent, and it helps the website understand what kind of browser you're using. But it can also be used to tell whether a visitor is a human or a bot.
In our case, since we’re building a scraper, we don’t want the website to immediately recognize us as a bot. So, we use a clever trick—our scraper puts on a disguise. That disguise is a user agent, and we’ve created a function that acts like a “disguise manager.”
Imagine having a closet full of outfits. Instead of always wearing the same one, you change your outfit every time you go out, making it harder for someone to recognize you. That’s exactly what this function makes possible: it reads from a file that contains many different browser identities, so the scraper can pick a fresh one at random for each page it visits.
It even tidies things up first by removing any blank lines in the file, so we don’t end up using broken or empty disguises. In the end, this simple step makes a big difference in helping our scraper blend in and avoid being blocked.
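For reference, here is a minimal sketch of how the loaded list gets used later in the script. The file name user_agents.txt comes from the entry point below; the strings in the comment are placeholders, and any plain-text file with one user agent per line will do:

import random

# user_agents.txt holds one browser identity per line, for example:
#   Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/120.0 Safari/537.36
#   Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ... Safari/605.1.15
user_agents = load_user_agents("user_agents.txt")

# Pick a fresh "disguise" for each new browser context
user_agent = random.choice(user_agents)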
Page Scrolling Handler
async def scroll_to_bottom(page):
    """
    Scroll to the bottom of the page to trigger lazy loading of products.

    Args:
        page: Playwright page object

    This function implements an infinite scroll detection mechanism by:
    1. Getting the current page height
    2. Scrolling to bottom
    3. Waiting for new content
    4. Comparing new height with previous height
    5. Breaking if heights are equal (no more content loaded)
    """
    previous_height = await page.evaluate("document.body.scrollHeight")
    while True:
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(2000)  # Wait for lazy-loaded content
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == previous_height:
            break
        previous_height = new_height
Modern websites have become quite clever—they don’t show you everything right away. Instead, they load more content only when you scroll down, much like how Instagram or Facebook works. This is great for saving bandwidth and improving speed for real users, but it adds a bit of a challenge for web scrapers.
To deal with this, we use what’s called a scroll handler.
Here’s how it works: first, it checks how tall the page currently is. Then it scrolls down a bit, waits a second or two to let new content load, and checks again to see if the page got taller. If the height doesn’t change after scrolling, it means there’s no more content to load—we’ve reached the bottom.
What makes this scroll handler especially useful is its ability to handle delays and timeouts. Sometimes websites are slow to load new items, and a less careful scraper might move on too soon and miss out on data. But this function waits patiently, checking carefully, and only stops once it's sure there’s nothing left to fetch. It’s like having a responsible assistant who doesn’t rush and always double-checks before finishing the job.
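If you ever want an extra guard against a page that keeps growing, or a scroll that never settles, a common tweak is to cap the number of scroll passes. The following is a sketch of that variation; max_passes and pause_ms are hypothetical parameters, not part of the original script:

async def scroll_to_bottom_capped(page, max_passes=30, pause_ms=2000):
    """Scroll until the page stops growing, or until max_passes is reached."""
    previous_height = await page.evaluate("document.body.scrollHeight")
    for _ in range(max_passes):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(pause_ms)  # give lazy-loaded items time to appear
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == previous_height:
            break  # nothing new loaded, so we are at the bottom
        previous_height = new_height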
Database Management
def initialize_db(db_name="bestbuy_refrigerators.db"):
    """
    Initialize SQLite database and create products table if it doesn't exist.

    Args:
        db_name (str): Name of the SQLite database file

    Returns:
        sqlite3.Connection: Database connection object

    The products table schema:
    - id: Primary key
    - url: Unique product URL
    - date_scraped: Date when URL was scraped
    - scraped: Flag indicating if product details have been scraped (0/1)
    """
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE,
            date_scraped DATE,
            scraped INTEGER DEFAULT 0
        )
    ''')
    conn.commit()
    return conn
When we gather data, it’s important not to just grab it and forget it—we need a safe place to store everything we find. That’s where our database manager comes in.
In our case, we use a SQLite database, which you can imagine as a supercharged version of an Excel sheet. It’s built to handle not just hundreds, but even millions of rows of data without slowing down. We set up a table inside this database to store key details: the product’s URL, the date we found it, and whether we’ve already collected its full details or not.
But one of the best things about this setup is how reliable it is. Let’s say your computer crashes in the middle of scraping. Normally, that would mean starting over—but not here. Our system keeps track of progress as it goes, almost like an autosave feature in a video game. So even if something goes wrong, you can just pick up right where you left off. This makes the whole process smoother, safer, and a lot less stressful.
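Because the scraped flag lives right next to the URLs, checking how far a run has progressed is a single query. A small sketch, assuming the database file created by initialize_db:

import sqlite3

conn = sqlite3.connect("bestbuy_refrigerators.db")
cursor = conn.cursor()

# Count how many URLs are done versus still waiting
cursor.execute("SELECT scraped, COUNT(*) FROM products GROUP BY scraped")
for flag, count in cursor.fetchall():
    print(f"{'scraped' if flag == 1 else 'pending'}: {count}")

conn.close()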
URL Storage
def insert_product_url(conn, url):
    """
    Insert a product URL into the database if it doesn't exist.

    Args:
        conn (sqlite3.Connection): Database connection object
        url (str): Product URL to insert

    Note:
        Uses INSERT OR IGNORE to handle duplicate URLs gracefully
    """
    cursor = conn.cursor()
    try:
        cursor.execute('''
            INSERT OR IGNORE INTO products (url, date_scraped, scraped)
            VALUES (?, ?, ?)
        ''', (url, datetime.now().date(), 0))
        conn.commit()
    except sqlite3.Error as e:
        logging.error(f"Error inserting URL {url}: {e}")
This function works a lot like a meticulous librarian, one who’s determined to keep the catalog tidy and free of duplicates. Each time a new URL comes in, the UNIQUE constraint on the url column guarantees the same link is never stored twice. After all, there’s no need to keep duplicate entries; that would only create unnecessary clutter.
To do this smartly, it uses a feature in SQLite called "INSERT OR IGNORE." That might sound technical, but the idea is simple: “Only add this new entry if it’s not already there.” It’s like having a filing system that quietly prevents you from putting the same paper in the drawer twice. Clean, efficient, and organized.
Now, what if something goes wrong? Maybe the database is busy at the moment, or there’s an oddly formatted URL that causes a hiccup. Instead of stopping the entire process, the function stays calm. It writes down the issue using our logging system and moves on. This is incredibly important when you’re dealing with thousands of URLs—you don’t want one bad link to bring your whole project to a halt. It’s built to be steady, reliable, and quietly persistent, just like that skilled librarian who keeps the shelves in perfect order, no matter what.
HTML Parsing
def parse_product_urls(content):
    """
    Extract product URLs from the page HTML content using BeautifulSoup.

    Args:
        content (str): HTML content of the page

    Returns:
        list: List of complete product URLs

    Note:
        URLs are constructed by appending the extracted href to the base Best Buy URL
    """
    soup = BeautifulSoup(content, "html.parser")
    product_urls = []
    for link in soup.select("li.sku-item > div > div > div > div.shop-sku-list-item > div.list-item.lv > div.column-left > a.image-link"):
        href = link.get("href")
        if href:
            full_url = f"https://www.bestbuy.com{href}"
            product_urls.append(full_url)
    return product_urls
Here’s where we start turning the raw web content into something meaningful. The tool that helps us do this is BeautifulSoup, which allows us to move through the HTML structure of a webpage. It lets us see exactly where the valuable pieces—like product links—are hiding.
This function is trained to look for very specific patterns in the HTML. It’s kind of like reading a treasure map where an ‘X’ marks the spot. When it finds what looks like a product link, it checks whether the link is complete. If it’s not, it cleverly builds the full URL by adding Best Buy’s domain to the front—ensuring we always get a valid, working link.
We’ve also taken care to make this process more reliable. Websites often change how they look, but not everything changes at once. So instead of relying on elements that might move around, we focus on patterns in the HTML that tend to stay the same. That way, our scraper remains useful and accurate, even as the site evolves over time.
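If Best Buy ever reshuffles the page and the long, deeply nested selector stops matching, one option is to fall back to a shorter pattern that still identifies product tiles. This is a sketch of that idea, not part of the original script:

from bs4 import BeautifulSoup

def parse_product_urls_fallback(content):
    """Try the precise selector first, then a looser one if the layout shifts."""
    soup = BeautifulSoup(content, "html.parser")
    links = soup.select(
        "li.sku-item > div > div > div > div.shop-sku-list-item > "
        "div.list-item.lv > div.column-left > a.image-link"
    )
    if not links:
        # Looser pattern: any image link inside a product tile
        links = soup.select("li.sku-item a.image-link")
    return [f"https://www.bestbuy.com{link['href']}" for link in links if link.get("href")]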
Main Scraping Logic
async def scrape_product_urls(base_url, user_agents_file, db_name="bestbuy_refrigerators.db"):
    """
    Main scraping function that coordinates the entire scraping process.

    Args:
        base_url (str): Template URL for Best Buy's refrigerator category pages
        user_agents_file (str): Path to file containing user agent strings
        db_name (str): Name of the SQLite database file

    This function:
    1. Loads user agents for rotation
    2. Initializes the database
    3. Iterates through pages
    4. Handles browser automation using Playwright
    5. Manages country selection popups
    6. Coordinates scrolling and content extraction
    7. Stores results in the database

    Anti-detection measures:
    - Random user agent rotation
    - Random delays between pages
    - Standard viewport size
    - Proper handling of lazy loading
    """
    user_agents = load_user_agents(user_agents_file)
    conn = initialize_db(db_name)
    async with async_playwright() as p:
        for page_num in range(1, 10):  # Loop through pages 1 to 9
            current_page = base_url.format(page_num=page_num)
            logging.info(f"Scraping page {current_page}...")
            user_agent = random.choice(user_agents)
            browser = await p.chromium.launch(headless=False)  # Visible browser reduces bot detection
            context = await browser.new_context(
                user_agent=user_agent,
                viewport={'width': 1920, 'height': 1080}  # Set a standard viewport size
            )
            page = await context.new_page()
            try:
                await page.goto(current_page, timeout=120000)
                # Handle country selection if it appears
                if "Choose a country" in await page.content():
                    logging.info("Detected country selection page, navigating to the US site...")
                    await page.click("body > div.page-container > div > div > div > div:nth-child(1) > div.country-selection > a.us-link")
                    await page.wait_for_load_state("domcontentloaded", timeout=120000)
                # Scroll to load all products
                await scroll_to_bottom(page)
                await page.wait_for_timeout(2000)  # Brief pause before grabbing the page content
                content = await page.content()
                product_urls = parse_product_urls(content)
                logging.info(f"Total URLs scraped from page {page_num}: {len(product_urls)}")
                for url in product_urls:
                    insert_product_url(conn, url)
                # Add a small delay between pages
                await page.wait_for_timeout(random.uniform(2000, 4000))
            except Exception as e:
                logging.error(f"Error on page {page_num}: {e}")
                continue
            finally:
                await page.close()
                await context.close()
                await browser.close()
    conn.close()
This part of our project is the true brain of the entire scraping operation. Think of it as a well-organized project manager, quietly coordinating a team where each member has a specific role—browsing pages, handling pop-ups, collecting data, and carefully filing it away.
When the function starts, it doesn’t just rush in. First, it gets everything ready. It loads up the different user agents we’ll use as disguises, sets up the database for storing our findings, and applies any configuration settings we need. This preparation is like a researcher setting up their tools before diving into a stack of library books—calm, focused, and methodical.
Then, as it goes through each page of product listings, it behaves more like a real person than a robot. It pauses for a random amount of time between actions, mimicking natural human behavior. It even knows how to deal with pop-ups—like a prompt asking you to select your country—by interacting with them just the way a real visitor would. And throughout this process, it rotates through different user agents to help avoid detection.
What really makes this function smart is how it handles problems. If one page doesn’t load or causes an error, it doesn’t stop everything. It simply makes a note of what went wrong in the log and moves on, just like a determined researcher who doesn’t let one missing book throw off the entire study. It’s thoughtful, adaptable, and persistent—exactly what a good scraping engine needs to be.
Script Entry Point
if __name__ == "__main__":
    # Base URL template for Best Buy's refrigerator category
    # Parameters:
    # - page_num: Page number for pagination
    # - Additional query parameters filter for in-stock items and category
    base_url = "https://www.bestbuy.com/site/searchpage.jsp?_dyncharset=UTF-8&browsedCategory=pcmcat1637590307724&cp={page_num}&id=pcat17071&iht=n&ks=960&list=y&qp=soldout_facet%3Dname~Exclude%20Out%20of%20Stock%20Items&sc=Global&st=pcmcat1637590307724_categoryid%24pcmcat367400050001&type=page&usc=All%20Categories"
    try:
        asyncio.run(scrape_product_urls(base_url, "user_agents.txt"))
    except KeyboardInterrupt:
        logging.info("Scraping interrupted by user")
    except Exception as e:
        logging.error(f"Fatal error: {e}")
This is where our scraper program actually begins to run, like turning the key in a car's ignition. It sets up the initial URL we want to scrape (Best Buy's refrigerator category in this case) and kicks off the scraping process.
The entry point also includes safety nets for when things go wrong. It handles two kinds of stops: a deliberate interruption (pressing Ctrl+C) and unexpected errors. Think of it as having both a regular brake and an emergency brake in your car.
DATA COLLECTION
Import Section
import sqlite3
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import asyncio
import json
The import section for this part of the project looks very similar to what we used earlier when scraping product URLs. It brings in all the essential tools: Playwright for browser automation, BeautifulSoup for navigating through HTML, asyncio for asynchronous programming, and SQLite for handling our database.
But there’s one new addition here: the json library. This small but powerful tool is especially handy when we're dealing with structured data, like product specifications. Often, product details are stored on a webpage in neat, organized blocks—almost like little data files hidden inside the HTML. These blocks are usually written in JSON format.
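To make that concrete, here is a tiny sketch of how a handful of specification rows end up stored once they are converted to JSON. The values are made-up examples, not data from a real listing:

import json

# Hypothetical key/value pairs pulled from a specs table
specs = {
    "Total Capacity": "25.5 cubic feet",
    "Cooling System": "Twin Cooling Plus",
    "Wi-Fi Enabled": "Yes",
}

# Stored in SQLite as a single text column...
key_specs_text = json.dumps(specs, indent=4)

# ...and turned back into a dictionary later for analysis
restored = json.loads(key_specs_text)
print(restored["Total Capacity"])  # 25.5 cubic feet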
Database Connection Functions
connect_db
# Database Connection
def connect_db(db_path='bestbuy_refrigerators.db'):
    """
    Create and return a connection to the SQLite database.

    Args:
        db_path (str): Path to the SQLite database file

    Returns:
        sqlite3.Connection: Database connection object
    """
    return sqlite3.connect(db_path)
The connect_db function is where our data storage journey truly begins. Think of it as opening the front door to our data vault. When this function is called, it connects to a file-based SQLite database. If the file isn’t there yet, it quietly creates one for us—no extra steps needed.
This connection is more than just a one-time handshake. It becomes a steady, reliable pipeline through which all our data flows—whether we’re saving new product information or checking what we’ve already collected. Without this connection, nothing else involving the database can happen.
We’ve also made the function flexible. By default, it connects to a database file named 'bestbuy_refrigerators.db', but we can point it at a different file if needed, which makes it reusable across projects. One caveat worth knowing: a standard sqlite3 connection isn’t meant to be shared across concurrent tasks, so in this script all database work happens sequentially from the main flow, which keeps things simple and safe.
So, you can think of this connection as setting up a dedicated phone line between our scraper and the database. It’s secure, always on, and forms the foundation of all the data-related tasks that follow. Without it, we’d have no way to store or retrieve anything—we’d be scraping into thin air.
create_tables
# Ensure necessary tables exist
def create_tables(conn):
    """
    Initialize database schema by creating necessary tables if they don't exist.

    Args:
        conn (sqlite3.Connection): Database connection object

    Tables created:
    1. scraped_data: Stores successful scraping results with product details
    2. error_urls: Tracks failed scraping attempts with error messages
    """
    cursor = conn.cursor()
    # Create `scraped_data` table if it doesn't exist with separate title and price columns
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS scraped_data (
            url TEXT,
            title TEXT,
            sale_price TEXT,
            price TEXT,
            discount TEXT,
            brand TEXT,
            model TEXT,
            sku TEXT,
            rating TEXT,
            reviews TEXT,
            key_specs TEXT,
            date TEXT
        )
    """)
    # Create `error_urls` table if it doesn't exist
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS error_urls (
            url TEXT UNIQUE,  -- UNIQUE so INSERT OR REPLACE can update an existing error row
            error_message TEXT,
            date_scraped TEXT,
            scraped INTEGER DEFAULT 0
        )
    """)
    conn.commit()
The create_tables function acts like the architect of our data storage system. Before we can store anything, we need a solid structure—a blueprint that tells the database exactly what kinds of information we’re going to save, and where each piece should go. That’s exactly what this function sets up.
It creates two key tables. The first one, called scraped_data, is where we’ll keep everything we collect about refrigerators. It includes columns for all the details we might extract, such as the product URL, title, sale price, regular price, discount, brand, model number, SKU, customer rating, review count, and a slot for detailed specifications. Every column is stored as text, which keeps the schema simple: prices, ratings, and even the JSON block of specifications all fit neatly in place.
The second table, error_urls, serves a different but equally important purpose. Sometimes, scraping a product page might fail—maybe the page didn’t load, or the data wasn’t in the expected format. Instead of letting these failures vanish, we log them here. This table includes the URL that caused trouble, the error message, the date it happened, and the current status—so we can come back and handle those issues later.
To keep things smooth, the function uses the “CREATE TABLE IF NOT EXISTS” command. This means it won’t break or complain if the tables already exist. You can safely run it every time the script starts, and it will only create the tables if they aren’t there yet. Finally, it commits the changes to the database right away, making sure the setup is saved and ready to go—like locking in the foundation of a building before construction begins.
Database Operation Functions
fetch_unscraped_urls
# Fetch unsaved URLs and their dates
def fetch_unscraped_urls(conn):
    """
    Retrieve URLs that haven't been successfully scraped yet.

    Args:
        conn (sqlite3.Connection): Database connection object

    Returns:
        list: Tuples of (url, date_scraped) for unscraped products
    """
    cursor = conn.cursor()
    cursor.execute("SELECT url, date_scraped FROM products WHERE scraped = 0")
    return cursor.fetchall()
The fetch_unscraped_urls function works like a smart task manager for our scraping system. Its job is to figure out which URLs still need attention—basically, the to-do list of pages we haven’t successfully scraped yet.
To do this, it looks into the products table, where every URL collected in the first stage lives. It specifically checks for rows where the scraped flag is set to 0, meaning those URLs haven’t been successfully processed yet. These are the tasks waiting in line, and this function helps our scraper pick them up and try again on the next run.
When the function runs, it returns two pieces of information for each URL: the link itself, and the date it was first added to the table. This not only gives us a clear view of which URLs are pending, but also provides historical context—like when we first tried to scrape them. That information helps us make better decisions, like whether to prioritize older tasks or simply track how long something has been in the queue.
Behind the scenes, the query is just a simple filter on the scraped flag, so it returns quickly even when the table holds thousands of entries, which matters for scraping projects that run for hours or days. In short, fetch_unscraped_urls makes sure we don’t miss anything and that we always know what still needs to be done, keeping the whole operation flowing smoothly.
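If the products table grows into the tens of thousands of rows, an optional index on the scraped flag can help this lookup stay quick once most rows are already marked as done. This one-liner is a sketch run against the conn object from connect_db, not part of the original schema:

# Optional: speed up "WHERE scraped = 0" lookups on a large products table
cursor = conn.cursor()
cursor.execute("CREATE INDEX IF NOT EXISTS idx_products_scraped ON products(scraped)")
conn.commit()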
mark_as_scraped
# Update 'scraped' status
def mark_as_scraped(conn, url):
    """
    Mark a URL as successfully scraped in the database.

    Args:
        conn (sqlite3.Connection): Database connection object
        url (str): URL of the scraped product
    """
    cursor = conn.cursor()
    cursor.execute("UPDATE products SET scraped = 1 WHERE url = ?", (url,))
    conn.commit()
The mark_as_scraped function plays the role of a progress tracker in our scraping system. Once we've successfully collected data from a URL—without any errors—this function steps in and updates the record in the database to reflect that the job is done.
Specifically, it changes the scraped flag to 1 in the products table for that particular URL. This update is important because it tells our system, “We’ve already taken care of this one—no need to do it again.” Without this step, we might end up scraping the same page over and over, wasting time and resources.
The function itself stays small and focused: it runs a single UPDATE and commits the change right away, so the progress is saved the moment it’s made. If the update fails for any reason, the exception simply bubbles up to the main loop’s error handling instead of quietly corrupting our records.
Even better, this function performs its update in what's called an atomic way. That means the operation either completes fully or doesn't happen at all—there’s no in-between. This helps protect the accuracy of our tracking system. If updates were only partially saved, we could end up with mixed or corrupted information about what’s been scraped and what hasn’t. So, mark_as_scraped is a small function with a big responsibility: it keeps our progress clear, our data collection efficient, and our records accurate.
save_scraped_data
# Save scraped data to the `scraped_data` table with separate title and price columns
def save_scraped_data(conn, url, title, sale_price, price, discount, brand, model, sku, rating, reviews, key_specs, original_date):
    """
    Store successfully scraped product data in the database.

    Args:
        conn (sqlite3.Connection): Database connection object
        url (str): Product URL
        title (str): Product title
        sale_price (str): Current sale price
        price (str): Regular price
        discount (str): Discount amount/percentage
        brand (str): Product brand
        model (str): Model number
        sku (str): SKU number
        rating (str): Product rating
        reviews (str): Number of reviews
        key_specs (str): JSON string of product specifications
        original_date (str): Date when URL was first discovered
    """
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO scraped_data (url, title, sale_price, price, discount, brand, model, sku, rating, reviews, key_specs, date) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (url, title, sale_price, price, discount, brand, model, sku, rating, reviews, key_specs, original_date)
    )
    conn.commit()
The save_scraped_data function is like the official archivist of our project. Its job is to take all the details we’ve collected about a refrigerator—like its name, price, brand, ratings, and more—and carefully store them in our database for future use.
Each time this function is called, it receives a set of inputs—one for each piece of product information. It then inserts all of that into the corresponding columns of the scraped_data table. Think of it like neatly filing each product’s info into the right drawer in a well-organized filing cabinet.
But this function doesn’t just glue values into a SQL string. It uses parameterized placeholders, so special characters in titles or specifications can’t break the query or open the door to SQL injection, and even complex fields like the JSON block of specs are passed through safely. This attention to detail helps keep our records clean, consistent, and easy to retrieve later.
Another important aspect is the commit step. Once the data is inserted, the function commits the change to the database, locking it in. This means that even if something unexpected happens later—like the script crashes or the internet drops—whatever was already saved stays safe.
Lastly, this function always ties each entry back to the original product URL. That way, we always know exactly where the data came from. This traceability is key when double-checking information or fixing issues later on. In short, save_scraped_data is what makes sure all our hard-earned data is safely stored, correctly organized, and easy to access.
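Because every row carries the product URL, pulling a single product back out later takes one query. A quick sketch, assuming the connection from connect_db; the URL shown is just a placeholder:

cursor = conn.cursor()
cursor.execute(
    "SELECT title, sale_price, price, rating FROM scraped_data WHERE url = ?",
    ("https://www.bestbuy.com/site/example-refrigerator/0000000.p",),  # placeholder URL
)
print(cursor.fetchone())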
save_error_url
# Save error details to `error_urls` table
def save_error_url(conn, url, error_message, original_date):
    """
    Log failed scraping attempts for retry.

    Args:
        conn (sqlite3.Connection): Database connection object
        url (str): Failed product URL
        error_message (str): Description of the error
        original_date (str): Date when URL was first discovered
    """
    cursor = conn.cursor()
    cursor.execute(
        "INSERT OR REPLACE INTO error_urls (url, error_message, date_scraped, scraped) VALUES (?, ?, ?, ?)",
        (url, error_message, original_date, 0)
    )
    conn.commit()
The save_error_url function works like the troubleshooter of our scraping system. Whenever something goes wrong while trying to scrape a page—maybe the site didn’t load properly, or the data wasn’t found—this function steps in to log exactly what happened.
It records three key pieces of information: the URL that caused the issue, a detailed error message explaining what went wrong, and the date the URL was first discovered, so the record keeps its original context.
This detailed logging is incredibly helpful later on. It lets us see patterns—for example, if the same URL fails multiple times or if similar errors keep popping up. With that kind of insight, we can better understand where the system needs fixing or adjusting.
Instead of blindly adding new entries every time, this function uses an approach called "INSERT OR REPLACE". That means if the same URL has failed before, it simply updates the existing error record with the latest message and time. This keeps our database clean and avoids duplicate entries.
It also resets the scraped flag back to 0, which is like saying, “Hey, this one didn’t work—try again later.” That way, the scraper knows to come back to it in future runs.
Perhaps the most valuable part? The error messages themselves. These tell us whether the issue was caused by the website changing, a network hiccup, or something wrong in our own scraping code. Over time, this error history becomes a goldmine for improving the system—making it stronger, more accurate, and more reliable with each update.
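When it’s time to revisit those failures, the same table doubles as a ready-made retry list. A minimal sketch; fetch_failed_urls is a hypothetical helper, not part of the original script:

def fetch_failed_urls(conn):
    """Return URLs that errored out and are still waiting for another attempt."""
    cursor = conn.cursor()
    cursor.execute(
        "SELECT url, error_message, date_scraped FROM error_urls WHERE scraped = 0"
    )
    return cursor.fetchall()

# Example usage:
# for url, error_message, first_seen in fetch_failed_urls(conn):
#     print(f"Retry candidate: {url} (last error: {error_message})")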
Web Scraping Helper Functions
handle_country_selection
# Handle country selection if prompt appears
async def handle_country_selection(page):
    """
    Handle Best Buy's country selection popup if it appears.

    Args:
        page: Playwright page object
    """
    if "Choose a country" in await page.content():
        await page.click("body > div.page-container > div > div > div > div:nth-child(1) > div.country-selection > a.us-link")
        await page.wait_for_load_state("domcontentloaded", timeout=60000)
The handle_country_selection function acts like our geographic guide, helping the scraper navigate to the correct version of the Best Buy website—specifically, the U.S. version. This step is important because many international websites display a prompt asking visitors to choose their country, and we need to make sure we're seeing the same content that a U.S.-based customer would see.
Here's how it works: when the scraper lands on a page, it checks if there’s a country selection pop-up. If it finds one, the function automatically selects “United States.” This ensures we get access to the right products, prices, and layout—just as we expect.
But the function doesn’t rush. It patiently waits for the page to fully reload after making the selection. It uses Playwright’s built-in waiting features to confirm that the site is ready for the next steps. This helps avoid problems like trying to scrape a page before it's finished loading, which could lead to missing or broken data.
What also makes this function convenient is that it’s quietly optional: if the country prompt never appears, the check simply comes back false and the scraper moves on without interruption. The click itself, however, assumes the popup’s markup hasn’t changed, so it’s worth knowing where this step lives. One way to make it more forgiving is sketched below.
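The variation below wraps the click in a try/except so a failed selection is logged instead of raised. It is a sketch, and handle_country_selection_safe is a hypothetical name, not part of the original script:

import logging

async def handle_country_selection_safe(page):
    """Pick the US site if the country prompt appears; carry on quietly if it can't."""
    if "Choose a country" in await page.content():
        try:
            await page.click(
                "body > div.page-container > div > div > div > "
                "div:nth-child(1) > div.country-selection > a.us-link"
            )
            await page.wait_for_load_state("domcontentloaded", timeout=60000)
        except Exception as e:
            # The popup looked different than expected; note it and keep going
            logging.warning(f"Country selection could not be handled: {e}")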
scroll_to_bottom
# scrolling function
async def scroll_to_bottom(page):
    """
    Implement infinite scroll to ensure all content is loaded.

    Args:
        page: Playwright page object

    This function scrolls until no new content is loaded, as determined by
    comparing page heights before and after scrolling.
    """
    previous_height = await page.evaluate("document.body.scrollHeight")
    while True:
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(2000)  # Wait for lazy-loaded content
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == previous_height:
            break
        previous_height = new_height
The scroll_to_bottom function plays the role of our content explorer. Its job is to make sure we don’t miss any data that's hidden behind endless scrolling—just like when you're on Instagram or Facebook and more posts appear as you scroll down.
Many modern websites, including Best Buy, use this technique called lazy loading or infinite scrolling, where new items load only when you scroll further. So, if we only scrape what’s initially visible, we’d miss out on a large part of the product list.
That’s where this function comes in. It scrolls the page in small steps, just like a real user would—going down, pausing, waiting for more content to appear, and checking if the page has grown in size. It does this by comparing the height of the page before and after each scroll. If the height stays the same, it knows we’ve likely reached the end.
To make the process smooth and responsible, the function also waits between scrolls, giving the page time to load and avoiding too many requests at once. This reduces the risk of being blocked by the website or loading content too fast to capture.
And if something doesn’t go as expected, say a scroll is interrupted or the page misbehaves, any exception simply bubbles up to scrape_data, which reports it as a scraping error for that URL, so one bad page never crashes the whole run.
Data Extracting Functions
extract_title
def extract_title(soup):
    """Extract product title from the page."""
    title_tag = soup.select_one('div.sku-title > h1')
    return title_tag.get_text(strip=True) if title_tag else "NOT AVAILABLE"
This function works like a focused scanner, carefully searching for the main title of the product on a webpage—specifically, the refrigerator’s name.
It uses BeautifulSoup’s select_one method, which allows us to target a very specific part of the page using a CSS selector. In this case, it looks for the text inside the HTML tag 'div.sku-title > h1', which is usually where Best Buy places the product title.
You can think of this like using a magnifying glass to examine a very particular section of a document. If the product title is there, the function grabs the text, cleans up any extra spaces, and returns it neatly.
But websites aren’t always predictable. Sometimes the layout changes, or maybe the product title is temporarily missing. To prepare for this, the function is designed with a backup plan. If it can’t find the title where it's expected, it simply returns "NOT AVAILABLE" instead of causing the whole scraper to crash. This makes the function both precise and resilient.
So in simple terms, this function is like a smart assistant that goes straight to where the product name is supposed to be, checks if it’s there, and either brings it back cleaned up—or politely lets us know it couldn’t find it.
Similarly, other parsing functions are used to extract details like sale price, brand, original price, discount, IDs, rating, and reviews. They follow the same pattern as the title parser—targeting specific elements and returning the value or a default if not found. See the code next for how each one works.
extract_sale_price
def extract_sale_price(soup):
    """Extract current sale price if available."""
    saleprice_tag = soup.select_one('div.flex.gvpc-price-1-2441-31 > div:nth-child(1) > div:nth-child(1) > div > span:nth-child(1)')
    return saleprice_tag.get_text(strip=True) if saleprice_tag else "NOT AVAILABLE"
extract_brand
def extract_brand(soup):
    """Extract product brand name."""
    brand_tag = soup.select_one('div.pb-200 > a')
    return brand_tag.get_text(strip=True) if brand_tag else "NOT AVAILABLE"
extract_price
def extract_price(soup):
    """Extract regular product price."""
    price_tag = soup.select_one('div.flex.gvpc-price-1-2441-31 > div:nth-child(1) > div.pricing-price__savings-regular-price > div.pricing-price__regular-price-content--block.pricing-price__regular-price-content--block-mt > div:nth-child(1) > span')
    return price_tag.get_text(strip=True) if price_tag else "NOT AVAILABLE"
extract_discount
def extract_discount(soup):
    """Extract discount information if available."""
    discount_tag = soup.select_one('div.flex.gvpc-price-1-2441-31 > div:nth-child(1) > div.pricing-price__savings-regular-price > div.pricing-price__savings.pricing-price__savings--promo-red')
    return discount_tag.get_text(strip=True) if discount_tag else "NOT AVAILABLE"
extract_ids
def extract_ids(soup):
    """Extract model and SKU numbers."""
    model_tag = soup.select_one('div.title-data.lv > div > div.model.product-data.pr-100.inline-block.border-box > span.product-data-value.text-info.ml-50.body-copy')
    sku_tag = soup.select_one('div.title-data.lv > div > div.sku.product-data.pr-100.inline-block.border-box > span.product-data-value.text-info.ml-50.body-copy')
    return (model_tag.get_text(strip=True) if model_tag else "NOT AVAILABLE",
            sku_tag.get_text(strip=True) if sku_tag else "NOT AVAILABLE")
extract_rating_reviews
def extract_rating_reviews(soup):
    """Extract product rating and number of reviews."""
    rating_tag = soup.select_one('ul > li > a > div > span.ugc-c-review-average.font-weight-medium.order-1')
    reviews_tag = soup.select_one('ul > li > a > div > span.c-reviews.order-2')
    return (rating_tag.get_text(strip=True) if rating_tag else "NOT AVAILABLE",
            reviews_tag.get_text(strip=True) if reviews_tag else "NOT AVAILABLE")
extract_key_specs
def extract_key_specs(soup2):
    """Extract and format product specifications as JSON."""
    specs = {}
    for row in soup2.find_all('div', class_='zebra-row flex p-200 justify-content-between body-copy-lg'):
        complete_text = row.get_text(separator='\n').strip()
        lines = [line.strip() for line in complete_text.split('\n') if line.strip()]
        if len(lines) >= 2:
            specs[lines[0]] = lines[1]
    return json.dumps(specs, indent=4)
This function is like a super-organized assistant whose job is to carefully go through every line of a refrigerator’s specification sheet and turn it into something we can actually work with—clean, structured data.
Think of it as someone reading a product brochure. For each row in the specifications table, the assistant looks at the feature name—like “Capacity” or “Cooling System”—and then notes down the corresponding value, such as “25.5 cubic feet” or “Twin Cooling Plus.” The function collects all these key-value pairs and saves them in a format called JSON, which is great for storing and comparing data in a structured way.
This step is especially important because the specs section usually contains the most technical details about the product, and it's where you really start to understand the differences between models. Having this information organized lets us do things like comparing sizes, energy efficiency, or smart features across many refrigerators at once.
It’s also designed to be thorough and cautious. The function looks only for rows with certain class names to avoid grabbing unrelated content. This helps make sure we collect the right kind of information while skipping the clutter.
Main Scraping Functions
scrape_data
# Scrape data for a single URL
async def scrape_data(url):
    """
    Scrape all product details from a single URL.

    Args:
        url (str): URL of the product page to scrape

    Returns:
        tuple: All scraped product details

    Raises:
        RuntimeError: If any error occurs during scraping

    This function:
    1. Launches a browser instance
    2. Navigates to the product page
    3. Handles country selection if needed
    4. Scrolls to load all content
    5. Extracts all product details
    6. Clicks to view full specifications
    7. Extracts detailed specifications
    """
    try:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            page = await browser.new_page()
            await page.goto(url)
            # Handle country selection if it appears
            await handle_country_selection(page)
            # Scroll to load all content on the product page
            await scroll_to_bottom(page)
            await page.wait_for_timeout(9000)
            print(f"scraping: {url}")
            soup = BeautifulSoup(await page.content(), 'html.parser')
            # Use the separate parsing functions
            title = extract_title(soup)
            sale_price = extract_sale_price(soup)
            price = extract_price(soup)
            discount = extract_discount(soup)
            brand = extract_brand(soup)
            model, sku = extract_ids(soup)
            rating, reviews = extract_rating_reviews(soup)
            # Click the "Show Full Specs" button to reveal the specifications panel
            await page.click('div.col-xs-7 > div > button.c-button.c-button-outline.c-button-md.show-full-specs-btn.col-xs-6')
            # Wait for the specifications content to load after clicking the button
            await page.wait_for_timeout(3000)  # Adjust timeout as needed for the content to load
            # Retrieve and parse the new page content
            soup2 = BeautifulSoup(await page.content(), 'html.parser')
            # Additional parsing for the specifications panel
            key_specs = extract_key_specs(soup2)
            await browser.close()
    except Exception as e:
        raise RuntimeError(f"Error scraping {url}: {e}")
    return title, sale_price, price, discount, brand, model, sku, rating, reviews, key_specs
The scrape_data function is like the main director of our entire scraping process. It’s the one calling the shots—making sure every other part of the script works together smoothly to collect data from a single product page.
First, it opens up a browser window using Playwright, with headless mode switched off so the browser is actually visible. As we noted earlier, running with a visible browser noticeably reduced the chances of Best Buy flagging us as a bot.
Once the browser lands on the product page, the real coordination begins. The function checks if there's a country selection popup and makes sure the U.S. version of the site is selected. Then, it scrolls down the page, step by step, allowing all the product information to load—just like a human would scroll slowly to read the page.
After that, it pulls together all the important data: the product title, prices, ratings, specifications, and more. It does all of this while handling each step with care. If something doesn’t go as planned—maybe the layout changed or an element didn’t load—it catches the error and moves on without crashing the entire process. This kind of error handling is what makes the function reliable even when the website isn’t acting perfectly.
And once everything is done, the function doesn’t just walk away. It makes sure to clean up—closing browser tabs and freeing up resources. Think of it as someone turning off the lights and shutting the door after finishing work.
Overall, scrape_data brings together all the smaller parts of our scraper—handling the page, collecting details, storing the data, and managing problems—all in one smooth and efficient flow. It’s the backbone of the entire scraping system.
main
# Main scraper function
async def main():
    """
    Main execution function that coordinates the scraping process.

    This function:
    1. Establishes database connection
    2. Ensures necessary tables exist
    3. Retrieves unscraped URLs
    4. Attempts to scrape each URL
    5. Saves successful results and logs failures
    6. Handles cleanup
    """
    conn = connect_db()
    # Ensure tables exist
    create_tables(conn)
    urls_dates = fetch_unscraped_urls(conn)
    for url, original_date in urls_dates:
        try:
            # Await scrape_data since it's async
            title, sale_price, price, discount, brand, model, sku, rating, reviews, key_specs = await scrape_data(url)
            save_scraped_data(conn, url, title, sale_price, price, discount, brand, model, sku, rating, reviews, key_specs, original_date)
            mark_as_scraped(conn, url)  # Mark as scraped only after the data is saved
        except Exception as e:
            error_message = str(e)
            save_error_url(conn, url, error_message, original_date)  # Log the failure for retry
    conn.close()

if __name__ == "__main__":
    asyncio.run(main())
The main function is the supreme orchestrator of our entire scraping operation. It begins by establishing the database connection and ensuring that our data structures—such as tables—are correctly set up. This function creates the workflow backbone, tying together all the components of the scraping system.
It manages the overall control flow: fetching the list of URLs that need processing, launching scraping attempts one by one, handling successful scrapes, capturing any errors, and ensuring that the results—whether successful data or failure logs—are stored appropriately. Importantly, it includes top-level error handling, which means that even if individual URLs fail, the entire scraping process continues without interruption.
In addition to coordination, the main function handles resource management—cleaning up database connections and freeing up system resources after scraping is done. It is designed to support long-running sessions with stability and integrity, making sure no data is lost and system performance stays reliable.
Finally, it serves as the true entry point for the script, initiating the asyncio event loop that powers our asynchronous scraping logic. This ensures our scraper runs efficiently—handling multiple tasks at once—while maintaining tight control over flow and exceptions.
Conclusion
Scraping refrigerator data from Best Buy showcases how automation can streamline large-scale data collection, making it significantly easier to analyze product listings, prices, and availability. By combining Playwright for handling dynamic web elements with BeautifulSoup for precise data extraction, we were able to gather well-structured information efficiently—without manual effort. This method proves especially valuable for price tracking, trend analysis, and market research, empowering both businesses and consumers to make smarter decisions.
That said, web scraping must be used responsibly. It’s essential to respect a website’s policies, avoid overwhelming servers with too many requests, and implement proper delays between actions. Understanding the structure of Best Buy’s site, managing dynamic content carefully, and ensuring ethical practices are key to sustainable data extraction.
As e-commerce platforms grow more complex and data-driven decisions become the norm, web scraping stands out as a critical skill for analysts, developers, and researchers who want to stay competitive in a fast-paced digital market.
AUTHOR
I’m Shahana, a Data Engineer at Datahut, where I specialize in building smart, scalable data pipelines that convert raw, dynamic web content into clean, actionable insights—fueling better decisions across retail and consumer electronics.
At Datahut, we’ve spent more than a decade helping businesses harness the power of automation for product tracking, price intelligence, and competitive analysis. In this blog, I walk you through how we used Playwright and BeautifulSoup to scrape refrigerator product data from Best Buy—capturing detailed specifications, prices, and availability from a JavaScript-heavy site with accuracy and efficiency.
If your team is exploring ways to automate product data collection in the electronics sector—or any high-volume e-commerce domain—feel free to reach out through the chat widget on the right. We’d be happy to help you design a robust scraping solution that meets your needs.
FAQ SECTION
1. What kind of refrigerator data can be extracted from Best Buy using web scraping?
You can extract a wide range of data points such as product names, model numbers, prices (regular and discounted), availability, customer ratings, number of reviews, brand, capacity, energy efficiency ratings, dimensions, features (like smart connectivity, water dispensers), warranty details, and promotional offers.
2. Is it legal to scrape refrigerator data from BestBuy.com?
Scraping public data is generally legal, but Best Buy’s website terms may restrict automated access. To avoid legal issues, it’s best to use scraping techniques responsibly (e.g., rate limiting) and review the site's robots.txt and terms of service.
3. Why would someone scrape refrigerator data from Best Buy?
Common reasons include:
Price comparison with other retailers
Market research on brands and features
Inventory monitoring for stock availability
Competitor analysis
Affiliate marketing or price tracking tools
4. Can I track price drops of refrigerators using web scraping?
Yes, by scheduling regular scrapes (e.g., daily or weekly), you can detect changes in pricing, identify limited-time deals, or track discount trends over time.
5. What tools are used to scrape Best Buy for refrigerator data?
Popular tools include Python libraries such as BeautifulSoup, Scrapy, Selenium, and Playwright. For large-scale projects, you may also use cloud scraping services or scraping APIs.
6. How often should I scrape Best Buy for accurate refrigerator data?
It depends on your use case. For price tracking, daily or hourly scrapes may be ideal. For product listing or feature analysis, a weekly or bi-weekly scrape might suffice.
7. Will Best Buy block my scraper?
Yes, if your scraper sends too many requests too quickly or violates site rules. Use techniques like rotating IPs, proxies, and user-agent headers, and add delay between requests to reduce the risk of being blocked.
8. Can scraped refrigerator data be used in a product comparison app?
Yes, as long as you comply with copyright and terms of use. Many apps use scraped data to display features, reviews, and prices for comparison purposes.
9. What’s the difference between using Best Buy’s API and scraping the website?
Best Buy offers a public API with structured data access, but it may have usage limits or restricted access. Scraping allows you to extract data not available in the API, like promotional banners or dynamically loaded specs.