Which tools are best for scraping data from AllMachines?

To scrape data from AllMachines effectively, tools like Playwright and BeautifulSoup are commonly used. They help handle dynamic content and parse HTML efficiently.

How do you handle infinite scrolling while scraping AllMachines?

Websites like AllMachines often use infinite scrolling to load products dynamically. Using automation tools such as Playwright or Selenium can simulate user scrolling to load and capture all data.

How can I avoid being blocked while web scraping?

Websites can block bots if scraping patterns look suspicious. You can avoid this by rotating proxies, randomizing request headers, and mimicking human browsing behavior.

What are the challenges in scraping data from sites like AllMachines?

Common challenges in scraping include handling JavaScript-heavy sites, CAPTCHAs, legal compliance, and IP blocking. Each can be managed with the right scraping setup and ethical practices.

How to Scrape Product Data from AllMachines: A Step-by-Step Guide

Shahana farvin
3 hours ago
40 min read

Did you ever think about how comparison websites get the prices and details of the same product from so many online stores? There’s a pretty little trick called web scraping that does it.

You can think of web scraping as almost sending a tiny robot to various websites to collect similar information and extract titles, prices and descriptions. Over the years, that robot has gotten very intelligent! The advent of new technologies like headless browsers (browsers that run in the background - they don’t have a window), and workarounds to avoid being blocked by sites to prevent scraping has made the techniques for web scraping more powerful and reliable.

In this blog I will take you through a project where we use web scraping to scrape data from a site called AllMachines - a site that lists farming equipment. Why farming tools? Well, just like comparing phones, or comparing laptops helps a person make a buying decision or understand the market, comparing tractors and other machines should help in that process too.

We divided the project into two simple sections:

1)Collecting links to products: First, we collected all of the web addresses (URLs) for the product pages on the site.

2)Collecting details for products: Then we visited each of those product pages and collected important information such as product name, features, specifications, etc...

This two-part process makes it easier to manage and is handy if we need to modify or extend one of the sections in the future.

Now, let us move into the code and see where all these sections fit together to create our smart farming equipment data collector!

Links Collection

This web scraping project was designed to scrape product listings from a website called AllMachines, which features farming equipment. The aim is straightforward: scrape through categories like tractors, harvesters, balers and other farm machinery, and scrape the product links into a database.

Once the scraper locates these links, it saves them in a lightweight database called SQLite - almost like a small notebook to save information to use on the product links we have scraped.

To make this process run smoothly, we utilized two powerful pieces of technology: Playwright and BeautifulSoup.

The scraper is sophisticated enough to handle some tricky web features, for example infinite scrolling (where more products appear below the current page as you scroll down) as well as having built-in error handling and logging so in the event of an error we know how to ascertain the problem and fix it.

Import Section

import time
import sqlite3
import logging
import traceback
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

Before we dive into the actual scraping, let’s go over the tools (or libraries) we will be using in the code. You can think of these as applications or tools, which help us accomplish specific tasks.

Here’s the list of the libraries that we have imported:

time: This library helps us pause between actions. You can think of it as a short break so that we don’t overwhelm the website with too many requests at once.
sqlite3: This library will provide us with a simple database, where we can save the product links and other information we scrape. You can think of it as a saving note for something we can look back on.
logging: This library will keep track of what happens in the course of the script. It is extremely helpful for seeing if things are functioning correctly or if the web-scraper has stopped running.
traceback: If our code generates an error, this library will show us exactly where the code broke down, so we can diagnose the issue more effectively.

Next, here are the real stars of the show:

BeautifulSoup: This library allows us to read it and extract the information from the code of a webpage (HTML). This is like scanning a recipe and pulling off just the list of ingredients.
Playwright: This one is really neat. It allows our code to interact with sites like a human would - clicking buttons, scrolling on pages, closing popups. This is most useful when dealing with a lot of Javascript from a webpage (which is responsible for features we take for granted, like sliders, infinite scrolling, or even pop-up menus).

Logging Setup

# Configure logging to write logs to a file with a specific format
logging.basicConfig(
    filename='scraper.log',
    filemode='w',  # Overwrite the log file each time the script runs
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

Logging is pretty much your code's diary, where you note what happens while it runs. It's really important because if something goes wrong, your log can provide context for what happened and when.

In our scraper, we have implemented everything so that all those notes (or "logs") are saved in a file called scraper.log.

Here are a couple of key things about our implementation:

filemode='w': This means that every time we run the code, we start a new log file. We don't come back to an enormous log file filled with outdated messages that we don't need any longer.
Logging level is INFO: We are going to log all relevant info, and leave out tiny details that we don't need to know (unless it's deep debugging).

We intentionally structured the log messages in an organised fashion. Every log message has:

The timestamp of when the event occurred
The event type (such as info/warning/error)
The message describing what happened

This way of structuring the log helps you simply look through the logs later, find a problem, or just see how it went.

By doing this from the start, we now have a scraper that almost tells its own story, which isn't any more helpful than it being there when we improve or update the scraper or fix bugs.

Database Functions

def init_db(db_name="allmachines_products.db"):
    """
    Initialize the SQLite database and create the `product_links` table if it doesn't exist.

    This function connects to the SQLite database specified by `db_name`. If the database file does not exist,
    it will be created. The function also ensures that the `product_links` table is created with the required
    schema to store product categories and URLs.

    Args:
        db_name (str): The name of the SQLite database file. Defaults to "allmachines_products.db".

    Returns:
        sqlite3.Connection: A connection object to interact with the SQLite database.

    Logs:
        - INFO: When the database is successfully initialized and the table is created.
        - ERROR: If there is an error during database initialization.

    Raises:
        Exception: If there is an error during database initialization, it logs the error and raises it.
    """
    try:
        conn = sqlite3.connect(db_name)
        cursor = conn.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS product_links (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                category TEXT,
                url TEXT UNIQUE
            )
        """)
        conn.commit()
        logging.info("Initialized database and created table if not exists.")
        return conn
    except Exception as e:
        logging.error(f"Database initialization error: {e}")
        logging.debug(traceback.format_exc())
        raise

The init_db function is like preparing our storage box before we go and find anything. And in this particular case, the "box" refers to a SQLite database that we are going to use in order to store the product links that we find from scraping.

What this function does is the following:

First, it creates a connection to the database file. If this file does not existed, it will create one for us. This file is where we will store the data we collect.

Second, it creates a table named product_links. A table, you could think of it like a spreadsheet, and it has rows and columns, which is a place to store our information.

In this product_links table, there are three columns:

id: A number that increments each time we add a new entry (like row number in Excel).
category: Will tell us what type of equipment it is, like "tractor" or "harvester".
url: A link to the actual product on the web.

We have also added a UNIQUE constraint on the URL column to prevent saving the same product URL more than once. If the scraper finds the same URL a second time, it will not save a record again. Do we see how this is useful?

We added error handling here as well. In case of some failure regarding the database setup, the function will log an error so it's easier to find out what was wrong.

By putting the setup inside of a separate function, we created a cleaner method of setting up our schemas, and organized our code. If you think about it, it's a bit like organizing your tools in a labeled box before you start a project; it makes everything neat and orderly.

def save_links_to_db(conn, category, links):
    """
    Save the extracted product links to the SQLite database.

    This function inserts the extracted product links into the `product_links` table. Duplicate entries
    are ignored using the `INSERT OR IGNORE` SQL statement.

    Args:
        conn (sqlite3.Connection): The SQLite database connection object.
        category (str): The category of the products (e.g., "tractors", "combine-harvesters").
        links (set): A set of product links to be saved in the database.

    Returns:
        None

    Logs:
        - INFO: The number of links saved for the given category.
        - ERROR: If there is an error during the database operation.
    """
    try:
        cursor = conn.cursor()
        for link in links:
            # Insert links into the database, ignoring duplicates
            cursor.execute("INSERT OR IGNORE INTO product_links (category, url) VALUES (?, ?)", (category, link))
        conn.commit()
        logging.info(f"Saved {len(links)} links for category '{category}' to the database.")
    except Exception as e:
        logging.error(f"Database save error for category {category}: {e}")
        logging.debug(traceback.format_exc())

The save_links_to_db function is where we actually save the links we found for all the products into our database. Sort of like taking everything you wrote down in your notepad and typing in a real file that won't get lost.

This function needs three things to work:

The database connection we set up
The category with the equipment (tractors, harvesters, etc.)
A set of product links we just scraped from the website.

Here's how it works:

The function loops through each link one after the other and tries to insert into the database.
We perform a neat little trick with the database: INSERT OR IGNORE. This tells the database: "hey, try to add this link, but if it already exists, ignore it." This means we avoid saving the same product/item more than once and we don't have to deal with annoying errors.
Instead of saving (committing) each link individually, it saves (commits) them all at once at the end. This is much more efficient—like putting everything in one box and carrying it, instead of taking up to 10 trips.

As per all of our scraper, we added logging here as well. So every time we save links, we also log how many were saved and from which category. That way, we can track how thing are progressing.

And like all of the other functions, if something goes wrong, it doesn't crash the entire app. It can catch an error, log what happened, then move on. So even if our scraper encounters a hiccup it remain strong.

Scraper Functions

def load_all_products(page):
    """
    Perform infinite scrolling on the page to load all products.

    This function simulates infinite scrolling by repeatedly scrolling to the bottom of the page
    until no new content is loaded. It is used to ensure that all products are loaded on pages
    with lazy-loading or infinite scrolling mechanisms.

    Args:
        page (playwright.sync_api.Page): The Playwright page object representing the browser page.

    Returns:
        None

    Logs:
        - INFO: Each scroll action performed.
        - INFO: When no more content is loaded, indicating the end of scrolling.
        - WARNING: If there is an error during the scrolling process.
    """
    try:
        previous_height = None
        while True:
            # Scroll to the bottom of the page
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            logging.info("Scrolled to the bottom of the page.")
            time.sleep(5)  # Wait for new content to load

            # Get the current height of the page
            current_height = page.evaluate("document.body.scrollHeight")
            if previous_height == current_height:
                # Stop scrolling if no new content is loaded
                logging.info("No more content to load.")
                break
            previous_height = current_height
    except Exception as e:
        logging.warning(f"Error during infinite scrolling: {e}")
        logging.debug(traceback.format_exc())

The load_all_products function helps us out with something called infinite scrolling - which is sometimes difficult to scrape.

Let's clarify.

I imagine you have encountered a website where there are more products that keep loading as you scroll down. That is an example of infinite scrolling, where instead of loading all of the items at once, the website provides more items while you're scrolling down. For users this creates a very nice experience, but for scrapers it means we have to simulate human scrolling otherwise we miss a lot of content!

This is what this function does. It automatically scrolls down to the bottom of the page, waits a few seconds, and then asks the question:

"Um, is the page any taller?" (which is means new products were added)

If the page does not get taller after we scroll, it is signal that there are no more products to load. We made it to the bottom!

On each scroll, there’s also a five second pause to allow the website to load more products, as you would do naturally waiting a moment while scrolling for the next items to suddenly appear.

In addition, this function can be tough and clever. If something unexpected happens, for example the scroll fails to work unexpectedly, If for whatever reason the page does not load properly. This function won’t crash everything entirely, it manages to catch those errors and log them, enabling you to check later after that wrap up.

To keep it short:

This function ensures we are not missing any products that are hidden behind infinite scrolling. Simply put, it would be like taking your time to scroll down a shopping site patiently until you have looked at every item sitting on the shelf.

def extract_product_links(html):
    """
    Extract product links from the HTML content of the page.

    This function parses the HTML content using BeautifulSoup and extracts all product links
    from the specified CSS selector. It constructs full URLs for relative links.

    Args:
        html (str): The HTML content of the page as a string.

    Returns:
        set: A set of unique product links extracted from the page.

    Logs:
        - INFO: The number of links extracted from the page.
        - ERROR: If there is an error during the HTML parsing process.
    """
    links = set()
    try:
        soup = BeautifulSoup(html, "html.parser")
        # Select all anchor tags within the specified CSS selector
        for a in soup.select(" div.flex.justify-between.items-center > a"):
            href = a.get("href")
            if href:
                # Construct the full URL if the link is relative
                full_url = "https://www.allmachines.com" + href if href.startswith("/") else href
                links.add(full_url)
        logging.info(f"Extracted {len(links)} product links from page.")
    except Exception as e:
        logging.error(f"HTML parsing error: {e}")
        logging.debug(traceback.format_exc())
    return links

Alright, let’s talk about the heart of our scraper — the part where we actually grab the product links from the page. That’s what the extract_product_links function does.

Imagine you’ve just scrolled all the way down on an online store’s page (like we did in the previous step). Now, your screen is filled with tons of farming equipment listings. What we need to do next is pick out the links that lead to each of those products.

That’s where this function comes in.

First, it takes the full web page (in HTML format — kind of like the skeleton of a webpage) and feeds it into a tool called BeautifulSoup. This tool is like a super-smart highlighter — it helps us easily spot and pull out specific parts of the webpage. Think of it as using "Find" in a document, but with extra powers.

In this case, we’re looking for special pieces of HTML code — anchor tags (<a>) inside certain boxes (or <div>s) that are styled in a specific way. These are the boxes that AllMachines uses to hold product links. We tell BeautifulSoup, “Hey, find any anchor tags that live inside boxes with these class names: flex, justify-between, and items-center.” (Class names are just labels that websites use to organize their layout.)

Once we find those tags, we grab the href attribute from each one — that’s the part that holds the actual URL.

But sometimes these links are only partial, like:/product/tractor-123To fix that, we add the website’s main URL at the beginning, so it becomes a full, working link:https://allmachines.com/product/tractor-123

One more cool trick here — we use a set to store all the links. Why? Because a set automatically removes duplicates. So even if the same product appears twice on the page, we’ll only keep it once.

And of course, just like the other functions, we’ve added some error handling. That means if something goes wrong while reading the page, the whole scraper won’t crash — it will simply log the error so you can check it later.

Oh, and yes — it also logs how many links it found, which is super helpful to track how things are going.

def scrape_category(category, url, conn):
    """
    Scrape a specific category page for product links and save them to the database.

    This function navigates to the category page, performs infinite scrolling to load all products,
    extracts product links from the page, and saves them to the database.

    Args:
        category (str): The category of the products (e.g., "tractors", "combine-harvesters").
        url (str): The URL of the category page to scrape.
        conn (sqlite3.Connection): The SQLite database connection object.

    Returns:
        None

    Logs:
        - INFO: The start and completion of scraping for the given category.
        - ERROR: If there is an error during the scraping process.
    """
    try:
        with sync_playwright() as p:
            # Launch the browser
            browser = p.chromium.launch(headless=False)
            context = browser.new_context()  # Create a new browser context
            page = context.new_page()  # Open a new page

            logging.info(f"Scraping category: {category} | URL: {url}")
            page.goto(url)  # Navigate to the category URL
            time.sleep(3)  # Wait for the page to load

            load_all_products(page)  # Perform infinite scrolling
            html = page.content()  # Get the page content
            product_links = extract_product_links(html)  # Extract product links

            save_links_to_db(conn, category, product_links)  # Save links to the database
            browser.close()  # Close the browser
    except Exception as e:
        logging.error(f"Error scraping category '{category}': {e}")
        logging.debug(traceback.format_exc())

Let’s break down what the scrape_category function does — it’s kind of like the team leader in charge of collecting product links from one specific equipment category (like tractors or harvesters).

Here’s the step-by-step story:

When this function starts, it opens up a browser window — just like you would when using Chrome or Firefox. We’re using a tool called Playwright to do this. It lets us control the browser automatically with code (like a robot clicking and scrolling for us). And since we’re setting headless=False, the browser actually opens up on your screen — which is super helpful while you're testing or debugging, so you can watch the scraper in action.

Next, the browser goes to a category page — say, the one for “Tractors.”

Once the page loads, the scraper starts scrolling down automatically — just like a real user. This is important because many websites only load more products when you scroll, a feature called infinite scrolling. So, we keep scrolling until everything is loaded.

After that, we grab all the HTML (the website’s behind-the-scenes code), and pass it to a helper function that picks out all the product links on the page. It’s kind of like scanning through a grocery list and pulling out all the items you need.

Once we have the links, we save them into a database using another function. This way, we have a nice organized list of links that we can come back to later.

Finally, the browser closes down, so we’re not using up unnecessary memory or keeping windows open in the background. It’s like washing your hands and tidying up your desk after you're done.

And what if something goes wrong?

Don’t worry — this function also includes error handling. That means if something fails (like a page doesn’t load properly), it won’t crash the whole process. Instead, it’ll log the error, skip that category, and move on to the next one. So your scraper keeps running smoothly even if one piece goes a little off track.

Main Script

def main():
    """
    Main function to scrape all categories and save product links to the database.

    This function initializes the database, iterates through all categories and their URLs,
    scrapes each category page for product links, and saves the extracted links to the database.

    Steps:
        1. Initialize the SQLite database.
        2. Iterate through all categories and their URLs.
        3. Scrape each category page for product links.
        4. Save the extracted links to the database.

    Logs:
        - INFO: The start and completion of the scraping process.
        - CRITICAL: Any critical errors encountered during execution.

    Returns:
        None
    """
    category_urls = {
#hint:tractor section can be is filtered and scraped because of high number of products and to handle infinite scrolling
        "tractors": "https://www.allmachines.com/tractors/view-all",
        "combine-harvesters": "https://www.allmachines.com/combine-harvesters/view-all",
        "balers": "https://www.allmachines.com/balers/view-all",
        "forage-harvesters": "https://www.allmachines.com/forage-harvesters/view-all",
        "combine-headers": "https://www.allmachines.com/combine-headers/view-all",
        "forage-headers": "https://www.allmachines.com/forage-headers/view-all",
        "rakes": "https://www.allmachines.com/rakes/view-all",
        "tedders": "https://www.allmachines.com/tedders/view-all",
        "specialty-crop-harvesters": "https://www.allmachines.com/specialty-crop-harvesters/view-all"
    }

    try:
        conn = init_db()  # Initialize the database
        for category, url in category_urls.items():
            scrape_category(category, url, conn)  # Scrape each category
        conn.close()  # Close the database connection
        logging.info("All categories scraped and saved to SQLite successfully.")
    except Exception as e:
        logging.critical(f"Fatal error in main(): {e}")
        logging.debug(traceback.format_exc())

if __name__ == "__main__":
    main()

When you are beginning your journey into web scraping using Python, it can be hard to keep track of everything going on in a script. One of the most important parts of the script — and one of the pieces that ties everything together — is the main() function. You can think of it as the control center for your scraper, as it is what directs the code for what to do, when to do it, and in what order it should be done.

In our case, for our web scraping project, the main() function starts off with setting up a list of the categories we would like to scrape. Each category (for example, tractors, excavators and forklifts) has its own webpage on the AllMachines website. We setup each category with its specific URL as a dictionary — somewhat like taking note of what you need to pick up from different supermarkets to be as efficient as possible, when you actually do your shopping!

After we prepare the list, the function opens a connection to a database where all of the product details we scrape will be stored (like names, prices, or specs). Then it processes each category in order and passes the URL to the function called scrape_category(). The scrape_category() function will do all of the real scraping, visiting the page, collecting the data and saving it. While the function called main() is just the hub for it all to keep moving.

Once all categories have been scraped, the main() function closes the database connection. This is crucial to keep it neat and tidy so that data doesn't get corrupt or lost. It is like closing the lid on a storage box after packing everything away neatly.

A neat feature of this script is the last line, if __name__ == "__main__": .This may look odd to you now, but this just means, "only run this script when we're running it as a script." If someone else wants to reuse some of our code — only the scraping function for example — they can import this script and not have the whole scraping process happen. It's just a nice way to make your code semantic and reusable.

Structuring our code in this manner makes it simple to update or add new categories down the road. The script is neat, easy to follow, and not likely to include bugs. For people learning how to build real scraping tools, this is a really good example of keeping things simple and practical.

So if you're new to Python or scraping, don't worry! main() is something to be understood, not feared. It is simply there to ensure that everything runs in order (like a recipe). Once you understand how to run it in your IDE and how it works the rest of the code will make a lot more sense!

Data Collection

Now we enter the second phase of our web scraping journey— this is where the action takes place! By this point, the scraper will visit every URL in our lists. But it is NOT just going to simply grab what is visible to it — it will patiently wait until the entire page has been fully loaded. This is key since, many sites today use javascript to load data dynamically — meaning the information is not necessarily all available upon page opening.

What kinds of things do we collect? Everything from product names and descriptions, prices, catchy marketing phrases, unique scores or ratings, important product features, and even deep technical specs, basically everything you want to know about before making a purchase on a piece of agriculture equipment!

In the background, we keep track of everything we scrape using SQLite — think of it as a notebook that tracks the things we've already scraped! With this data organization, we can easily resume where we left off if something goes wrong and need to pause or restart. In addition, it safely stores all the data in such a fashion that we can access it later for analysis, as reports, or funnel it into other apps.

So with smart tools and good organization, we can ensure that our scraping is much more than a one-time solution. Our scraping solution is reliable, easy-to-maintain, and resilient so we can scale as the website grows. This means that whether we are tracking 10 products or 10,000 products, our scraper will work, and that's a major accomplishment when working with the scaled data in the agriculture and machinery space!

Import Section

import sqlite3
import logging
import traceback
import time
import json
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

In this step, we are going to make only a small change to our list of imported libraries: we're adding the json library. This little addition really helps us when dealing with complex and nested pieces of data! Think about it like packing a messy drawer into a neat labeled box. This library will help us take all the detailed information about our product and put it into a standard format (string) that is easy to save in our database.

Other than that, all the other imports remain the same as we had during our previous step when we were capturing the product links. So there's nothing super new or complicated about this step - simply a useful upgrade to accommodate more products and more detailed data!

Logging Configuration

# Configure logging to write logs to a file with a specific format
logging.basicConfig(
    filename='product_scraper.log',
    filemode='w',  # Overwrite the log file each time the script runs
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

In this step, we will make only a minor adjustment to the way we do logging - and it is simply a matter of organization. Instead of writing to the same log file as we did in the first phase, we will write to a new log file called product_scraper_phase2.log. This will help us log everything in a separate and organized way, so that we can keep track of what is going on, and fix problems if there are any.

Other than simply changing the name of the log file, everything else about the logging remains absolutely unchanged - it has the same format, the same behavior, and it records in exactly the same way. This is a small change that will help keep us organized as our scraper works through different stages of the process!

Database Functions

Now let's discuss the most important part of the scraping project; the database functions. We need functions that will take our scraper from a simple one-off script to a fully functional tool that can execute again and again without losing anything.

These functions allow us to track what the scraper has accomplished. By utilizing these functions, we are not simply running requests to grab data, we are building a system that remembers what URLs it has crawled through and that keeps track of thousands of product URLs and all the information we collect about those products - including names, prices, features - all safely stored in one tidy SQLite database file.

Another advantage of using a SQLite database versus a complicated server, is that it runs immediately, straight out of the box. SQLite is lightweight and ideal for a project like this. And these database functions are so much more than storing data; they also create an intelligent, organized environment that can deal with interruptions, resume where it left off, and keep our existing, collected data safe and accessible.

def connect_db(db_name="allmachines_products.db"):
    """
    Establish a connection to the SQLite database.
    
    Args:
        db_name (str): Name of the SQLite database file to connect to.
                      Defaults to "allmachines_products.db".
    
    Returns:
        sqlite3.Connection: An active connection to the specified SQLite database.
    
    Example:
        >>> conn = connect_db()
        >>> # Perform operations with conn
        >>> conn.close()
    """
    return sqlite3.connect(db_name)

At first glance, the connect_db() function might seem like no big deal — just a tiny piece of the code that connects to the database. But don’t let its simplicity fool you! This little function plays a huge role in keeping our project clean, consistent, and easy to manage.

By putting all the database connection logic in one place, we create a single point of control. That means the rest of our code can just call connect_db() whenever it needs to talk to the database — no need to repeat the same connection code over and over again.

Why is this a big deal? Imagine you decide to change the database file name, or even switch to a completely different database system someday. If your connection code is scattered everywhere, you’d have to dig through every file to update it. But with connect_db(), you just make the change once — and the rest of the project still works like a charm.

It’s a simple trick, but one that makes your code easier to maintain and way more flexible in the long run.

def ensure_scraped_column(conn):
    """
    Ensure the 'scraped' column exists in the product_links table.
    
    This function checks if the 'scraped' column exists in the product_links table.
    If it doesn't exist, the function adds the column with a default value of 0,
    representing that URLs have not been scraped yet.
    
    Args:
        conn (sqlite3.Connection): An active database connection.
    
    Raises:
        Exception: If there's an error checking for or adding the column.
        
    Notes:
        - Uses PRAGMA table_info to query table structure
        - Default value of 0 indicates not scraped
        - Logs success or failure of the operation
    """
    try:
        cursor = conn.cursor()
        cursor.execute("PRAGMA table_info(product_links)")
        columns = [row[1] for row in cursor.fetchall()]
        if 'scraped' not in columns:
            cursor.execute("ALTER TABLE product_links ADD COLUMN scraped INTEGER DEFAULT 0")
            conn.commit()
            logging.info("Added 'scraped' column to product_links table.")
    except Exception as e:
        logging.error(f"Error ensuring scraped column: {e}")
        logging.debug(traceback.format_exc())

The ensure_scraped_column() function is a great example of smart and safe coding — also known as defensive programming. Instead of assuming that everything in the database is already perfect, this function double-checks the setup before moving forward.

Here’s what it does: it looks at the structure of the database table (kind of like peeking under the hood) using a special SQLite feature called PRAGMA. If it finds that the “scraped” column is missing — which is important for tracking progress — it simply adds it. No fuss, no errors, no need for you to run a separate setup script. Pretty cool, right?

This makes the scraper super flexible. Whether it’s your very first run or you’re continuing after a break, the function makes sure everything is in place so the process can run smoothly. The "scraped" column it adds starts off with a value of 0 for each URL, meaning "not scraped yet." Later on, this little flag helps the scraper figure out exactly where it left off, so it can resume without repeating anything or missing a step.

It’s a small function with a big impact — making the whole system more reliable, user-friendly, and able to recover gracefully from interruptions.

def init_data_table(conn):
    """
    Initialize the product_data table to store scraped product details.
    
    Creates the product_data table if it doesn't exist. This table stores
    all the product information extracted from product pages, including
    metadata, descriptive content, and structured data in JSON format.
    
    Args:
        conn (sqlite3.Connection): An active database connection.
    
    Table Schema:
        - id: Primary key, auto-incremented integer
        - category: Text field for product category classification
        - url: Text field with unique constraint for product page URL
        - title: Text field for product title
        - description: Text field for product description
        - price: Text field for product price
        - tagline: Text field for product tagline
        - equipment_score: Text field for AllMachines Equipment Score
        - highlights: Text field storing product highlights as JSON
        - specifications: Text field storing product specifications as JSON
    
    Raises:
        Exception: If there's an error creating the table.
    """
    try:
        cursor = conn.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS product_data (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                category TEXT,
                url TEXT UNIQUE,
                title TEXT,
                description TEXT,
                price TEXT,
                tagline TEXT,
                equipment_score TEXT,  
                highlights TEXT, 
                specifications TEXT
            )
        """)
        conn.commit()
        logging.info("Initialized 'product_data' table.")
    except Exception as e:
        logging.error(f"Error initializing data table: {e}")
        logging.debug(traceback.format_exc())

def init_error_table(conn):
    """
    Initialize the error_urls table to store URLs that failed to scrape.
    
    Creates the error_urls table if it doesn't exist. This table tracks URLs
    that encountered errors during the scraping process, allowing for retry
    attempts or manual investigation.
    
    Args:
        conn (sqlite3.Connection): An active database connection.
    
    Table Schema:
        - id: Primary key, auto-incremented integer
        - category: Text field for product category
        - url: Text field with unique constraint for failed URL
    
    Raises:
        Exception: If there's an error creating the table.
    """
    try:
        cursor = conn.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS error_urls (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                category TEXT,
                url TEXT UNIQUE
            )
        """)
        conn.commit()
        logging.info("Initialized 'error_urls' table.")
    except Exception as e:
        logging.error(f"Error initializing error table: {e}")
        logging.debug(traceback.format_exc())

The init_data_table() and init_error_table() functions do more than just create database tables — they’re actually setting the stage for how we organize and manage the complex data we collect while scraping.

Let’s break it down.

The product_data table is built in a really smart way. For simple stuff like product names and prices, it uses regular database fields (think: neat columns like in a spreadsheet). But for more complicated information — like technical specs or a list of features — it uses JSON, a format that’s perfect for storing detailed or layered data. This mix gives us the best of both worlds: we can easily search and filter simple data, while still having room to store complex details without needing to force everything into a strict structure.

Then there’s the error_urls table — and this is a lifesaver! Sometimes, when scraping the web, a page might not load properly or its layout might suddenly change. Instead of crashing or giving up, the scraper calmly writes down that URL into this table so we can check it later or try again. It’s like keeping a list of “things to follow up on” instead of just forgetting them.

Together, these two functions make the scraper reliable, flexible, and resilient — all things you want when dealing with the ever-changing, sometimes unpredictable world of web pages. It’s a big step up from a basic script and makes the project feel much more professional and production-ready.

HTML Parsing Functions

The HTML parsing functions serve as the scrapers' brains - they are the ones really thinking and ripping off usable information from the untidy web pages. You can imagine a web page as a huge bowl of alphabet soup. The functions are the surgeons (or soup sleuths!) that are good at finding the right words or ingredients we care about that seem to float around aimlessly.

Each function has its own task - maybe one pulls out the product name, another, the price, while another might dig at the technical specifications. They were created to be intelligent and flexible, so even if the website changed a tiny bit (which happens all of the time!), the functions can typically still figure it out without breaking.

This section of the code is where the real magic happens - taking a pile of HTML and turning it into clean data we can use for whatever purpose, be it analysis, display, or storage. This is the step that takes chaos and creates clarity, ensuring that the information we have collected is accurate, complete, and easily used.

def parse_title(soup):
    """
    Extract the product title from the BeautifulSoup object.
    
    Args:
        soup (BeautifulSoup): Parsed HTML of the product page.
    
    Returns:
        str or None: The product title text if found, None otherwise.
    
    Notes:
        - Uses a specific CSS selector to locate the title element
        - Returns None if the element is not found or an error occurs
        - Strips whitespace from the extracted text
    """
    try:
        return soup.select_one("body > main > div > div.flex.items-start.justify-start.mt-2 > div > h1").text.strip() if soup.select_one("body > main > div > div.flex.items-start.justify-start.mt-2 > div > h1") else None
    except Exception as e:
        logging.error(f"Error parsing title: {e}")
        logging.debug(traceback.format_exc())
        return None

def parse_description(soup):
    """
    Extract the product description from the BeautifulSoup object.
    
    Args:
        soup (BeautifulSoup): Parsed HTML of the product page.
    
    Returns:
        str or None: The product description text if found, None otherwise.
    
    Notes:
        - Targets a specific div with class containing 'three-line-clamp'
        - Returns None if the element is not found or an error occurs
        - Strips whitespace from the extracted text
    """
    try:
        return soup.select_one("div.three-line-clamp.lg\:line-clamp-none.\!overflow-hidden > p").text.strip() if soup.select_one("div.three-line-clamp.lg\:line-clamp-none.\!overflow-hidden > p") else None
    except Exception as e:
        logging.error(f"Error parsing description: {e}")
        logging.debug(traceback.format_exc())
        return None

def parse_price(soup):
    """
    Extract the product price from the BeautifulSoup object.
    
    Args:
        soup (BeautifulSoup): Parsed HTML of the product page.
    
    Returns:
        str or None: The price text if found, None otherwise.
    
    Notes:
        - Uses CSS selector to find price element in a specific div structure
        - Returns None if the element is not found or an error occurs
        - Strips whitespace from the extracted text
    """
    try:
        return soup.select_one("div > div > div.mb-5 > div").text.strip() if soup.select_one("div > div > div.mb-5 > div") else None
    except Exception as e:
        logging.error(f"Error parsing price: {e}")
        logging.debug(traceback.format_exc())
        return None

There are some fundamental parsing functions used in the scraper such as: parse_title(), parse_description(), and parse_price(), which are among the most important parts of our code. The basic parsing functions in the scraper are the pieces of code that kill a web page and pull out the specific pieces of information that we care about the most — what the product is, what the product does, and how much it costs.

To accomplish this, we will use rules that are known as CSS selectors, the web equivalent of GPS, to tell the scraper exactly where the individual pieces of information are located on the webpage. Finding selectors can be an exercise in investigations - we really had to search through the underlying code of the AllMachines website and attempt each selector path to determine the best approaches to achieving success.

Although many of these functions appear quite simple, there was a lot of thinking behind them. Also, websites can change at any time (and they frequently change), so we put the calls to each function, wrapped in a safety net, called a try-except block. This way, if just a tiny thing breaks one function - like the product price is missing or moved - the whole scraper will not crash. Instead, it will log the problem for us to come back to later and then continue and keep going. This makes the scraper more robust and much less likely to fail just because one important aspect was missing.

def parse_tagline(soup):
    """
    Extract and format the product tagline from the BeautifulSoup object.
    
    This function locates the tagline div, extracts text from spans within it, 
    and formats them with a comma after the first span's text.
    
    Args:
        soup (BeautifulSoup): Parsed HTML of the product page.
    
    Returns:
        str or None: The formatted tagline text if found, None otherwise.
    
    Format Details:
        - If multiple spans exist: "{first_span_text}, {remaining_text}"
        - If only first span exists: just the first span text
        - If no spans but div exists: all text inside the div
    
    Notes:
        - Handles complex nested structure with multiple spans
        - Special formatting with comma after first span's text
        - Returns None if the target div is not found or an error occurs
    """
    try:
        tagline_div = soup.select_one("div.flex.flex-wrap.items-center.gap-2.mt-4.text-sm.leading-4")
        if tagline_div:
            # Extract all span elements inside the div
            spans = tagline_div.find_all("span")
            if spans:
                # Get the text of the first span
                first_span_text = spans[0].text.strip()
                # Get the remaining text inside the div (excluding the first span)
                remaining_text = tagline_div.get_text(separator=" ", strip=True).replace(first_span_text, "", 1).strip()
                # Combine the first span text with the remaining text, separated by a comma
                tagline = f"{first_span_text}, {remaining_text}" if remaining_text else first_span_text
                return tagline
            else:
                # If no spans are found, return all text inside the div
                return tagline_div.get_text(separator=" ", strip=True)
        return None
    except Exception as e:
        logging.error(f"Error parsing tagline: {e}")
        logging.debug(traceback.format_exc())
        return None

def parse_equipment_score(soup):
    """
    Extract the AllMachines Equipment Score from the BeautifulSoup object.
    
    Args:
        soup (BeautifulSoup): Parsed HTML of the product page.
    
    Returns:
        str or None: The equipment score text if found, None otherwise.
    
    Notes:
        - Targets a specific div with classes related to the score display
        - The score element has specific styling (semibold font, background color)
        - Returns None if the element is not found or an error occurs
    """
    try:
        score_element = soup.select_one("div.flex.items-center.gap-2.text-sm.font-semibold.w-fit.py-2.bg-white.relative.z-10")
        if score_element:
            return score_element.text.strip()
        return None
    except Exception as e:
        logging.error(f"Error parsing equipment score: {e}")
        logging.debug(traceback.format_exc())
        return None

def parse_highlights(soup):
    """
    Extract and format product highlights as a JSON array of key-value pairs.
    
    This function locates the highlights section, extracts key-value pairs from
    list items, and returns them as a JSON-formatted string.
    
    Args:
        soup (BeautifulSoup): Parsed HTML of the product page.
    
    Returns:
        str or None: JSON string containing highlight key-value pairs if found,
                    None otherwise.
    
    JSON Format:
        [{"key1": "value1"}, {"key2": "value2"}, ...]
    
    Notes:
        - Targets a specific section with class 'my-8.lg\:my-14.lg\:\!my-9.relative'
        - Each highlight is represented as a dictionary with a single key-value pair
        - Returns None if the highlights section is not found or an error occurs
    """
    try:
        highlights_section = soup.select_one("section.my-8.lg\:my-14.lg\:\!my-9.relative")  # Locate the section
        if highlights_section:
            highlights = []
            list_items = highlights_section.select("ul > li")  # Select all <li> elements
            for li in list_items:
                key = li.select_one("span.justify-self-center").text.strip() if li.select_one("span.justify-self-center") else None
                value = li.select_one("div.items-center.card-body").text.strip() if li.select_one("div.items-center.card-body") else None
                if key and value:
                    highlights.append({key: value})  # Add key-value pair to the list
            return json.dumps(highlights)  # Convert the list to a JSON string
        return None
    except Exception as e:
        logging.error(f"Error parsing highlights: {e}")
        logging.debug(traceback.format_exc())
        return None

def parse_specifications(soup):
    """
    Extract and format product specifications as a structured JSON array.
    
    This function locates the specifications section (third div with class 'scroll-mt-40'),
    extracts data from tables within it, and organizes it into a hierarchical JSON structure.
    
    Args:
        soup (BeautifulSoup): Parsed HTML of the product page.
    
    Returns:
        str or None: JSON string containing structured specification data if found,
                    None otherwise.
    
    JSON Format:
        [
            {
                "Table Title 1": [
                    {"Row Header 1": "Value 1"},
                    {"Row Header 2": "Value 2"}
                ]
            },
            {
                "Table Title 2": [
                    {"Row Header 1": "Value 1"},
                    {"Row Header 2": "Value 2"}
                ]
            }
        ]
    
    Notes:
        - Targets the third div with class 'scroll-mt-40'
        - Each table in the specifications section becomes a key in the output
        - Table headers become dictionary keys with row values
        - Returns None if the specifications section is not found or an error occurs
    """
    try:
        # Locate all divs with the class 'scroll-mt-40'
        specifications_sections = soup.select("div.scroll-mt-40")
        
        # Ensure there are at least three divs and select the third one
        if len(specifications_sections) >= 3:
            specifications_section = specifications_sections[2]  # Select the third div
            specifications = []
            
            # Find all tables within the third div
            tables = specifications_section.select("table.w-full")
            for table in tables:
                # Get the table title from the thead
                title = table.select_one("thead").get_text(strip=True) if table.select_one("thead") else None
                if not title:
                    continue  # Skip tables without a title

                # Parse the rows in the tbody
                rows = table.select("tbody > tr")
                table_data = []
                for row in rows:
                    key = row.select_one("th").get_text(strip=True) if row.select_one("th") else None
                    value = row.select_one("td").get_text(strip=True) if row.select_one("td") else None
                    if key and value:
                        table_data.append({key: value})  # Add key-value pair to the table data

                # Add the table title and its data to the specifications
                if table_data:
                    specifications.append({title: table_data})

            return json.dumps(specifications)  # Convert the list to a JSON string
        return None
    except Exception as e:
        logging.error(f"Error parsing specifications: {e}")
        logging.debug(traceback.format_exc())
        return None

Now let’s talk about the real magic—the advanced parsing functions. These parts of the scraper go beyond just grabbing simple text; they actually understand how information is organized on a web page and then carefully rebuild that structure into clean, usable data.

Take the parse_tagline() function, for example. Sometimes websites use multiple layers (like several <span> tags inside each other) to style or highlight different parts of a product’s tagline. This function smartly navigates through that tangled structure and extracts the full tagline exactly as it was meant to be seen.

Then there are functions like parse_highlights() and parse_specifications(). These are even more impressive. They don’t just pull random pieces of text—they figure out how everything fits together. Imagine looking at a table of features on a product page, with labels on the left and values on the right. These functions understand that relationship and convert the whole thing into something structured, like a neat dictionary or a JSON object. That way, when we later want to use this data—for analysis, filtering, or even showing it in an app—it’s already organized and easy to work with.

This process of turning messy, semi-organized website code into clean, structured data is one of the most powerful things our scraper does. It’s what turns a bunch of web pages into something truly useful.

def parse_product_page(html):
    """
    Parse the complete product page HTML to extract all relevant product data.
    
    This function serves as a centralized parser that coordinates the extraction
    of all product details by creating a BeautifulSoup object and passing it to
    specialized parsing functions for each data component.
    
    Args:
        html (str): Raw HTML content of the product page.
    
    Returns:
        tuple: A 7-tuple containing:
            - title (str or None): Product title
            - description (str or None): Product description
            - price (str or None): Product price
            - tagline (str or None): Product tagline
            - equipment_score (str or None): AllMachines Equipment Score
            - highlights (str or None): JSON string of product highlights
            - specifications (str or None): JSON string of product specifications
    
    Notes:
        - Creates a BeautifulSoup object using the 'html.parser'
        - Delegates extraction of each component to specialized functions
        - Returns None for any component that fails to parse
        - Logs any errors that occur during parsing
    """
    try:
        soup = BeautifulSoup(html, 'html.parser')
        title = parse_title(soup)
        description = parse_description(soup)
        price = parse_price(soup)
        tagline = parse_tagline(soup)
        equipment_score = parse_equipment_score(soup)
        highlights = parse_highlights(soup)
        specifications = parse_specifications(soup)  
        return title, description, price, tagline, equipment_score, highlights, specifications
    except Exception as e:
        logging.error(f"HTML parsing error: {e}")
        logging.debug(traceback.format_exc())
        return None, None, None, None, None, None, None

At the center of our scraping setup is a function called parse_product_page()—and honestly, it’s the real conductor of the orchestra.

Think of it like this: when you’re cooking a big meal, instead of trying to make every dish in one huge pot, you use different pans and tools for different tasks—one for boiling, one for frying, one for baking. That’s exactly what this function does! Rather than trying to handle everything itself, it lets other smaller, specialized functions (like parse_title(), parse_price(), and so on) do their specific jobs.

What makes this setup so smart is that it builds the BeautifulSoup object—the thing that reads and organizes the webpage’s HTML—only once. Then it passes this organized version of the page to each of the helper functions. This saves time and avoids doing the same work over and over.

Also, if anything changes on the website—for example, if the way product titles are displayed is updated—you only need to fix the one small function that handles titles. Everything else keeps working smoothly. That makes the whole scraper way easier to maintain and less likely to break.

And here’s another cool thing: if one part of the page is broken or weird, the scraper doesn’t crash. It just skips that piece and keeps going. This means we can still collect lots of useful information even from pages that aren’t perfect.

In short, parse_product_page() ties everything together in a neat, reliable way—and it’s a big reason why this scraper works so well behind the scenes.

Scraping Functions

The scraping functions are where all the action happens—they’re basically the heart of our whole system. This is the moment when the scraper goes out into the wild (a.k.a. the internet), visits real web pages, and starts collecting information.

Imagine a robot that not only knows how to browse a website like a human but can also pick out the exact details we care about—like product names, prices, and features—and then neatly save them into a database. That’s what these functions do. They’re smart enough to handle today’s modern, often tricky websites (some of which don’t even load all their content right away), and they do it over and over again across thousands of different pages.

What’s really impressive is how well all the parts work together—browser automation helps open and load the pages, HTML parsing figures out what information to grab, and the database stores everything safely. It’s like a carefully choreographed dance, where each step has to be in sync. If even one move goes off, things could fall apart. But thanks to this setup, the whole process runs smoothly and reliably.

def scrape_product_data(conn, url, category):
    """
    Scrape product data from a given URL and save it to the database.
    
    This function orchestrates the complete scraping process for a single URL:
    launching a browser, navigating to the page, extracting content, parsing data,
    and saving results to the database. It also handles error conditions and
    updates scraping status.
    
    Args:
        conn (sqlite3.Connection): An active database connection.
        url (str): The URL of the product page to scrape.
        category (str): The category of the product.
    
    Process Flow:
        1. Launch a headless Chromium browser using Playwright
        2. Navigate to the target URL and wait for page load
        3. Get the page content and parse product details
        4. If successful (title exists), save data and mark URL as scraped
        5. If unsuccessful, record the URL in the error table
        6. Close the browser
    
    Notes:
        - Uses Playwright's synchronous API for browser automation
        - Sets a generous 60s timeout for page navigation
        - Includes a 3-second wait for JavaScript-loaded content
        - Records failed URLs for potential retry
    """
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)  # Launch browser
            context = browser.new_context()
            page = context.new_page()

            logging.info(f"Scraping URL: {url}")
            page.goto(url, timeout=60000)  # Navigate to the URL
            time.sleep(3)  # Wait for the page to load

            html = page.content()  # Get the page content
            title, desc, price, tagline, equipment_score, highlights, specifications = parse_product_page(html)  # Parse the product details

            if title:
                save_product_data(conn, category, url, title, desc, price, tagline, equipment_score, highlights, specifications)  # Save product data
                mark_as_scraped(conn, url)  # Mark the URL as scraped
            else:
                save_error_url(conn, category, url)  # Save the URL to the error table

            browser.close()  # Close the browser
    except Exception as e:
        logging.error(f"Error scraping URL {url}: {e}")
        logging.debug(traceback.format_exc())
        save_error_url(conn, category, url)  # Save the URL to the error table in case of failure

The scrape_product_data() function is where our scraper really shows off what it can do. Instead of just grabbing raw webpage code (like simple scrapers that send basic requests), this one uses a powerful tool called Playwright. Think of Playwright like a mini web browser that our program controls—it opens the webpage, waits for everything to load (just like how you’d wait for all the pictures and buttons to appear when you open a site), and then gets to work collecting the data.

It even runs JavaScript—just like a real browser—so we don’t miss out on content that loads a little later or gets added by scripts running in the background. This is super helpful for modern websites that don’t show everything right away.

The function is designed to be smart and responsible too. It opens the browser, does its job, and then neatly closes everything—even if something goes wrong. This avoids wasting memory or accidentally leaving background tasks running.

There are also some clever timing tricks in place. For example, the scraper is patient—it gives each page up to 60 seconds to load (just in case the internet is slow or the site is heavy), and then waits an extra 3 seconds to make sure any late-loading bits of the page have time to show up before we start extracting data. This little detail really helps us get complete and accurate results every time.

def save_product_data(conn, category, url, title, description, price, tagline, equipment_score, highlights, specifications):
    """
    Save the scraped product data to the database.
    
    Inserts all product information into the product_data table. Uses INSERT OR IGNORE
    to handle potential duplicate URLs gracefully (avoiding constraint violations).
    
    Args:
        conn (sqlite3.Connection): An active database connection.
        category (str): The category of the product.
        url (str): The URL of the product page.
        title (str): The product title.
        description (str): The product description.
        price (str): The product price.
        tagline (str): The product tagline.
        equipment_score (str): The AllMachines Equipment Score.
        highlights (str): JSON string of product highlights.
        specifications (str): JSON string of product specifications.
    
    Notes:
        - Uses INSERT OR IGNORE to handle potential URL uniqueness constraint violations
        - Logs success or failure of the insertion operation
        - Commits the transaction to ensure data is saved
    """
    try:
        cursor = conn.cursor()
        cursor.execute("""
            INSERT OR IGNORE INTO product_data (category, url, title, description, price, tagline, equipment_score, highlights, specifications)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (category, url, title, description, price, tagline, equipment_score, highlights, specifications))
        conn.commit()
        logging.info(f"Saved product data for URL: {url}")
    except Exception as e:
        logging.error(f"Error saving product data for {url}: {e}")
        logging.debug(traceback.format_exc())

The database functions that support our scraper are designed with smart data-handling habits. One great example is the save_product_data() function. This function takes the product details we just scraped and makes sure they're safely saved into our database.

But here’s the clever part: it uses a special SQL trick called "INSERT OR IGNORE". That means if we’ve already saved data for a certain product (maybe because we ran the scraper before or had to restart it), the database won’t freak out or crash—it’ll just skip over that entry and move on. No duplicates, no errors.

This makes our scraper more reliable and allows it to pause, restart, or rerun without messing up the data or starting from scratch. It’s a small detail, but it’s what makes the whole system more professional and stress-free to use.

def mark_as_scraped(conn, url):
    """
    Mark a URL as successfully scraped in the product_links table.
    
    Updates the 'scraped' column to 1 for the specified URL to indicate
    that it has been successfully processed and should not be scraped again.
    
    Args:
        conn (sqlite3.Connection): An active database connection.
        url (str): The URL to mark as scraped.
    
    Notes:
        - Sets the 'scraped' flag to 1 to indicate successful processing
        - Commits the transaction to ensure data is saved
        - Logs success or failure of the update operation
    """
    try:
        cursor = conn.cursor()
        cursor.execute("UPDATE product_links SET scraped = 1 WHERE url = ?", (url,))
        conn.commit()
        logging.info(f"Marked as scraped: {url}")
    except Exception as e:
        logging.error(f"Error updating scraped status for {url}: {e}")
        logging.debug(traceback.format_exc())

The mark_as_scraped() function plays an important role in helping our scraper keep track of its progress. After it finishes collecting data from a product page, this function updates the database to mark that URL as "scraped."

This is done by changing a special flag (which we set up earlier) from 0 to 1. It’s a simple idea, but very powerful—it tells the system, “Hey, we already got data from this one, no need to visit again.”

Thanks to this tracking system, the scraper can easily pause and resume without repeating work or missing anything. It’s like crossing off items on a checklist, ensuring nothing gets lost or done twice.

def save_error_url(conn, category, url):
    """
    Record a URL that failed to scrape in the error_urls table.
    
    Inserts the failed URL and its category into the error_urls table
    for potential retry or manual investigation.
    
    Args:
        conn (sqlite3.Connection): An active database connection.
        category (str): The category of the product.
        url (str): The URL that failed to scrape.
    
    Notes:
        - Uses INSERT OR IGNORE to handle potential URL uniqueness constraint violations
        - Logs the operation with WARNING level (not as serious as ERROR)
        - Commits the transaction to ensure data is saved
    """
    try:
        cursor = conn.cursor()
        cursor.execute("""
            INSERT OR IGNORE INTO error_urls (category, url)
            VALUES (?, ?)
        """, (category, url))
        conn.commit()
        logging.warning(f"Saved error URL: {url}")
    except Exception as e:
        logging.error(f"Error saving to error_urls for {url}: {e}")
        logging.debug(traceback.format_exc())

The save_error_url() function shows a smart way to handle problems when something goes wrong. Instead of just writing an error message and moving on, this function actually saves the problem URL into a special database table. This means you can easily go back later, see which pages failed, and try scraping them again or figure out what caused the issue.

Together with the other two database functions (save_product_data() and mark_as_scraped()), it follows best practices by always saving changes (committing) after updating the database. This helps protect your data even if the program crashes partway through. On top of that, all three functions use consistent error logging, which makes it much easier to understand what went wrong and fix it—just like a well-prepared toolkit for real-world issues.

Main Driver Functions

The main driver functions act like the directors of the entire scraping project. They don’t just run other parts of the code—they carefully guide when and how each part should work, making sure everything happens in the right order. These functions are what bring the whole system together, turning many smaller pieces into one smooth and automated process.

Instead of just being a bunch of separate tools, the scraper works like a well-organized machine because of these drivers. They control the flow of the program from start to finish, making the system reliable, easy to understand, and ready for repeated use.

def get_unscraped_urls(conn):
    """
    Retrieve all unscraped URLs from the product_links table.
    
    Queries the database for all URLs that have not been scraped yet
    (where scraped = 0) along with their categories.
    
    Args:
        conn (sqlite3.Connection): An active database connection.
    
    Returns:
        list: A list of tuples, each containing (category, url) for unscraped URLs.
    
    Query Details:
        - Selects category and URL from product_links table
        - Filters for records where scraped = 0
        - Returns all matching records as a list of tuples
    """
    cursor = conn.cursor()
    cursor.execute("SELECT category, url FROM product_links WHERE scraped = 0")
    return cursor.fetchall()

The get_unscraped_urls() function is a smart way to manage the scraping workflow. Instead of going through all the URLs every time or keeping track of progress manually, this function checks the database and returns only the URLs that haven’t been scraped yet. You can think of it like a to-do list that updates itself automatically.

This design has some great benefits: you can stop the scraper and restart it later without losing any progress; you can run multiple scraping sessions without repeating work; and the scraper always knows what’s left to do. The function itself is short and simple, but very powerful—it just runs one SQL query to find the right URLs based on a “scraped” flag we added earlier. This shows how smart database planning at the beginning can make everything else easier and more reliable later.

def main():
    """
    Main function to orchestrate the entire scraping process.
    
    This function serves as the entry point for the script and coordinates
    the overall workflow:
    
    1. Connect to the database
    2. Ensure required database structure (tables, columns)
    3. Retrieve unscraped URLs
    4. Process each URL to extract product data
    5. Clean up resources
    
    Process Flow:
        1. Establish database connection
        2. Ensure database has required structure (ensure_scraped_column)
        3. Initialize product_data and error_urls tables
        4. Retrieve list of unscraped URLs from database
        5. Iterate through URLs, scraping each one
        6. Close database connection when complete
    
    Notes:
        - Logs the number of URLs found for processing
        - Logs completion of the scraping process
        - Database connection is properly closed after processing
    """
    conn = connect_db()  # Connect to the database
    ensure_scraped_column(conn)  # Ensure the 'scraped' column exists
    init_data_table(conn)  # Initialize the product_data table
    init_error_table(conn)  # Initialize the error_urls table

    urls = get_unscraped_urls(conn)  # Get all unscraped URLs
    logging.info(f"Found {len(urls)} unscraped URLs to process.")

    for category, url in urls:
        scrape_product_data(conn, url, category)  # Scrape each URL

    conn.close()  # Close the database connection
    logging.info("Scraping finished for all URLs.")

if __name__ == "__main__":
    main()

The main() function acts like the conductor of an orchestra, making sure every part of the scraper works together in the right order. It follows a clear three-step structure: setup, processing, and cleanup.

First, in the setup phase, it connects to the database and makes sure all the necessary tables and columns are ready. Then, in the processing phase, it gets the list of URLs that still need to be scraped and goes through them one by one—loading each page, extracting the data, and saving it. Finally, in the cleanup phase, it closes the database connection to free up system resources.

This clean structure makes the function easy to understand and maintain. If you ever want to add new steps—like sending notifications or saving logs—you can easily do it in the right section without breaking anything. Even though the main() function looks simple, it’s the key piece that ties everything together and helps new developers quickly understand how the whole scraping system works.

Conclusion

This AllMachines web scraper is much more than just a script—it’s a full-featured data collection system built with care, precision, and real-world usability in mind. Instead of just grabbing a few lines of text, it turns messy web pages into clean, structured data that’s ready for analysis or use in other apps.

The smart design is visible in every part of the scraper. If something goes wrong during scraping, the program doesn’t crash—it logs the issue clearly and keeps going. The database is designed not only to store data efficiently but also to keep track of what’s already been scraped, which makes the scraper perfect for long-term, large-scale use.

One of the standout strengths is the modular approach: each function focuses on one task, making the code easier to read, maintain, and update as the website changes. Using Playwright to control a real browser and BeautifulSoup to read the HTML means this tool handles everything from JavaScript-heavy pages to tricky nested data.

Most importantly, this scraper is built like a professional tool. It can pause and resume, logs any problems, and keeps your data safe with smart transaction handling. That makes it ideal for scraping thousands of products over time. As the AllMachines website updates its products and prices, this scraper is ready to keep your dataset current—making it a powerful tool for market research, business analysis, or anything else you might need.

AUTHOR

I’m Shahana, a Data Engineer at Datahut, where I build reliable and scalable data pipelines that turn unstructured web content into clean, usable datasets—especially for use cases like e-commerce, product intelligence, and market research.

In this blog, I walked through a real-world scraping project where we collected detailed product information from AllMachines, a site that lists farming equipment. Using tools like Playwright, BeautifulSoup, and SQLite, we created a scraper that handles dynamic pages, avoids detection with smart user agent handling, and stores everything in an organized format for future analysis.

At Datahut, we focus on building web scraping solutions that are not just effective, but also practical and robust—designed to handle real-world websites, recover from errors, and scale as needed.

If your team is looking to automate product data collection in the eyewear space or beyond, reach out to us through the chat widget on the right. We’d love to help you build a solution that fits your goals.

FAQ section

FAQ 1: What is web scraping and how does it work?

Web scraping is the process of automatically extracting data from websites using scripts or tools. It allows you to collect information like product titles, prices, and descriptions from multiple pages efficiently.

👉 Learn more about what web scraping is and how it works.

FAQ 2: Which tools are used for scraping data from AllMachines?

In this project, we used Playwright and BeautifulSoup — two powerful Python libraries for handling dynamic websites and parsing HTML content.Explore our Python web scraping tutorial to understand how to use these tools effectively.

FAQ 3: How do you handle infinite scrolling while web scraping?

Websites like AllMachines may use infinite scrolling to load more products dynamically. To scrape such pages, you can use tools like Playwright to simulate user scrolling. Read how to build smart and resilient web scrapers for dynamic websites that handle infinite scrolling seamlessly.

FAQ 4: How do you prevent being blocked during web scraping?

Many websites detect and block bots during scraping. You can prevent this by rotating proxies, adding random delays, and mimicking human browsing patterns.Check out our expert guide on how to maintain anonymity when web scraping at scale for best practices.

FAQ 5: What are common challenges faced in web scraping projects?

Some of the biggest challenges in web scraping include dynamic website structures, CAPTCHAs, and legal compliance. Each of these can be managed with the right setup and ethical data collection approach. Discover more about web scraping challenges you need to know.