Ambily Biju

Decoding Fashion: A Beginner's Guide to Scraping Zara's Online Catalog


Zara's online catalog

Zara is one of the world's foremost fashion retailers, known for its stylish collections of women's clothing, and it draws customers from around the globe. This web scraping project aims to systematically gather detailed information about women's clothing items on the Zara website, collecting comprehensive data on prices, product features, and specifications that are valuable for market analysis and consumer insights.


The web scraping process is executed in two major steps. The first step is product URL scraping: extracting links to the individual product pages from Zara's category pages. This is the foundational step, as it gives us access to detailed product information. We then scrape product data from each collected URL, capturing key attributes such as:


  • Product URL: The direct link to the product page

  • Title: The name of the clothing item

  • Original Price: The initial price before any discounts

  • Discount Price: The current price after discounts

  • Discount Percentage: The percentage of the discount offered

  • Description: A detailed description of the product

  • Outer Shell: Information about the material used for the outer layer

  • Recycled Material: Details on whether the product includes recycled materials

  • Care Instructions: Guidelines on how to maintain the clothing item

  • Origin: The country of manufacture


Understanding the Role of Key Python Libraries in Zara Web Scraping

BeautifulSoup from bs4


BeautifulSoup is a powerful library for parsing HTML and XML documents. It simplifies navigating an HTML structure and retrieving specific information from a web page. In this project, BeautifulSoup is used to extract product URLs from Zara's category pages and detailed product data from the individual product pages.
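
As a quick illustration, here is a minimal sketch of the parsing pattern used throughout this project. The HTML fragment below is made up for demonstration and is not Zara's real markup:

from bs4 import BeautifulSoup

# A tiny, hypothetical HTML fragment containing a product link
html = '<a class="product-link" href="https://www.zara.com/example-product.html">Example dress</a>'

soup = BeautifulSoup(html, 'html.parser')

# Select anchors by class and read their href attributes
for anchor in soup.select('a.product-link'):
    print(anchor.get('href'))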


Playwright (playwright.async_api)


Playwright is a library for automating dynamic web pages, especially those that rely heavily on JavaScript for rendering. Its asynchronous API (async_playwright) lets the browser load content, simulate scrolling, and interact with pages the way a real user does, which makes it ideal for loading additional products and navigating pages while extracting product URLs. Playwright's ability to mimic user behavior, such as waiting for a page to load, also helps this web scraper evade detection.
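
The snippet below is a minimal sketch of this pattern: open a Chromium browser, load a page, and return the rendered HTML. It is a simplified version of what the scraper does later, not the project's exact code:

import asyncio
from playwright.async_api import async_playwright

async def fetch_html(url):
    # Launch Chromium, render the page, and return its HTML
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html

# Example usage (any URL works here):
# html = asyncio.run(fetch_html("https://www.zara.com"))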


Asyncio:


Asyncio is a standard Python library that supports asynchronous programming using the async and await syntax. It allows scraping tasks such as loading pages, scrolling, and extracting URLs to run concurrently, improving overall efficiency and performance. With asyncio, both the product link scraping and the product data extraction are non-blocking, which speeds up the execution of multiple tasks.
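
As a toy example of the syntax (unrelated to Zara's pages), asyncio.gather can run several coroutines concurrently; in this project the same async/await machinery keeps page loads and waits from blocking the rest of the program:

import asyncio

async def fake_scrape(url):
    # Sleeping stands in for a real page load
    await asyncio.sleep(1)
    return url

async def demo():
    # The three "scrapes" overlap instead of running one after another
    results = await asyncio.gather(*(fake_scrape(u) for u in ["url-1", "url-2", "url-3"]))
    print(results)

asyncio.run(demo())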


Random:


The random module introduces randomness into the scraping process by creating arbitrary delays between requests, making the scraping behavior appear more human-like and reducing the chance of triggering anti-scraping measures. In the product data scraping step, random.choice() selects a random user agent from a list so that each request appears to originate from a different browser or device, making detection even less likely.


Time:


The time module is used to pause execution between web requests. Pausing gives the website enough time to respond and prevents the server from being overloaded, which keeps the interaction smooth. It helps us scrape websites respectfully and ethically, making the entire process more stable.
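
A small sketch of how these two modules work together to pace the requests (the 2 to 5 second range mirrors the delay used later in the URL scraping step):

import random
import time

for _ in range(3):
    # ... make a request here ...
    delay = random.uniform(2, 5)   # a different delay every time
    print(f"Sleeping for {delay:.1f} seconds")
    time.sleep(delay)              # pause before the next request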


SQLite3:


SQLite3 is a lightweight database engine that ships with the Python standard library as the sqlite3 module. It comes in handy for storing scraped data in a structured manner.


Why SQLite Outperforms CSV for Web Scraping Projects

SQLite is an excellent choice for storing scraped data because of its simplicity, reliability, and efficiency. It is a self-contained, serverless, zero-configuration database that integrates easily into a web scraping workflow without the overhead of setting up and maintaining a larger database system. Since it stores data directly on the file system, it is ideal for scraping, where large numbers of URLs and records need to be saved locally and quickly. Its lightweight design allows fast retrieval with simple queries, making it far more practical than CSV files, which become cumbersome and slow once the data grows large.


One of SQLite's most valuable features for web scraping is that it handles interruptions gracefully. By tracking the scraping status of each URL in the database, we always know which URLs have been scraped successfully and which have not. If scraping is interrupted by a network failure or a system crash, we can resume from where we left off without repeating URLs that are already done. This is especially helpful when scraping large datasets, where losing all progress would otherwise be costly.
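
The resume logic boils down to a status column and two queries. Here is a minimal sketch, using the scraping_status column that Step 2 of this project adds to the zara_product_urls table:

import sqlite3

conn = sqlite3.connect('zara_webscraping.db')
cursor = conn.cursor()

# Fetch only the URLs that have not been scraped yet (status 0),
# so an interrupted run never starts over from the beginning.
cursor.execute(
    "SELECT product_url FROM zara_product_urls WHERE scraping_status = 0"
)
pending = [row[0] for row in cursor.fetchall()]

for url in pending:
    # ... scrape the product here ...
    # Mark the URL as done so it is skipped on the next run.
    cursor.execute(
        "UPDATE zara_product_urls SET scraping_status = 1 WHERE product_url = ?",
        (url,)
    )
    conn.commit()

conn.close()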


Data Extraction Workflow


STEP 1: Product URL Scraping From Category Pages


This code extracts product URLs from category pages using Playwright for browser automation and BeautifulSoup for HTML parsing. It starts by reading category URLs from a text file; for every category, it launches a browser to load the page and then scrolls down to load more content. Each extracted product URL is added to a set, which prevents duplicates.


Once all URLs are collected, they are saved into an SQLite database, which skips duplicates thanks to a uniqueness constraint. Random delays between requests keep the load on the site reasonable. Finally, the count of unique URLs is printed and the database connection is closed. The implementation is fully asynchronous, which makes it efficient at scraping infinitely scrolling category pages that can grow quite large.


Setting Up the Environment

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright
import asyncio
import random
import time
import sqlite3

Imported necessary libraries: BeautifulSoup for parsing HTML, async_playwright for browser automation, asyncio for asynchronous tasks, random for generating delays, time for timing functions, and sqlite3 for database management.


Reading Category URLs from Text File

def read_category_urls(file_path):
    """
    Reads category URLs from a specified text file.

    This function opens a text file containing URLs, reads 
    each line, and returns a list of URLs after stripping any
    leading or trailing whitespace.

    Args:
        file_path (str): The path to the text file containing
                         the category URLs.

    Returns:
        list: A list of strings, where each string is a URL read
              from the file. The list will be empty if the file
              itself is empty.

    Raises:
        FileNotFoundError: If the specified file does not exist.
    """
    with open(file_path, 'r') as f:
        urls = [line.strip() for line in f.readlines()]
    return urls

The read_category_urls function reads the category URLs from the provided text file. It takes a single argument, file_path, the path of the text file holding the list of category URLs. It opens the file in read mode, reads every line, and strips any leading or trailing whitespace so that only clean URL entries are used, compiling them into a list. The function then returns this list of URLs for use in the subsequent steps. If the file is empty, the function returns an empty list; if the file does not exist, open() raises a FileNotFoundError.
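
For example, if data/category_urls.txt held one category URL per line (the URLs below are placeholders, not real Zara links), the function would return them as a plain Python list:

# data/category_urls.txt
# https://www.zara.com/<your-region>/woman-dresses-category.html
# https://www.zara.com/<your-region>/woman-tops-category.html

category_urls = read_category_urls('data/category_urls.txt')
print(f"Loaded {len(category_urls)} category URLs")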


Extracting Unique Product URLs From Web Pages

async def extract_product_urls(page, url_set):
    """
    Extracts product URLs from the given page and adds unique URLs 
    to the provided set.

    This asynchronous function retrieves the HTML content of a page, 
    parses it using BeautifulSoup, and extracts URLs of products 
    found in anchor tags with the class 'product-link'. It checks for 
    duplicates against a provided set and only adds unique URLs.

    Args:
        page (Page): The Playwright page object representing the 
                     current web page from which to extract URLs.
        url_set (set): A set to store unique product URLs, preventing 
                       duplicates from being added.

    Returns:
        int: The total number of unique product URLs extracted and added 
             to the set during the execution of this function.
    """
    soup = BeautifulSoup(await page.content(), 'html.parser')
    total_urls = 0
    for anchor in soup.select('a.product-link'):
        href_url = anchor.get('href')
        if href_url and href_url not in url_set:
            url_set.add(href_url)
            total_urls += 1
    return total_urls

The extract_product_urls function is asynchronous and uses both Playwright and BeautifulSoup to fetch product URLs from a webpage. It takes two arguments: page, a Playwright page object representing the current page, and url_set, a set that stores unique product URLs to prevent duplicates. Inside the function, the page's HTML content is retrieved with page.content() and parsed with BeautifulSoup into a structured representation of the HTML. The function then searches for anchor tags with the class 'product-link' to collect product URLs. Each URL is checked against url_set; if it is new, it is added to the set and the count of unique URLs is incremented. Finally, the function returns how many unique product URLs were added, so we always know how many new URLs were found on each pass.


Scrolling the Web Page to Load More Content

async def scroll_page(page):
    """
    Scrolls the webpage down to load additional content.

    This asynchronous function simulates a user scrolling to the bottom 
    of the page by executing JavaScript to set the scroll position to 
    the height of the document. It then waits for a specified duration 
    to allow new content to load.

    Args:
        page (Page): The Playwright page object representing the 
                     current webpage that is being scrolled.

    Returns:
        None: This function does not return a value.
    """
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(2000)

The asynchronous scroll_page function makes the page scroll down so that more content loads. It accepts one argument, page, the Playwright page object representing the current page. Inside the function, it executes window.scrollTo(0, document.body.scrollHeight) via JavaScript to jump to the bottom of the page, which triggers loading of content that was not initially in view, and then waits two seconds with page.wait_for_timeout(2000) so the new content can finish loading. The function does not return anything; its job is simply to make sure all available content on the page is visible for further processing.


Scraping Product URLs From Category Pages

async def scrape_category_page(url, url_set):
    """
    Scrapes a single category page to extract product URLs and returns the
    total number of unique URLs scraped.

    This asynchronous function uses Playwright to launch a browser, navigate
    to the specified category page, and continuously scrolls down to load 
    more content. It extracts product URLs from the page until there are no
    more new URLs to scrape. Unique URLs are added to the provided set to 
    avoid duplicates.

    Args:
        url (str): The URL of the category page to be scraped.
        url_set (set): A set to store unique product URLs extracted 
                       from the page.

    Returns:
        int: The total number of unique product URLs scraped from the
        category page.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url)

        total_urls = 0
        while True:
            total_urls += await extract_product_urls(page, url_set)
            await scroll_page(page)

            if not await page.evaluate(
                '(window.innerHeight + window.scrollY) < document.body.offsetHeight'):
                break

        await browser.close()

    return total_urls

The scrape_category_page function collects unique product URLs from Zara's category pages using Playwright. It accepts two parameters: url, the address of the category page to scrape, and url_set, the set of unique product URLs that prevents repetition. The function first launches a browser instance inside the async_playwright context, opens a new page, and navigates to the given URL.

Inside a loop, the function repeatedly extracts product URLs from the page by calling extract_product_urls, which adds new URLs to url_set. After each extraction it calls scroll_page to scroll down and load more content. The loop continues until the current scroll position reaches the bottom of the page, which means there are no new URLs left to load. Once all unique URLs have been gathered, the browser is closed and the total number of unique URLs scraped is returned. This approach lets the function fetch all the product links on the page, even though they are loaded dynamically as the user scrolls down.

Establishing Connection to SQLite Database

def connect_db(db_name):
    """
    Establishes a connection to the SQLite database and creates the
    necessary table if it doesn't exist.
    
    This function connects to the specified SQLite database. If the
    database file does not exist, it  will be created. Additionally, 
    it ensures that the table for storing product URLs is created if 
    it is not already present, preventing duplication of table creation.

    Args:
        db_name (str): The name of the SQLite database file to connect to.

    Returns:
        Connection: A SQLite connection object that can be used to 
                    interact with the database.
    """
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS zara_product_urls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            product_url TEXT UNIQUE
        )
    ''')
    conn.commit()
    return conn

connect_db establishes a connection to an SQLite database and, if the table for product URLs has not been created before, creates it. The function takes the name of the database file as its argument (db_name).

It then creates a cursor, which acts as the bridge between the function and the database and is used to run SQL commands. The function creates a table called zara_product_urls for storing product URLs. The table has two columns: id, an automatically incrementing integer that is unique for every entry, and product_url, which contains the actual URL and carries a UNIQUE constraint to avoid duplicate entries.

Storing Unique Product URLs in SQLite

def save_urls_to_db(conn, url_set):
    """
    Inserts unique product URLs from a set into the SQLite database.

    This function takes a set of product URLs and inserts them into the 
    `zara_product_urls` table in the SQLite database. If a URL already 
    exists in the database, it will be skipped to avoid duplication.

    Args:
        conn (Connection): A SQLite connection object used to interact 
                           with the database.
        url_set (set): A set of unique product URLs to be inserted into 
                       the database.

    Returns:
        None: This function does not return a value.
    """
    cursor = conn.cursor()
    for url in url_set:
        try:
            cursor.execute(
                "INSERT INTO zara_product_urls (product_url) VALUES (?)",
                  (url,)
                  )
        except sqlite3.IntegrityError:
            # Skip duplicate entries
            continue
    conn.commit()

The save_urls_to_db function saves the unique product URLs from a set into the SQLite database. It accepts two arguments: conn, the connection object used to communicate with the database, and url_set, the set of unique product URLs.

Inside the function, it creates a cursor to execute SQL commands and then processes each URL in url_set one by one, attempting to insert it as a record into the zara_product_urls table. If a URL already exists in the table, the insert violates the uniqueness constraint on the product_url column and sqlite3 raises an IntegrityError; the function catches this error and simply moves on to the next URL, skipping the duplicate. After trying to insert all the URLs, the function commits its changes so that every new entry is actually saved. In this way, the database stays free of duplicated entries.


Main Function for URL Scraping Process

async def main():
    """
    The main entry point for the URL scraping process.

    This asynchronous function orchestrates the scraping of product URLs 
    from specified category pages. It reads category URLs from a text file, 
    scrapes each category page to extract unique product URLs, and saves 
    those URLs into an SQLite database. It ensures a random delay between 
    requests to avoid overloading the server.

    Steps performed:
    1. Reads category URLs from 'data/category_urls.txt'.
    2. Initializes a set to store unique product URLs.
    3. Iterates over each category URL to scrape product links.
    4. Connects to the SQLite database and saves the unique URLs.
    5. Closes the database connection.

    Returns:
        None: This function does not return a value. It prints the total 
              number of unique product URLs scraped at the end.
    """
    urls = read_category_urls('data/category_urls.txt')
    unique_urls = set()

    total_urls = 0
    for url in urls:
        total_urls += await scrape_category_page(url, unique_urls)
        time.sleep(random.uniform(2, 5))  # Random delay between 2 to 5 seconds

    # Connect to SQLite database and save the URLs
    conn = connect_db('zara_webscraping.db')
    save_urls_to_db(conn, unique_urls)
    conn.close()

    print(f"Total unique product URLs scraped: {total_urls}")

This is the main function, where URL scraping begins. It is an asynchronous function that coordinates several tasks to obtain product URLs from multiple category pages of the website.


It reads the category URLs from the text file 'data/category_urls.txt'. It then defines the set unique_urls, which stores one entry per product link and prevents any URL from being saved twice. In a loop over the category URLs, it calls scrape_category_page to collect the product links for each category, and the variable total_urls accumulates how many unique URLs were gathered across all pages. A random delay between page scrapes avoids sending too many requests in quick succession. Once all category pages have been scraped, it connects to the SQLite database zara_webscraping.db, saves the unique URLs via save_urls_to_db, and closes the connection. Finally, it prints the total count of unique product URLs scraped, summing up the whole process.


This ensures the scraping happens in an orderly, efficient, and well-organized manner, with respect for server resources.

Script Entry Point for Asynchronous URL Scraping

if __name__ == '__main__':
    """
    Entry point for the URL scraping application.

    This block checks if the script is being run as the main module. 
    If so, it executes the `main` function asynchronously to initiate 
    the URL scraping process. This ensures that the scraping logic is 
    executed when the script is run directly, while allowing it to 
    be imported without running the scraping process.

    Returns:
        None: This block does not return a value; it simply triggers 
              the execution of the main asynchronous function.
    """
    asyncio.run(main())

The if __name__ == '__main__': block is the entry point of the URL scraping script. It distinguishes between the script being run directly as the main program and being imported as a module by another script. When run directly, it executes the main function with asyncio.run(main()). The block itself returns nothing; it simply triggers the main function, which owns all the scraping activity.

STEP 2: Product Data Scraping From URLs


This script scrapes detailed data about Zara products, saves the results into the SQLite database, and records both successful and failed scraping attempts. The database setup creates three tables: zara_product_urls, which stores the URLs together with their scraping status; zara_product_data, which stores product information such as title, prices, description, and materials; and failed_urls, which keeps the list of URLs where scraping failed. The scraping part of the code uses Playwright to visit each webpage asynchronously with a random user agent to simulate a real browser and retrieve the HTML content. BeautifulSoup then extracts the product details: title, description, original and discounted prices, materials such as the outer shell and any recycled content, care instructions, and the country of origin. These details are inserted into the database; if a failure occurs, the URL is logged in the failed_urls table. Utility functions are also provided for fetching pending URLs, updating the scraping status, and other supporting database operations.


Imported Libraries

import asyncio
import random
import sqlite3
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

Imported the necessary libraries for asynchronous browser automation to scrape web data and store it in a database.

Initializing SQLite Database for Zara Product Storage

# Database setup and modification functions
def setup_database(db_name):
    """
    Sets up the database for storing Zara product URLs and scraped data.

    This function connects to a SQLite database and ensures that the
    necessary tables are created for storing product URLs, scraping statuses, 
    product data, and failed scraping attempts. Specifically, it creates:
    - A `zara_product_urls` table to store product URLs and their scraping 
      status, with the scraping_status defaulting to 0.
    - A `zara_product_data` table for storing detailed product information 
      such as title, prices, description, and material details.
    - A `failed_urls` table to record URLs where scraping attempts failed.

    Args:
        db_name (str): The name of the SQLite database file.

    Returns:
        sqlite3.Connection: A connection object to the SQLite database.

    Tables created:
        - `zara_product_urls`: Stores product URLs and a scraping status.
        - `zara_product_data`: Stores detailed product information.
        - `failed_urls`: Stores URLs where scraping failed.
    """
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()

    # Add scraping_status column to zara_product_urls if it doesn't exist
    cursor.execute('''CREATE TABLE IF NOT EXISTS zara_product_urls (
                        product_url TEXT PRIMARY KEY,
                        scraping_status INTEGER DEFAULT 0
                      )''')

    # Create product data table for storing the scraped information
    cursor.execute('''CREATE TABLE IF NOT EXISTS zara_product_data (
                        product_url TEXT PRIMARY KEY,
                        title TEXT,
                        original_price TEXT,
                        discount_price TEXT,
                        discount_percentage TEXT,
                        description TEXT,
                        outer_shell TEXT,
                        recycled_material TEXT,
                        care TEXT,
                        origin TEXT
                      )''')

    # Create failed_urls table for storing failed scraping attempts
    cursor.execute('''CREATE TABLE IF NOT EXISTS failed_urls (
                        url TEXT PRIMARY KEY
                      )''')

    conn.commit()
    return conn

The setup_database function initializes the SQLite database. When called, it connects to the specified database file; if no such file exists yet, one is created automatically. It then sets up three tables:

  • zara_product_urls: Stores the product URLs along with a scraping status that indicates whether each URL has been processed. The status defaults to 0, meaning "pending", i.e. the URL has not been scraped yet.

  • zara_product_data: Stores the information collected for each product, such as its URL, title, original price, discount price, discount percentage, description, and materials.

  • failed_urls: Keeps a record of every URL where a scraping attempt failed, which makes it easy to trace problems later for debugging.


The function commits these changes so the tables are actually created and then returns a connection object that can be used for further database operations. This setup ensures that the information gathered during web scraping is kept in an organized manner.


Getting Unscraped Product URLs from Database

def get_pending_urls(conn):
    """
    Retrieves all pending product URLs from the database.

    This function queries the `zara_product_urls` table to fetch 
    all URLs where the `scraping_status` is set to 0, indicating 
    that the product has not yet been scraped. It returns a list 
    of these pending URLs.

    Args:
        conn (sqlite3.Connection): A connection object to the SQLite 
        database.

    Returns:
        list: A list of product URLs (str) where `scraping_status` is 0.
    """
    cursor = conn.cursor()
    cursor.execute(
        'SELECT product_url FROM zara_product_urls WHERE scraping_status = 0'
    )
    return [
        row[0] for row in cursor.fetchall()
    ]

The get_pending_urls function fetches all the product URLs stored in the database that have not yet been scraped. It requires a connection object to the SQLite database so it can interact with it. Inside the function, a cursor is created to run SQL queries, and a query selects every URL from the zara_product_urls table where scraping_status is 0, meaning the product is still pending. The matching rows are fetched and returned as a list of URL strings. This is very useful for determining which products are still due for scraping and helps ensure the scraping run is complete and thorough.

Updating Scraping Status for Zara Product URLs

def update_scraping_status(conn, url, status):
    """
    Updates the scraping status of a given product URL in the database.

    This function modifies the `scraping_status` of a specific product URL 
    in the `zara_product_urls` table. The status value is typically used 
    to indicate whether a product has been successfully scraped (e.g., 
    1 for success, 0 for pending, or another value for failure).

    Args:
        conn (sqlite3.Connection): A connection object to the SQLite database.
        url (str): The product URL whose scraping status needs to be updated.
        status (int): The new scraping status value to be set.

    Returns:
        None: The function commits the change to the database and does not 
        return any value.
    """
    cursor = conn.cursor()
    cursor.execute(
        'UPDATE zara_product_urls SET scraping_status = ? WHERE product_url = ?',
        (status, url)
    )
    conn.commit()

The update_scraping_status function records the new scraping status of a given product URL in the database. Its arguments are a connection to the SQLite database, the URL of the product whose status needs to change, and an integer value representing the new status. The function creates a cursor object, which is used to execute SQL statements, and runs an SQL UPDATE that sets scraping_status for the specified product URL to the new value. Once the update succeeds, the function commits the change to the database so it is actually saved. This function is key to monitoring the scraping process, since it lets the program know which URLs have been scraped successfully and which are still pending.

Log Scraping Failures in Database

def store_failed_url(conn, url):
    """
    Stores a failed URL in the database.

    This function inserts a URL into the `failed_urls` table if a 
    scraping attempt for that URL has failed. The `INSERT OR IGNORE` 
    clause ensures that the URL is only added if it is not already 
    present in the table, preventing duplicate entries.

    Args:
        conn (sqlite3.Connection): A connection object to the SQLite database.
        url (str): The URL of the product that failed to be scraped.

    Returns:
        None: The function commits the change to the database and does not 
        return any value.
    """
    cursor = conn.cursor()
    cursor.execute('INSERT OR IGNORE INTO failed_urls (url) VALUES (?)', (url,))
    conn.commit()

The store_failed_url function records failed URLs in the database during the web scraping run. It accepts two parameters: a SQLite database connection object and the URL of the product that failed to be scraped. A cursor is first created for executing SQL commands, and then an INSERT OR IGNORE statement adds the failed URL to the failed_urls table. The OR IGNORE clause ensures that if the URL already exists in the table it is not added again, preventing duplicate entries. After the insertion, the function commits the change so the failed URL is persisted. This makes it easy to track the URLs that ran into problems and to debug or re-scrape them later.

Accessing Webpages Using Playwright and Random User Agents

# Web scraping functions
async def visit_website(url, user_agents):
    """
    Asynchronously visits a website using a random user agent and returns 
    the HTML content.

    This function leverages Playwright to asynchronously visit a webpage 
    with a random user agent from the provided list. It simulates a real 
    browser environment (non-headless) to avoid detection, waits for the 
    page to fully load, and then returns the HTML content for further 
    processing.

    Args:
        url (str): The URL of the website to visit.
        user_agents (list): A list of user agent strings. A random user 
                            agent is selected 
        from this list to simulate different browsing environments.

    Returns:
        str: The HTML content of the webpage as a string after the page 
             has loaded.

    Note:
        - The browser runs in non-headless mode to mimic real user behavior.
        - A 5-second delay is introduced to ensure the page has time to 
          fully load.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        user_agent = random.choice(user_agents)
        context = await browser.new_context(user_agent=user_agent)
        page = await context.new_page()
        await page.goto(url)
        await page.wait_for_timeout(5000)  # 5 second delay to allow page load
        html_content = await page.content()
        await context.close()
        await browser.close()
        return html_content

The visit_website function is an asynchronous function designed for web scraping: it visits a given website and retrieves its HTML while simulating real user behavior. It accepts two parameters: the URL of the site to visit and a list of user agent strings.


The function uses Playwright to open a browser window in non-headless mode, so you can actually see the browser interface as it works. To further disguise the scraping activity and avoid detection, a random user agent is chosen from the provided list. The function then navigates to the given URL, waits 5 seconds so the page can finish loading, captures the page's HTML, closes the browser context and browser, and returns the captured HTML as a string. This is useful whenever we collect data from websites while mimicking the way a real browser accesses content.


A user agent is an identifying string sent by the browser, device, or software making a request to the web server. It includes details such as the browser type, the operating system, and sometimes the specific device. In web scraping, user agents matter because they make a request look like a legitimate browser request that websites are less likely to detect or block. Web servers also analyze user agents to tailor content dynamically or to bar access from bots, so rotating or randomizing user agents further reduces the chance of being flagged or blocked.

For instance, here is a user agent string for Google Chrome on Windows:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36.
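
In this project the user agents live in a plain text file, one string per line, which the data scraping script reads at startup. A Data/user_agents.txt file might look something like this (these are illustrative desktop user agents; any realistic, current strings will do):

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0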

Extracting the Product Title from HTML

# Data extraction functions
def extract_title(soup):
    """
    Extracts the product title from the provided BeautifulSoup object.

    This function searches for the product title in the HTML content
    using the 'h1' tag with the class 'product-detail-info__header-name'. 
    If found, it returns the title  as a string; otherwise, it returns None.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the parsed 
                              HTML of the product page.

    Returns:
        str or None: The product title if found, otherwise None.
    """
    title_tag = soup.find('h1', class_='product-detail-info__header-name')
    return title_tag.text.strip() if title_tag else None

The extract_title function retrieves the title of a product from a webpage that has already been parsed into a BeautifulSoup object. It scans the HTML structure for the product title by targeting the <h1> tag with the class product-detail-info__header-name, extracts the text inside it, removes any leading or trailing whitespace, and returns the cleaned title as a string. If the title tag is not found in the HTML, the function returns None.

Extracting the Product Description from HTML

def extract_description(soup):
    """
    Extracts the product description from the provided BeautifulSoup
    object.

    This function searches for all 'div' tags with the class 
    'expandable-text__inner-content', then retrieves the text content
    from all 'p' tags within those 'div' tags. It concatenates the 
    extracted text into a single string, separated by commas.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing
                              the parsed HTML of the product page.

    Returns:
        str or None: A concatenated string of product descriptions
        if found, otherwise None.
    """
    divs = soup.find_all('div', class_='expandable-text__inner-content')
    
    if divs:
        description = [
            p.get_text() 
            for div in divs 
            for p in div.find_all('p')
        ]
    else:
        description = []

    return ", ".join(description) if description else None

The extract_description function retrieves the text describing a product from the parsed BeautifulSoup object. It finds all <div> elements with the class expandable-text__inner-content, which typically contain the product descriptions, and then, within each of those <div> elements, collects the text of every <p> tag. The extracted snippets are gathered into a list and joined into a single string with each description separated by a comma. If no matching <div> tags or paragraphs are found, the function returns None, indicating that no description was found.


Extracting Price Information from HTML

def extract_price(soup):
    """
    Extracts original and discount prices from the BeautifulSoup object.

    This function looks for the original price, discount price, and 
    discount percentage in the HTML content represented by the 
    BeautifulSoup object.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing 
        the parsed HTML of the product page.

    Returns:
        dict: A dictionary containing:
            - 'original_price': The original price of the product
                                (may be None if not found).
            - 'discount_price': The discounted price of the product
                               (may be None if not found).
            - 'discount_percentage': The discount percentage
                                     (may be None if not found).
    """
    price_info = {
        'original_price': None,
        'discount_price': None,
        'discount_percentage': None
    }

    price_tag = soup.find(
        'div', 
        class_='money-amount price-formatted__price-amount'
    )
    if price_tag:
        original_price = price_tag.find(
            'span', class_='money-amount__main'
        ).text.strip()
        price_info['original_price'] = original_price

    discount_percentage_tag = soup.find(
        'span', 
        class_='price-current__discount-percentage'
    )
    if discount_percentage_tag:
        price_info['discount_percentage'] = discount_percentage_tag.text.strip()

    discount_price_tag = soup.find(
        'span', 
        class_='price-current__amount'
    )
    if discount_price_tag:
        discount_price = discount_price_tag.find(
            'span', class_='money-amount__main'
        ).text.strip()
        price_info['discount_price'] = discount_price

    return price_info

The extract_price function extracts price information from the product page's HTML using BeautifulSoup. It initializes a dictionary with fields for the original price, the discounted price, and the discount percentage, all set to None. The function then looks in the HTML for the element holding the original price and, if found, updates the dictionary. It likewise looks for the discount percentage and the discount price in their respective HTML elements and records them when present. Finally, it returns the dictionary of extracted price information, leaving None wherever a detail could not be found.

Extracting Material Information from HTML

def extract_materials(soup):
    """
    Extracts information about the outer shell and recycled materials
    from the BeautifulSoup object.

    This function searches for the materials-related section of the 
    product page, identifying  both the outer shell material and any 
    recycled material used in the product. It retrieves  these details 
    from the specified HTML structure and returns them in a dictionary.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the 
        parsed HTML of the product page.

    Returns:
        dict: A dictionary containing:
            - 'outer_shell': The material used for the outer shell of 
                             the product (may be None if not found).
            - 'recycled_material': The recycled material used in the 
                                   product (may be None if not found).
    """
    materials_info = {
        'outer_shell': None,
        'recycled_material': None
    }

    # Extract outer shell material
    materials_section = soup.find(
        'div', 
        class_='product-detail-extra-detail__section', 
        attrs={'data-observer-key': 'materials'}
    )
    if materials_section:
        outer_shell = materials_section.find(
            'span', 
            class_='structured-component-text zds-heading-xs', 
            text='OUTER SHELL'
        )
        if outer_shell:
            materials_info['outer_shell'] = outer_shell.find_next(
                'span', 
                class_='structured-component-text zds-paragraph-m'
            ).text

    # Extract recycled material
    recycled_materials_section = soup.find(
        'div', 
        class_='product-detail-extra-detail__section', 
        attrs={'data-observer-key': 'recycledMaterials'}
    )
    if recycled_materials_section:
        recycled_material = recycled_materials_section.find(
            'span', 
            class_='structured-component-text zds-heading-xs', 
            text='RECYCLING MATERIAL'
        )
        if recycled_material:
            materials_info['recycled_material'] = recycled_material.find_next(
                'span', 
                class_='structured-component-text zds-paragraph-m'
            ).text

    return materials_info

The extract_materials function extracts information about the outer shell material and any recycled materials from the product's HTML using BeautifulSoup. It starts by initializing a dictionary with the keys outer_shell and recycled_material, both set to None. The function first locates the materials section of the product page by looking for a specific div element. If that section exists, it searches for the outer shell material by finding a span containing the text OUTER SHELL and, once found, pulls the material text from the following span element. It then repeats a similar process for recycled materials, searching another div section for the text "RECYCLING MATERIAL" and extracting the material text. Finally, the function returns the dictionary filled with the extracted material information, leaving None for anything that could not be found.

Extracting Care Instructions and Country of Origin from HTML

def extract_care_and_origin(soup):
    """
    Extracts care instructions and country of origin from the 
    BeautifulSoup object.

    This function parses the HTML content of a product page to 
    retrieve care instructions and the product's country of origin, 
    if available. The care instructions are extracted from a list of
    care-related items, while the country of origin is determined from 
    the relevant section of the page.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the 
        parsed HTML of the product page.

    Returns:
        dict: A dictionary containing:
            - 'care': A string of care instructions joined by commas 
                      (may be None if not found).
            - 'origin': The country of origin of the product 
                        (may be None if not found).
    """
    care_info = {
        'care': None,
        'origin': None
    }

    # Extract care instructions
    care_section = soup.find(
        'div', 
        class_='product-detail-extra-detail__section', 
        attrs={'data-observer-key': 'care'}
    )
    if care_section:
        care_instructions = []
        care_list = care_section.find(
            'ul', 
            class_='structured-component-icon-list'
        )
        if care_list:
            care_instructions = [
                li.find(
                    'span', 
                    class_='structured-component-text zds-paragraph-m'
                ).text 
                for li in care_list.find_all('li')
            ]
        care_info['care'] = ', '.join(care_instructions) if care_instructions else None

    # Extract country of origin
    origin_section = soup.find(
        'div', 
        class_='product-detail-extra-detail__section', 
        attrs={'data-observer-key': 'origin'}
    )
    if origin_section:
        origin_spans = origin_section.find_all(
            'span', 
            class_='structured-component-text zds-paragraph-m'
        )
        for span in origin_spans:
            if 'Made in' in span.text:
                care_info['origin'] = span.text.split(' ')[-1]
                break

    return care_info

The extract_care_and_origin function uses BeautifulSoup to extract a product's care instructions and country of origin from its HTML. It first builds a dictionary with the keys care and origin, both initialized to None. It then searches for the div section of the product page that holds the care instructions and, inside it, retrieves the list items from the inner ul element, extracting the text of each item's span into a list. If any care instructions are found, they are joined into a comma-separated string and stored under the care key. The function then looks at the origin section of the page, scanning its span elements for one containing the phrase "Made in"; when found, it takes the last word of that text as the country name and stores it under the origin key. Finally, it returns the populated dictionary, with None for any details that could not be found.


Asynchronous Scraping of Product Details and Database Storage

async def scrape_product_details(url, conn, user_agents):
    """
    Asynchronously scrapes product details from a given URL.

    This function visits the product page at the specified URL, 
    extracts key product details such as title, description, price,
    materials, care instructions, and country of origin, and stores
    the information in the `zara_product_data` table of the SQLite database. 
    If scraping fails, the URL is added to the `failed_urls` table.

    Args:
        url (str): The product URL to scrape.
        conn (sqlite3.Connection): A connection object to the SQLite
                                   database where product 
        details will be stored.
        user_agents (list): A list of user agent strings used to 
                            randomly select one for visiting the website.

    Extracted details include:
        - product_url: The URL of the product.
        - title: The product title.
        - description: The product description.
        - original_price: The original price of the product.
        - discount_price: The discounted price of the product (if any).
        - discount_percentage: The percentage discount (if applicable).
        - outer_shell: Information about the outer shell material.
        - recycled_material: Details of any recycled materials used.
        - care: Care instructions for the product.
        - origin: The country of origin of the product.

    Database operations:
        - The product details are inserted or replaced in the 
          `zara_product_data` table.
        - If scraping fails, the failed URL is added to the `failed_urls`
         table.

    Returns:
        None: The function commits the changes to the database and does
        not return any value.

    Exceptions:
        - If any error occurs during scraping, it is caught, the failed
          URL is printed, and the URL is stored in the `failed_urls` table.
    """
    try:
        html_content = await visit_website(url, user_agents)
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract product details
        product_details = {
            'product_url': url,
            'title': extract_title(soup),
            'description': extract_description(soup),
            **extract_price(soup),
            **extract_materials(soup),  # Call the materials extraction function
            **extract_care_and_origin(soup)  # Call the care and origin function
        }

        cursor = conn.cursor()
        cursor.execute('''INSERT OR REPLACE INTO zara_product_data 
                          (product_url, title, original_price, 
                           discount_price, discount_percentage, 
                           description, outer_shell, recycled_material, 
                           care, origin) 
                          VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''',
                       (product_details['product_url'], 
                        product_details['title'], 
                        product_details['original_price'], 
                        product_details['discount_price'], 
                        product_details['discount_percentage'], 
                        product_details['description'],
                        product_details['outer_shell'], 
                        product_details['recycled_material'], 
                        product_details['care'], 
                        product_details['origin']))
        conn.commit()
    except Exception as e:
        print(f"Failed to scrape {url}: {e}")
        store_failed_url(conn, url)

The scrape_product_details function scrapes product details from a URL and writes the output to the SQLite database. It is asynchronous and takes three parameters: the product URL to scrape, the SQLite connection object, and the list of user agents used to emulate different browsing environments.


The function begins by calling visit_website to fetch the HTML content of the product page. Once the HTML is retrieved, BeautifulSoup parses it, and the appropriate extraction functions pull out the product details: title, description, original price, discount price, discount percentage, outer shell material, recycled materials, care instructions, and country of origin.


The function then inserts the retrieved product details into the zara_product_data table. It uses an INSERT OR REPLACE statement, which adds a new record or replaces the existing one if the product URL is already present, and commits the change to the database after a successful insertion.


The function also catches any error that arises during scraping, such as the page failing to load or a database problem. When an exception occurs, a message is printed naming the URL where the error happened, and store_failed_url is called to record the URL in the table of unsuccessful attempts. This keeps track of every attempt to collect product data and makes failed scrapes easier to troubleshoot, so the overall process of gathering and storing product data stays as reliable as possible.

Main Orchestration for Asynchronous Web Scraping

# Main execution function
async def main(db_name, user_agents_file):
    """
    Orchestrates the web scraping process for product details.

    This function serves as the main entry point for the scraping 
    process. It initializes the database connection, loads user agent
    strings from a specified file, retrieves URLs that need to be scraped, 
    and iteratively scrapes product details from each URL. After successfully 
    scraping a URL, it updates the scraping status in the database.

    Args:
        db_name (str): The name of the SQLite database file where product 
                       details will be stored.
        user_agents_file (str): The path to a text file containing user 
                                agent strings, one per line.

    Returns:
        None: This function does not return a value but performs database
        operations and initiates web scraping.

    Process:
        - Sets up the database and creates necessary tables.
        - Reads user agents from the provided file.
        - Fetches URLs with a scraping status of 0 (pending) from the database.
        - For each pending URL, it:
            - Scrapes product details.
            - Updates the scraping status to 1 (indicating a successful scrape).
            - Introduces a random delay between requests to mimic human 
              behavior and avoid detection.

    Note:
        - The function is asynchronous and utilizes the `asyncio` library 
          for concurrent execution,  allowing for efficient handling of 
          multiple scraping tasks.
    """
    conn = setup_database(db_name)

    # Load user agents
    with open(user_agents_file, 'r') as f:
        user_agents = [line.strip() for line in f.readlines()]

    # Get pending URLs
    pending_urls = get_pending_urls(conn)

    for url in pending_urls:
        print(f"Scraping URL: {url}")
        await scrape_product_details(url, conn, user_agents)
        # Update scraping_status to 1 after successful scrape
        update_scraping_status(conn, url, 1)  
        # Random delay between requests
        await asyncio.sleep(random.uniform(1, 3))  

The main function is the core of the scraper, coordinating the entire workflow. It starts by connecting to the SQLite database where the product details will be stored and setting up the necessary tables. It then reads a list of user agent strings from a file; these strings simulate various web browsers so the scraper is less likely to be recognized as a bot.


The function fetches the product URLs from the database whose status is still "pending" (0), meaning they have not yet been scraped. For each URL, it prints the URL, calls scrape_product_details to collect the product details, and then sets the scraping status to 1 in the database to mark the URL as processed. A random delay of 1 to 3 seconds is introduced between requests to make the traffic look less automated. Because the function is asynchronous, page loads and delays do not block the event loop, even though the URLs in this loop are processed one after another. In short, this function drives the entire operation, from fetching pending URLs to storing the extracted data in the database.

Main Entry Point for the Web Scraping Script

if __name__ == "__main__":
    """
    Entry point for the web scraping application.

    This block of code is executed when the script is run directly. It 
    initiates the asynchronous scraping process by calling the `main` function
    with specified arguments:
    - The name of the SQLite database file ('zara_webscraping.db') where product
      details will be stored.
    - The path to the text file ('Data/user_agents.txt') containing user agent
      strings to be used during the scraping process.

    The `asyncio.run()` function is used to execute the `main` coroutine, 
    ensuring that  the asynchronous event loop is properly managed.
    """
    asyncio.run(main('zara_webscraping.db', 'Data/user_agents.txt'))

This block of code defines the entry point of the web scraping application. It executes only when the script is run directly by Python rather than imported as a module, and it starts the asynchronous scraping process by calling the main function with two arguments:

  • Database Name: 'zara_webscraping.db', the SQLite database that stores all the product information.

  • User Agents File: 'Data/user_agents.txt', a text file containing a list of user agent strings that simulate different browsers while scraping.

asyncio.run() manages the program's asynchronous event loop so the main coroutine executes correctly and every task runs inside that loop. This keeps the script from freezing while pages load or while delays run between requests.


Libraries and Versions

This code utilizes several key libraries to perform web scraping and data processing. The versions of the libraries used in this project are as follows: BeautifulSoup4 (v4.12.3) for parsing HTML content, Requests (v2.32.3) for making HTTP requests, Pandas (v2.2.3) for data manipulation, and Playwright (v1.47.0) for browser automation. These versions ensure smooth integration and functionality throughout the scraping workflow.
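
If you prefer to analyse the results outside SQLite, the scraped table can be pulled into a Pandas DataFrame and exported to CSV in a few lines. A minimal sketch, assuming the zara_webscraping.db file produced by the steps above:

import sqlite3
import pandas as pd

# Load every scraped product record into a DataFrame
conn = sqlite3.connect('zara_webscraping.db')
df = pd.read_sql_query("SELECT * FROM zara_product_data", conn)
conn.close()

# Export to CSV for spreadsheet tools or further analysis
df.to_csv('zara_product_data.csv', index=False)
print(df.head())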


Connect with Datahut for top-notch web scraping services that bring you the information you need, hassle-free.




