Ambily Biju

Chasing Trends: A Comprehensive Guide to Scraping John Lewis Fashion



The John Lewis website is a popular online store known for a wide range of high-quality merchandise, including women's clothing, as well as for its strong customer service. Its catalogue spans many styles and brands to meet the needs of different customers. The aim of this web scraping project is to gather key information about women's clothing products, which helps in understanding what is available in the market and how consumers behave.

Web scraping is an automated data extraction technique that involves retrieving and parsing information from web pages. It allows us to collect large volumes of data efficiently, which would be time-consuming and labor-intensive to gather manually. By utilizing web scraping, we can bypass the limitations of traditional data collection methods and harness the power of the internet to access a wealth of information. This technique is particularly valuable in the retail sector, where dynamic pricing, product availability, and consumer reviews are crucial for making informed decisions.


The web scraping process is divided into two main steps:

  1. Product URL Scraping:

    • In this step, we systematically extract the links to individual product pages on the John Lewis website.

    • This initial step ensures that we capture all relevant products available for women’s clothing, providing a comprehensive list of URLs for further processing.

  2. Final Data Scraping from Product URLs:

    • In this step, we perform data extraction from the product URLs gathered in the first step.

    • Specific details about each item are extracted, including product names, prices, descriptions, images, ratings, and customer reviews.

This structured approach not only enhances the efficiency of our data collection but also ensures that we obtain rich, comprehensive information that can be used for analysis and reporting. The scraped data is subsequently cleaned and refined using OpenRefine and Python to ensure accuracy and enhance usability for analysis.


An Overview of Libraries for Seamless Data Extraction


Requests

Requests is a library for sending HTTP requests when scraping the web. During product URL scraping, the script uses requests.get() to fetch JSON data from the John Lewis API. During final data scraping, it fetches the HTML content of product pages by sending GET requests to the product URLs collected earlier. Requests makes otherwise fiddly tasks, such as managing query parameters and handling response data, much more manageable, which makes it a good fit for web scraping jobs.
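
As a minimal illustration of this pattern (the endpoint below is a placeholder, not one of the John Lewis URLs used later in this guide):

import requests

# Hypothetical endpoint, used only to illustrate the request pattern
url = "https://www.example.com/api/products"
headers = {"User-Agent": "Mozilla/5.0"}  # mimic a browser

response = requests.get(url, headers=headers, params={"page": 1}, timeout=30)
if response.status_code == 200:
    data = response.json()  # JSON APIs parse straight into Python objects
    print(type(data))
else:
    print(f"Request failed with status {response.status_code}")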


BeautifulSoup


BeautifulSoup, part of the bs4 package, is used to parse HTML and XML documents. In the final data scraping phase, after the HTML content has been retrieved with requests, a BeautifulSoup parse tree is built to navigate the document structure. It offers several ways to find elements by tag, class, or id, making it easy to extract specific data such as product titles, prices, and descriptions.
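
A small self-contained sketch of how BeautifulSoup locates elements by tag, class, or id:

from bs4 import BeautifulSoup

html = """
<div id="product">
  <h1 class="title">Linen Shirt</h1>
  <p class="price">£39.00</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find elements by tag and class, or by id
title = soup.find("h1", class_="title").text.strip()
price = soup.find("p", class_="price").text.strip()
container = soup.find(id="product")

print(title, price, container.name)  # Linen Shirt £39.00 div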


SQLite3


SQLite3 offers a lightweight, embedded database solution in Python, which makes it a good fit where a full-scale, separate database server isn't needed. During product link scraping, SQLite3 is used to create and manage the local SQLite database that stores the extracted product URLs. The script connects to the database, creates tables if they don't exist, and inserts data while handling possible duplicate entries through a UNIQUE constraint on the URL field. In the final part of the scraping process, SQLite3 is used to store the product information, making the data easy to manage and retrieve later.
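
A minimal sketch of the same pattern, assuming a throwaway demo.db file and a urls table with a UNIQUE constraint:

import sqlite3

conn = sqlite3.connect("demo.db")  # creates the file if it does not exist
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS urls "
    "(id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE)"
)

for url in ["https://example.com/a", "https://example.com/a", "https://example.com/b"]:
    try:
        cursor.execute("INSERT INTO urls (url) VALUES (?)", (url,))
    except sqlite3.IntegrityError:
        pass  # duplicate URL, skipped thanks to the UNIQUE constraint

conn.commit()
print(cursor.execute("SELECT COUNT(*) FROM urls").fetchone()[0])  # 2
conn.close()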


Random

The Random module is a built-in Python library that provides functions for generating random numbers and choosing random items. In web scraping, it makes the process look more human by selecting random user-agent strings and varying the time between requests, which helps avoid detection and reduces the chance of being banned from the website.
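
For instance, picking a user agent and a delay might look like this (the agent strings are illustrative only):

import random

# Illustrative pool; a real project would keep many more strings
user_agents = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

headers = {"User-Agent": random.choice(user_agents)}  # a different agent per request
delay = random.uniform(2, 6)                          # a random 2-6 second pause
print(headers["User-Agent"], round(delay, 2))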


Time


The Time module is another built-in library offering a set of time-related functions. In the context of web scraping, it is mainly used to introduce pauses or delays in script execution, simulating natural browsing behavior. The Time library is used throughout the scraping code to add delays between requests, in keeping with polite scraping practice.
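
A typical polite-delay loop, with placeholder URLs, might look like this:

import time
import random

urls = ["https://example.com/p1", "https://example.com/p2", "https://example.com/p3"]

for url in urls:
    # ... send the request for `url` here ...
    time.sleep(random.uniform(2, 6))  # pause 2-6 seconds before the next request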


Why SQLite Outperforms CSV for Web Scraping Projects


SQLite is an excellent choice for storing scraped data because of its simplicity, reliability, and efficiency. It is a self-contained, serverless, zero-configuration database, so it can be integrated into the web scraping workflow without the overhead of setting up and maintaining a larger database system. It stores data directly on the file system, which is ideal for scraping, where large numbers of records or URLs need to be stored locally and quickly. Its lightweight design allows fast data retrieval with simple queries, making it far more efficient than relying on CSV files, which become cumbersome and slow once the data grows large.


One of SQLite's most valuable features for web scraping is that it handles interruptions gracefully. By recording the scraping status of each URL in the database, we can keep track of which URLs have been scraped successfully and which have not. If scraping is interrupted by a network failure or a system crash, we can resume from where we left off instead of repeating the same URLs. This is especially helpful when scraping large datasets, where an interruption would otherwise cost us all our progress.
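
A minimal sketch of this resume pattern, assuming the urls table carries a scraping_status flag (0 = pending, 1 = done) as the second script in this guide expects; the helper names next_pending_urls and mark_scraped are chosen here for illustration:

import sqlite3

DB = "john_lewis_webscraping.db"

def next_pending_urls(limit=100):
    """Return up to `limit` URLs that have not been scraped yet."""
    conn = sqlite3.connect(DB)
    rows = conn.execute(
        "SELECT url FROM urls WHERE scraping_status = 0 LIMIT ?", (limit,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

def mark_scraped(url):
    """Flag a URL as done so a restarted run skips it."""
    conn = sqlite3.connect(DB)
    conn.execute("UPDATE urls SET scraping_status = 1 WHERE url = ?", (url,))
    conn.commit()
    conn.close()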


Methodology for Extracting Women's Clothing Data


STEP 1 : Product URL scraping


The script for scraping John Lewis clothing product URLs automates the process of collecting product URLs from various categories. It starts by setting up a SQLite database in which unique product URLs are stored so that there are no duplicates. It also defines global HTTP headers to mimic legitimate browser requests while interacting with the website's API.


A crucial part of the script is the PAGES dictionary, which holds category-specific data used to dynamically build the URLs for requesting JSON files from the John Lewis website. The script sends HTTP requests to each of those URLs, retrieves the JSON responses, and extracts product URLs from objects labeled 'product'. These URLs are then stored in the SQLite database in a structured and organized manner, with duplicate entries handled gracefully so that only unique URLs are kept. The script is built modularly: creating the database, generating request URLs for the API, fetching product data, and saving the results to the database. This script forms the first step of the web scraping pipeline; it supplies the product URLs on which the subsequent data extraction is based.


Importing Libraries

import requests
import sqlite3

This script imports requests for retrieving web data, and sqlite3 for managing SQLite databases.

Database Initialization

# Database setup
DB_NAME = 'john_lewis_webscraping.db'

This line of code sets the name of the SQLite database that will be used to store the scraped data. The variable DB_NAME is assigned the value 'john_lewis_webscraping.db', so the database will be created or accessed under this filename.

Defining Headers to Simulate Browser Behavior

# HTTP headers defined globally
HEADERS = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7,ml;q=0.6',
    'Cache-Control': 'no-cache',
    'User-Agent': (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/125.0.0.0 Safari/537.36'
),
    'Cookie': (
        'ak_bmsc=3CC4C7AA93F86407924DF90D6DB64E27~000000000000000000000000000000~'
        'YAAQyQDeF2jQOTySAQAAEpAXRxl7kJkv2BnukKaCgsx8FNqRLuvW1P4F9XM+uk6guy8qlIvd+'
        'XMts+qOMgRDQLnaNP7DXKccQsp3Q7HDFGykERoCBdeUHT/WSG8EGeQoZeW7svpwx81GJ5I1sf'
        '6iJYUNGa6Dn0YN4yQ1i3/2ZHsP8Z7xpUuK2gV5am/e3xD2BZg9LDr/uqelGhbKCu5ttwo2ZKg'
        'FibfN+W2o+aBWuro51vrN1DoBHrgMbUNT7oINqJi9gtX5c3yIzuhskgxbBNfVhsgQ4ZW8sFtA'
        'jFIVU9d34w6YusX1qdIFjH+wP8Uwfj+SdoMex0BY0xKddiYQhHw6wcO/qmnooH2qcxSSLPlw8'
        'goW/xOL9MyInOECJA==; mt.v=5.1670165555.1727760274135'
)
}

This code snippet defines a dictionary called HEADERS, which contains key-value pairs representing the HTTP headers to be used in web requests during the scraping process.

HTTP headers are crucial for facilitating communication between the client (our web scraper) and the server. They provide additional context about the request being made, which matters when scraping data from websites. By customizing these headers, the scraping script can better mimic a standard web browser, increasing the likelihood of successfully retrieving data from the target website while avoiding potential blocks or restrictions.

Defining PAGES Dictionary for Dynamic Network API Requests

# Define the PAGES dictionary with corresponding facet IDs and subcategories
# This dictionary is used to dynamically construct URLs that return JSON files 
# from a network API endpoint. These JSON files contain product data for 
# different categories of women's clothing.

PAGES = {
    "petite": (1, 4, "N-fls"),
    "plus-size": (1, 8, "N-flt"),
    "womens-shirts-tops": (1, 33, "N-fm4"),
    "womens-shorts": (1, 4, "N-fm1"),
    "womens-skirts": (1, 8, "N-fm2"),
    "all-womens-sportswear-brands": (1, 7, "N-5w3p"),
    "womens-sweatshirts-hoodies": (1, 3, "N-pgxz"),
    "womens-swimwear-beachwear": (1, 9, "N-fm3"),
    "womens-trousers-leggings": (1, 13, "N-fm5"),
    "linen": (1, 8, "N-6vko"),
    "womens-jumpsuits-playsuits": (1, 5, "N-fly"),
    "womens-jumpers-cardigans": (1, 11, "N-pgz6"),
    "womens-jeans": (1, 6, "N-7j5h"),
    "womens-holiday-shop": (1, 4, "N-7ljz"),
    "womens-dresses": (1, 47, "N-flw"),
    "co-ords-suits": (1, 3, "N-pju0"),
    "womens-coats-jackets": (1, 14, "N-flv"),
}

The PAGES dictionary defines the categories of women's clothing to be scraped. Each category is associated with three values: the starting page number, the ending page number of the product listings, and a unique facet ID for that category. These values are later used to dynamically construct the URLs that request JSON files from the network API endpoint for each clothing category. For example, "womens-dresses" spans pages up to 47 with a facet ID of "N-flw", which lets the API walk through all the products in that category. This structure makes it easy to iterate over multiple categories and collect product URLs in a clean, structured format directly from the API, avoiding complex HTML scraping. It sidesteps page rendering and infinite scrolling problems and simplifies fetching product URLs.


Scraping product data from a JSON API endpoint has clear advantages over traditional web scraping. JSON responses provide much cleaner, structured data, which is far easier to parse than extracting information from complicated HTML. There is no infinite scrolling to deal with, which makes the process quicker and more efficient. The API also tends to be more stable and is less likely to break when the website layout changes. Finally, scraping the network API reduces the chance of triggering CAPTCHAs or getting blocked, making it a more reliable and streamlined way of retrieving product information.
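
To make this concrete, here is a sketch of fetching a single JSON chunk for one category and pulling out the product links. The endpoint and query parameters mirror the ones used in the full script below, and the response is assumed, as the script does, to be a list of items carrying 'type' and 'url' keys:

import requests

headers = {"User-Agent": "Mozilla/5.0"}  # the fuller header set shown above works better in practice
url = (
    "https://www.johnlewis.com/standard-plp/api/product-chunks"
    "?page=1&chunk=1&term=&type=browse"
    "&facetId=women,womens-dresses,_,N-flw"
    "&sortBy=&price=&priceBands=&listHead=&lolcode="
)

response = requests.get(url, headers=headers, timeout=30)
if response.status_code == 200:
    items = response.json()
    product_urls = [item.get("url") for item in items if item.get("type") == "product"]
    print(f"Found {len(product_urls)} product links in this chunk")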

Establishing the Database Connection

def create_database():
    """
    Create SQLite database and the 'urls' table if they do not exist.

    This function establishes a connection to the SQLite database specified by
    the `DB_NAME` variable. If the database doesn't exist, it will be created.
    The function also creates a table named 'urls' with two columns: 
    an auto-incrementing 'id' and a unique 'url'. This ensures that
    product URLs can be stored and retrieved in an organized manner.

    The table schema:
        - id (INTEGER, PRIMARY KEY, AUTOINCREMENT): Unique identifier for each 
                                                    product entry.
        - url (TEXT, UNIQUE): URL of the product, ensuring no duplicates.

    Returns:
        None
    """
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS urls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE
        )
    ''')
    conn.commit()
    conn.close()

The create_database function is responsible for setting up the database for the web scraping script. It connects to the SQLite database specified by the DB_NAME variable, creating it if it does not already exist. A table named urls is created to store the product URLs; it contains two columns: id, a unique identifier that auto-increments with each new entry, and url, the product URL with a UNIQUE constraint to avoid repetition. This keeps all the product URLs well organized and easily accessible for further processing. After creating the table, the function commits the changes and closes the database connection to ensure the integrity of the data. Consolidating the database setup into this function keeps the code clear and modular, and therefore easier to maintain. It lays a solid foundation for storing and retrieving the URLs collected by the scraping operation.


Function to Generate URLs for Product Data Scraping

def get_urls_to_scrape(pages):
    """
    Generate a list of URLs to scrape for product data from each category.

    This function constructs URLs based on the categories defined in the `PAGES`
    dictionary. Each category has a range of pages to scrape, and within each page,
    multiple chunks (assumed to be 8) are queried. The URLs point to JSON-based API 
    responses from the John Lewis website, where product details can be extracted.

    Args:
        pages (dict): A dictionary containing categories as keys, and a tuple as 
                      values where:
                        - The first element is the starting page number.
                        - The second element is the ending page number.
                        - The third element is the facet identifier (`n_value`).

    Returns:
        list: A list of complete URLs to scrape product data for all categories 
              and pages.
    """
    base_url = "https://www.johnlewis.com/standard-plp/api/product-chunks"
    urls_to_scrape = []
    
    for facet, (start_page, end_page, n_value) in pages.items():
        for page in range(start_page, end_page):
            for chunk in range(1, 9):  # Assuming 8 chunks per page
                url = (f"{base_url}?page={page}&chunk={chunk}&term=&type=browse"
                       f"&facetId=women,{facet},_,{n_value}&sortBy="
                       f"&price=&priceBands=&listHead=&lolcode=")
                urls_to_scrape.append(url)

    return urls_to_scrape

The get_urls_to_scrape function constructs the list of URLs used to collect product URLs from the different categories on the John Lewis website. Its input is the predefined PAGES dictionary, where each key is a category and each value is a tuple containing the starting and ending page numbers, along with a facet identifier, n_value, used to filter the products.


To begin constructing the URLs, the function defines a base URL, https://www.johnlewis.com/standard-plp/api/product-chunks, which is the endpoint that returns the JSON responses. An empty list, urls_to_scrape, is initialized to hold the generated URLs.


The function then loops over each category in the PAGES dictionary, retrieving the start_page, end_page, and n_value for each one. For every category, it iterates over page numbers from start_page up to (but not including) end_page, since range() excludes its end value. Within that loop, it iterates over chunks 1 through 8, reflecting the assumption that each page of the website holds eight chunks of product data. For each iteration, the function builds a formatted URL string with parameters such as page, chunk, and facetId, which the API needs in order to return the desired product data in JSON format. The constructed URL is appended to the urls_to_scrape list.


Finally, the function returns the full list of URLs, enabling efficient data extraction across multiple pages and product categories. This structured approach lets users systematically extract detailed product information from the API.
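
As a quick sanity check, the generator can be run against a single category; with the (1, 4, "N-fm1") tuple, pages 1 to 3 are covered (the end page is exclusive), giving 24 request URLs:

# Quick check using one category from PAGES
sample_pages = {"womens-shorts": (1, 4, "N-fm1")}
urls = get_urls_to_scrape(sample_pages)

print(len(urls))  # 24 -> pages 1-3, eight chunks per page
print(urls[0])    # first constructed API URL for the category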


Function to Retrieve Product URLs from API Responses

def fetch_product_urls(urls):
    """
    Fetch product URLs from a list of API URLs.

    This function sends HTTP GET requests to a list of URLs (expected to return 
    JSON responses). It extracts and returns the product URLs from each response 
    where the item type is marked as 'product'. If a request fails, the function 
    prints an error message with the status code.

    Args:
        urls (list): A list of URLs to fetch product data from. These URLs are 
                     expected to point to an API returning JSON responses.

    Returns:
        list: A list containing the product URLs extracted from the JSON data.
              Only items marked as 'product' in the response are considered.

    Error Handling:
        - If a request fails (non-200 status code), an error message is printed 
          with the respective URL and status code.
    """
    all_product_urls = []

    for url in urls:
        response = requests.get(url, headers=HEADERS)
        
        if response.status_code == 200:
            json_data = response.json()
            all_product_urls.extend(
                item.get('url') 
                for item in json_data 
                if item.get('type') == 'product'
            )
        else:
            print(f"Failed to fetch data from {url}. "
                  f"Status code: {response.status_code}")

    return all_product_urls

The fetch_product_urls function retrieves product URLs by sending HTTP GET requests to a list of API URLs that return JSON responses, extracting the product information, and returning the URLs of items categorized as 'product'. When a request fails (i.e., the status code is not 200), an error message is printed with the URL that could not be reached and its status code.

It accepts one input parameter, urls, a list of URLs pointing to API endpoints that are expected to return JSON responses containing product URLs. The function initializes an empty list called all_product_urls to store the extracted URLs. It then cycles through each URL in the list and sends an HTTP GET request using the requests library, with the headers defined in the HEADERS variable.


Upon receiving a response, the function checks its status code. If the status code is 200, indicating a successful request, the function parses the JSON data and extracts the URLs of items marked as 'product', appending them to the all_product_urls list. If the status code is not 200, it prints an error message identifying the failed URL and its status code. Finally, the function returns the list of product URLs gathered from the API responses; only items labeled 'product' in the JSON data are included.


Function to Save Product URLs to SQLite Database

def save_to_database(urls):
    """
    Save product URLs to the SQLite database.

    This function inserts the provided product URLs into the 'urls'
    table of the SQLite database. Each product URL is combined with a 
    base URL (John Lewis's domain) to form a full product URL before 
    insertion.If a URL already exists in the database 
    (i.e., a duplicate entry), it is skipped using error handling
    for SQLite's IntegrityError.

    Args:
        urls (list): A list of relative product URLs to be saved. These
                    URLs are expected to be relative paths that will 
                    be prepended with the base URL 
                    (https://www.johnlewis.com).

    Returns:
        None

    Database:
        - Connects to the SQLite database specified by `DB_NAME`.
        - Inserts each full product URL into the 'urls' table.
        - Automatically ignores duplicate URLs using the UNIQUE 
          constraint on the 'url' column, ensuring no 
          duplicate entries are saved.

    Error Handling:
        - Catches and ignores sqlite3.IntegrityError to skip duplicate URLs.
    """
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    base_url = "https://www.johnlewis.com"

    for url in urls:
        full_url = f"{base_url}{url}"  # Prepend base URL to the product URL
        try:
            cursor.execute(
                'INSERT INTO urls (url) VALUES (?)', 
                (full_url,)
            )
        except sqlite3.IntegrityError:
            # Ignore duplicates
            pass

    conn.commit()
    conn.close()

The save_to_database function inserts product URLs into the 'urls' table of the SQLite database. It takes a list of relative product URLs as input and combines each with the base URL of John Lewis's domain to create complete URLs for insertion into the database.


The function opens a connection to the SQLite database defined by the DB_NAME variable and creates a cursor object to run SQL commands. It sets the base URL to "https://www.johnlewis.com". Then, for each URL in the input list, it prepends the base URL to form a full product URL. Inside a try-except block, it attempts to run an INSERT statement that adds the full URL to the 'urls' table. If the URL already exists in the database, the UNIQUE constraint is violated and SQLite raises an IntegrityError; the function catches the error and continues, effectively skipping the duplicate entry without disrupting the rest of the process.


After attempting to add all the URLs, the function commits the changes to the database and closes the connection. This design ensures that every unique product URL is stored exactly once, maintaining the integrity of the database.
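
As a design note, the same de-duplication can be expressed with SQLite's INSERT OR IGNORE, which silently skips rows that would violate the UNIQUE constraint instead of raising IntegrityError. A minimal variant sketch (the function name is chosen here for illustration):

def save_to_database_or_ignore(urls):
    """Variant of save_to_database using INSERT OR IGNORE (sketch)."""
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    base_url = "https://www.johnlewis.com"

    cursor.executemany(
        'INSERT OR IGNORE INTO urls (url) VALUES (?)',
        [(f"{base_url}{url}",) for url in urls]
    )

    conn.commit()
    conn.close()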

Main Function to Orchestrate the Web Scraping and Data Saving Process

def main():
    """
    Main function to orchestrate the entire scraping and saving process.

    This function performs the following steps:
    
    1. Creates the SQLite database and initializes the 'urls' table if it 
       doesn't exist by calling `create_database()`.
    2. Generates a list of URLs to scrape for product data using the `PAGES` 
       dictionary by calling `get_urls_to_scrape()`.
    3. Fetches product URLs from the generated URLs using the `fetch_product_urls()` 
       function.
    4. Saves the fetched product URLs to the SQLite database on a category-wise
       basis by calling `save_to_database()`. Only URLs matching each category
       are saved in batches. A confirmation message is printed for each category
       after saving its URLs.

    Returns:
        None

    Workflow:
        - Database setup: Creates or connects to the database and table.
        - URL generation: Creates API URLs to scrape product data for each
                          category.
        - Data fetching: Extracts product URLs from the JSON API responses.
        - Data saving: Inserts product URLs into the database, handling 
                       duplicates.
    """
    create_database()
    
    # Generate URLs to scrape
    urls_to_scrape = get_urls_to_scrape(PAGES)

    # Fetch product URLs
    product_urls = fetch_product_urls(urls_to_scrape)

    # Save URLs to database batch-wise
    for category in PAGES.keys():
        category_urls = [
            url for url in product_urls 
            if category in url
        ]
        save_to_database(category_urls)
        print(
            f"Saved {len(category_urls)} URLs for category "
            f"'{category}' to the database."
        )

The main function acts as the controller for the web scraping and data-saving process, keeping things organized and streamlined. It first calls create_database(), which connects to the SQLite database and defines the 'urls' table if it does not already exist, establishing a proper framework for data storage. After this setup, it builds the list of URLs to scrape by calling get_urls_to_scrape(PAGES), which generates the API endpoints for the different product categories based on the predefined PAGES dictionary. With the list in hand, it fetches product URLs using fetch_product_urls(urls_to_scrape), which sends HTTP GET requests and pulls the relevant product URLs out of the returned JSON data.


The function then iterates over each category in the PAGES dictionary and filters the fetched product URLs down to those matching that category. The filtered URLs are saved to the SQLite database with save_to_database(category_urls), so only relevant URLs are stored in each batch. To provide feedback on progress, a confirmation message is printed after each category's URLs are saved, showing how many were added to the database. Overall, the main function encapsulates the whole workflow, from database creation and URL generation to data retrieval and storage, managing each stage coherently.


Script Entry Point and Execution Trigger

if __name__ == '__main__':
    """
    Entry point of the script.

    This conditional block checks if the script is being run 
    directly (as opposed  to being imported as a module). 
    If so, it invokes the `main()` function to execute 
    the web scraping process, including database setup, URL generation, 
    data fetching, and saving product URLs to the SQLite database.
    """
    main()

The conditional block if __name__ == '__main__': serves as the entry point of the script, ensuring that the enclosed code runs only when the script is executed directly rather than imported as a module into another script. Inside this block, the main() function is called, which kicks off the whole web scraping process: setting up the SQLite database, generating the URLs needed for scraping product data, fetching the product URLs from the API responses, and saving those URLs into the database. This structure cleanly separates script execution from module importation, making the code easier to test and reuse.


STEP 2 : Final data scraping


This script gathers comprehensive product information from the John Lewis website using Python libraries such as requests, BeautifulSoup, and sqlite3. The main aim is to collect valuable information about each product.


The script uses an SQLite database to store and manage the scraped information efficiently. The stored details include product URLs, titles, brands, retail prices, sale prices, descriptions, specifications, and customer review data, organized so that the information is easily accessible and usable.


The code follows several best practices to optimize the scraping process and stay resilient against potential blocks by the website. Robust error handling and user-agent rotation help protect the integrity of the scraping operation and reduce the likelihood of interruptions.


The workflow is divided into modular functions, each handling a related task: creating and managing database tables, retrieving unscraped URLs, and scraping product details from individual product pages. This structured approach ensures thorough data collection while keeping the codebase clear and organized, which in turn makes insightful analysis of the collected data feasible.


Importing Libraries

import requests
import random
import time
import sqlite3
from bs4 import BeautifulSoup

This script uses libraries like requests for web requests, random and time for delays, sqlite3 for database management, and BeautifulSoup for HTML parsing.


Database Path Configuration

DB_PATH = 'john_lewis_webscraping.db'

The DB_PATH variable is essential for establishing a connection to the SQLite database, allowing for the creation of tables, insertion of data, querying, and ultimately managing the scraped data throughout the web scraping project.

Defining the Default Data Template

# Default data template
DEFAULT_DATA = {
    'url': '',
    'brand': 'N/A',
    'title': 'N/A',
    'original_price': 'N/A',
    'discount_price': 'N/A',
    'product_code': 'N/A',
    'description': 'N/A',
    'product_specification': 'N/A',
    'size_fit': 'N/A',
    'average_rating': 'N/A',
    'review_count': 'N/A'
}

DEFAULT_DATA is a dictionary used as a template for organizing and storing product information in the web scraping project. Each key corresponds to a product attribute: url, brand, title, original_price, discount_price, product_code, description, product_specification, size_fit, average_rating, and review_count. Initially the values are placeholders, with most attributes defaulting to an empty string or 'N/A' to indicate that the data is not yet available. This keeps the collected data consistent and provides a structured format for the scraped product details, making missing or incomplete data easier to handle during scraping.
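
In practice the template is merged with whatever fields a page actually yields; the product URL and title below are purely illustrative:

# Start from the defaults, then overwrite the fields that were successfully scraped
data = {**DEFAULT_DATA, 'url': 'https://www.johnlewis.com/some-product/p000000'}  # illustrative URL
data['title'] = 'Example Linen Dress'

print(data['title'])           # Example Linen Dress
print(data['average_rating'])  # N/A (not scraped yet)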


Creating Tables in SQLite Database

def create_tables():
    """
    Create the necessary tables in the SQLite database for storing scraped product data and logging failed URLs.

    This function creates two tables in the SQLite database:

    1. `scraped_data`: Stores details of scraped product data. Each product entry includes attributes such as:
       - `id`: Auto-incrementing primary key for each product.
       - `url`: The URL of the product.
       - `brand`: The brand of the product.
       - `title`: The title of the product.
       - `original_price`: The original price of the product (if available).
       - `discount_price`: The discounted price of the product (if available).
       - `product_code`: Unique identifier or code for the product (if available).
       - `description`: Text description of the product.
       - `product_specification`: Detailed product specifications (if available).
       - `size_fit`: Information regarding the product's size and fit (if available).
       - `average_rating`: Average customer rating for the product (if available).
       - `review_count`: The number of customer reviews for the product (if available).

    2. `failed_urls`: Stores URLs that failed to be scraped along with the reason for failure. This helps in 
       diagnosing scraping errors for specific URLs. Each entry includes:
       - `id`: Auto-incrementing primary key for each failed URL.
       - `url`: The URL that failed to be scraped.
       - `reason`: The reason for failure (e.g., connection error, parsing issue).

    The function establishes a connection to the SQLite database specified by the `DB_PATH` variable,
    executes SQL commands to create the tables if they don't already exist, commits the changes,
    and finally closes the database connection.
    """
    conn = sqlite3.connect(DB_PATH)
    c = conn.cursor()
    
    # Create scraped_data table
    c.execute('''
        CREATE TABLE IF NOT EXISTS scraped_data (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT,
            brand TEXT,
            title TEXT,
            original_price TEXT,
            discount_price TEXT,
            product_code TEXT,
            description TEXT,
            product_specification TEXT,
            size_fit TEXT,
            average_rating TEXT,
            review_count TEXT
        )
    ''')

    # Create failed_urls table
    c.execute('''
        CREATE TABLE IF NOT EXISTS failed_urls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT,
            reason TEXT
        )
    ''')

    conn.commit()
    conn.close()

The create_tables function is responsible for setting up the necessary tables in the SQLite database to store scraped product data and log any failed URLs encountered during the scraping process. It begins by establishing a connection to the database specified by the DB_PATH variable and creates a cursor for executing SQL commands.

Two tables are defined within this function:

  1. scraped_data Table: This table stores detailed information about each product that has been successfully scraped. It includes several attributes such as:

    • id: An auto-incrementing primary key for unique identification of each product.

    • url: The URL from which the product data was retrieved.

    • brand, title, original_price, discount_price, product_code, description, product_specification, size_fit, average_rating, and review_count: Various attributes providing comprehensive details about the product.

  2. failed_urls Table: This table logs URLs that could not be scraped, along with the reasons for their failure. The entries in this table include:

    • id: An auto-incrementing primary key for each failed URL.

    • url: The URL that could not be processed.

    • reason: A description of the failure, such as a connection error or parsing issue.

After defining the tables, the function commits the changes to the database and closes the connection, ensuring that the database is updated and ready for future operations. By utilizing this structured approach, the function facilitates efficient data storage and helps in diagnosing any scraping issues that arise.
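
The failed_urls table implies a small logging helper that is not shown in this excerpt; a minimal sketch of what it might look like (the name log_failed_url is an assumption, not part of the original script):

def log_failed_url(url, reason):
    """Record a URL that could not be scraped, with the reason (sketch)."""
    conn = sqlite3.connect(DB_PATH)
    c = conn.cursor()
    c.execute(
        'INSERT INTO failed_urls (url, reason) VALUES (?, ?)',
        (url, reason)
    )
    conn.commit()
    conn.close()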


Retrieving a Random User Agent from Database

def get_random_user_agent():
    """
    Retrieve a random user agent string from the `user_agents` table in the SQLite database.

    This function connects to the SQLite database defined by the `DB_PATH` variable and executes a SQL query to
    select a random user agent from the `user_agents` table. The `RANDOM()` function is used to ensure a random
    row is selected. The function returns the user agent string if a result is found, otherwise returns `None`.

    Returns:
        str: A randomly selected user agent string, or `None` if the table is empty.
    """
    conn = sqlite3.connect(DB_PATH)
    c = conn.cursor()
    c.execute("SELECT user_agent FROM user_agents ORDER BY RANDOM() LIMIT 1")
    result = c.fetchone()
    conn.close()
    return result[0] if result else None

The get_random_user_agent function fetches a random user agent string from the user_agents table in the SQLite database specified by the DB_PATH variable. Rotating user agents in this way helps a scraper avoid detection and blocking by the target website. The function connects to the SQLite database and creates a cursor to execute SQL commands, then runs a query that selects a random row from the user_agents table using the RANDOM() function and retrieves the result with fetchone(). If a user agent string is found, it is returned; otherwise, if the table is empty, the function returns None. Finally, the function closes the database connection to ensure proper resource management. This provides a simple way to supply a random user agent, strengthening and adding flexibility to the web scraping operation.
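
Note that the user_agents table is assumed to exist and be populated already; it is not created by the code shown here. A one-off setup script along these lines could provide it (the agent strings are illustrative):

import sqlite3

DB_PATH = 'john_lewis_webscraping.db'

agents = [
    # Illustrative strings only; keep an up-to-date list in practice
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

conn = sqlite3.connect(DB_PATH)
c = conn.cursor()
c.execute(
    "CREATE TABLE IF NOT EXISTS user_agents "
    "(id INTEGER PRIMARY KEY AUTOINCREMENT, user_agent TEXT UNIQUE)"
)
c.executemany("INSERT OR IGNORE INTO user_agents (user_agent) VALUES (?)", [(a,) for a in agents])
conn.commit()
conn.close()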

Function to Retrieve Unscraped URLs

def get_unscraped_urls():
    """
    Retrieve URLs that have not been scraped from the `urls` table in the SQLite database.

    This function connects to the SQLite database using the `DB_PATH` variable and fetches all URLs 
    from the `urls` table where `scraping_status` is set to 0, indicating they have not yet been scraped.
    The results are returned as a list of URLs.

    Returns:
        list: A list of URLs with `scraping_status = 0`. If no unscraped URLs are found, 
        an empty list is returned.
    """
    conn = sqlite3.connect(DB_PATH)
    c = conn.cursor()
    c.execute("SELECT url FROM urls WHERE scraping_status = 0")
    results = c.fetchall()
    conn.close()
    return [row[0] for row in results] if results else []

The get_unscraped_urls function fetches URLs that have not yet been scraped from the urls table in the SQLite database specified by the DB_PATH variable. It is an essential part of the scraping workflow because it lets the program track which URLs still need processing, keeping the process efficient and tidy. When invoked, the function connects to the SQLite database and creates a cursor to execute SQL commands. It then runs a query selecting all URLs whose scraping_status column is 0, meaning they have not been processed yet, fetches the results with fetchall(), and closes the database connection. Finally, it returns a list of unscraped URLs by extracting the first element from each result row, or an empty list if no URLs are found. In this way, the function ensures that every target URL is eventually processed.
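
Likewise, the urls table created in step 1 has only id and url columns, so the scraping_status flag queried here has to be added before step 2 runs. A minimal, hedged sketch:

import sqlite3

DB_PATH = 'john_lewis_webscraping.db'

conn = sqlite3.connect(DB_PATH)
c = conn.cursor()
try:
    # Add the flag with a default of 0 (not yet scraped)
    c.execute("ALTER TABLE urls ADD COLUMN scraping_status INTEGER DEFAULT 0")
except sqlite3.OperationalError:
    pass  # column already exists
conn.commit()
conn.close()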


Function to Scrape Product Data

def scrape_data(url):
    """
    Scrape product data from a given URL with retry functionality and user-agent rotation.

    This function retrieves product details like title, brand, prices, description, specifications, 
    size and fit, ratings, and reviews from an e-commerce page. It attempts up to three retries 
    with random delays between requests to avoid detection. A random user agent is selected from a 
    database for each request to mimic real-user behavior.

    Args:
        url (str): The product page URL to scrape.

    Returns:
        dict: A dictionary containing the scraped data or default values on failure:
              - 'url' (str): Product URL.
              - 'brand' (str): Product brand.
              - 'title' (str): Product title.
              - 'original_price' (str): Original price.
              - 'discount_price' (str): Discounted price.
              - 'description' (str): Product description.
              - 'product_code' (str): Product identifier.
              - 'product_specification' (str): Product specifications.
              - 'size_fit' (str): Size and fit information.
              - 'average_rating' (str): Average rating.
              - 'review_count' (str): Number of reviews.

    Notes:
    - Random delays (2-6 seconds) are used to avoid detection.
    - The function retries up to three times if a request fails.
    - Default values are returned if scraping fails.
    """
    retries = 3  # Number of retries
    for attempt in range(retries):
        try:
            user_agent = get_random_user_agent()
            if not user_agent:
                print("No user agents found in the database.")
                return {**DEFAULT_DATA, 'url': url}
            
            headers = {
                "User-Agent": user_agent,
                "Accept": "application/json, text/plain, */*",
                "Accept-Encoding": "gzip, deflate, br, zstd",
                "Accept-Language": "en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7,ml;q=0.6",
                "Cache-Control": "no-cache",
                "Origin": "https://www.johnlewis.com",
                "Pragma": "no-cache",
                "Priority": "u=1, i",
                "Referer": "https://www.johnlewis.com/",
                "Sec-Ch-Ua": "\"Google Chrome\";v=\"125\", \"Chromium\";v=\"125\", \"Not.A/Brand\";v=\"24\"",
                "Sec-Ch-Ua-Mobile": "?0",
                "Sec-Ch-Ua-Platform": "\"Linux\"",
                "Sec-Fetch-Dest": "empty",
                "Sec-Fetch-Mode": "cors",
                "Sec-Fetch-Site": "same-site",
                "X-Client-Id": "web-pdp-scaffold-ui",
                "X-Correlation-Id": "d7d9e4ab-61a5-4bf1-a0cd-4218fa122758",
            }

            delay = random.uniform(2, 6)
            time.sleep(delay)

            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')

            # Initialize data dictionary
            data = {**DEFAULT_DATA, 'url': url}

            data['brand'], data['title'] = scrape_title_and_brand(soup)
            data['original_price'], data['discount_price'] = scrape_prices(soup)
            data['description'], data['product_code'] = scrape_description_and_code(soup)
            data['product_specification'] = scrape_product_specifications(soup)
            data['size_fit'] = scrape_size_and_fit(soup)
            data['average_rating'], data['review_count'] = scrape_reviews_and_ratings(soup)

            return data
        
        except requests.RequestException as e:
            print(f"Request error for URL: {url} - {e} "
                  f"(attempt {attempt + 1} of {retries})")

        except Exception as e:
            print(f"An error occurred for URL: {url} - {e} "
                  f"(attempt {attempt + 1} of {retries})")

    # All retry attempts failed; return the default data for this URL
    return {**DEFAULT_DATA, 'url': url}

The scrape_data function was created to extract product information from a specified e-commerce URL with the help of retry functionality and user-agent rotation in order to maximize the effectiveness of scraping and reduce the risk of detection. Upon calling the function, it will attempt to fetch different types of product information: title, brand, prices, description, specifications, size and fit, average ratings, and review counts.


When called with a product URL, the function uses a retry mechanism that makes up to three attempts to retrieve the data from the target page. It starts by selecting a random user agent from the database so that each request looks like it comes from a real user. If no user agents are available, the function returns a dictionary of default values that includes the provided URL.


The function builds the HTTP headers for the request using the selected user agent together with headers that closely match what a real browser sends. To avoid being flagged as a bot, it introduces a random delay of between 2 and 6 seconds before every request. It then sends a GET request to the specified URL with a timeout in case the response takes too long. On success, it parses the HTML content with BeautifulSoup and uses helper functions to extract the desired product details. If a request or parsing error occurs, the exception is caught and reported, and the next attempt begins; once all retries are exhausted, the default values are returned.
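
The excerpt does not show the driver loop that ties these functions together. A hedged sketch of how they might be combined, using save_to_database (defined further below) and the scraping_status flag discussed earlier; run_scraper is a name chosen here for illustration:

def run_scraper():
    """Sketch of a step 2 driver loop: scrape each pending URL, save, mark as done."""
    create_tables()
    for url in get_unscraped_urls():
        data = scrape_data(url)
        save_to_database(data)

        # Flag the URL so an interrupted run can resume where it stopped
        conn = sqlite3.connect(DB_PATH)
        conn.execute("UPDATE urls SET scraping_status = 1 WHERE url = ?", (url,))
        conn.commit()
        conn.close()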


Extracting Product Title and Brand

def scrape_title_and_brand(soup):
    """
    Extract the product's title and brand from the given BeautifulSoup object.

    This function locates the product's title using the 'h1' tag with a 
    specific 'data-testid'. It checks for two possible brand-related 
    span elements within the title: one for the brand and one for 
    "Anyday" brand. If a brand is found, it is extracted and removed 
    from the title. If no brand or title is found, 'N/A' is returned 
    for both.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object containing 
        parsed HTML of the product page.

    Returns:
        tuple: A tuple with:
            - brand (str): The product brand, or 'N/A' if not found.
            - title (str): The product title with the brand removed, 
            or 'N/A' if not found.
    """
    title_element = soup.find('h1', {'data-testid': 'product:title'})
    if title_element:
        brand_element = title_element.find('span', 
            {'data-testid': 'product:title:otherBrand'})
        anyday_element = title_element.find('span', 
            {'data-testid': 'product:title:anyday'})
        
        brand = 'N/A'
        if brand_element:
            brand = brand_element.text.strip()
        elif anyday_element:
            brand = anyday_element.text.strip()
        
        full_title = title_element.text.strip()
        title = (full_title.replace(brand, '')
                 .strip() if brand != 'N/A' else full_title)
        return brand, title
    return 'N/A', 'N/A'

The scrape_title_and_brand function fetches the product title and brand from a parsed HTML page represented as a BeautifulSoup object. It starts by locating the title in the HTML structure, looking for the 'h1' tag whose data-testid attribute identifies it as the product title. Within the title element, it then searches for two optional spans that may contain the brand name: one for a general brand and one specifically for the "Anyday" brand. If either brand element is found, the function extracts the brand name, strips any extra white space, and removes the brand from the full title to produce a clean title. If neither the title nor the brand is found, the function returns 'N/A' for both. The result is organized, easy-to-use data with a clear separation between the product name and its brand.


Scraping Product Prices

def scrape_prices(soup):
    """
    Scrape the original and discount prices of the product 
    from the BeautifulSoup object.

    This function searches for the product's price information 
    within the HTML content. It looks for elements that indicate 
    the original and discounted prices and returns both. If a 
    discount price is not found, it assumes the current price 
    is the only price. If no prices are found, it returns 'N/A' 
    for both values.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object containing 
        the parsed HTML content.

    Returns:
        tuple: A tuple containing:
            - original_price (str): The original price of the 
            product, or 'N/A' if not found.
            - discount_price (str): The discount price of the 
            product, or 'N/A' if not found.
    """
    price_element = soup.find('p', {'data-test': 'price'})
    if price_element:
        price_prev_elements = price_element.find_all('span', 
            {'data-test': 'price-prev'})
        price_now_element = price_element.find('span', 
            {'data-test': 'price-now'})
        
        if price_now_element and price_prev_elements:
            original_price = price_prev_elements[0].text.strip()
            discount_price = price_now_element.text.strip()
        elif price_now_element:
            original_price = discount_price = price_now_element.text.strip()
        else:
            # Neither price span found; avoid referencing unbound variables
            original_price = discount_price = 'N/A'

        return original_price, discount_price
    return 'N/A', 'N/A'

The scrape_prices function extracts the original and discounted prices of a product from the HTML page represented by a BeautifulSoup object. It begins by searching for the paragraph element carrying the price information, identified by the data-test attribute set to 'price'. Inside this element, the function looks for two kinds of price spans: one for the original price (data-test set to 'price-prev') and one for the current price (data-test set to 'price-now').


If both the original and current prices are present, the function retrieves and cleans both values. If only the current price exists, it is used as both the original and discounted price, meaning there is no discount. If no price information is found at all, the function returns 'N/A' for both prices. This keeps the scraped price data accurate and structured, giving clear information about the product's pricing for further analysis.


Scraping Product Description and Code

def scrape_description_and_code(soup):
    """
    Scrape the product description and product code from the 
    BeautifulSoup object.

    This function retrieves the product's description and 
    product code from the HTML content. If the description 
    or product code is not available, it returns 'N/A' for 
    the missing fields.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object 
        containing the parsed HTML content.

    Returns:
        tuple: A tuple containing:
            - description (str): The description of the 
            product, or 'N/A' if not found.
            - product_code (str): The product code, or 'N/A' 
            if not found.
    """
    # Locate the product code
    code_element = soup.find(
        'p', {'data-testid': 'description:code'})
    
    product_code = (
        code_element.text.strip().replace("Product code: ", 
        "") if code_element else 'N/A')
    
    # Locate the description
    description_element = soup.find(
        'div', {'data-testid': 'description:content'})
    
    description = (
        description_element.text.strip() if description_element 
        else 'N/A')
    
    return description, product_code

The scrape_description_and_code function is made to fetch both product description and product code from a webpage by using a BeautifulSoup object that represents the parsed HTML content.


Finding the Product Code: The function looks for the paragraph element that carries the product code, identified by the data-testid attribute set to 'description:code'. If this element is present, it retrieves the text and strips the "Product code: " prefix; otherwise, it defaults to 'N/A' for the product code.


Extracting the Product Description: The function searches for a div element with data-testid set to 'description:content'. If it is present, the function fetches and cleans the text; if not, it assigns 'N/A' to the description.


Finally, the function returns a tuple holding the extracted description and product code, with any missing information clearly marked as 'N/A'. This structure helps keep the scraped data clear and consistent.


Function to Scrape Product Specifications

def scrape_product_specifications(soup):
    """
    Scrape the product specifications from the BeautifulSoup object.

    This function extracts the product specifications from 
    the HTML content in the form of a list of label-value 
    pairs (e.g., "Color: Red"). If no specifications are 
    found, it returns 'N/A'.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object 
        containing the parsed HTML content.

    Returns:
        str: A string containing the product specifications, 
        with each specification separated by a vertical bar 
        (' | '). Returns 'N/A' if no specifications are found.
    """
    # Locate the specification list
    specs_elements = soup.find(
        'dl', {'data-testid': 'product:specification:list'})
    
    if not specs_elements:
        return 'N/A'
    
    # Extract specifications as a list of strings
    specs = []
    for item in specs_elements.find_all(
        'div', 
        class_='ProductSpecificationAccordion_productSpecificationListItem__azQDX'):
        
        label = item.find(
            'dd', {'data-testid': 'product:specification:list:label'})
        
        value = item.find(
            'dt', {'data-testid': 'product:specification:list:value'})
        
        if label and value:
            specs.append(
                f"{label.text.strip()}: {value.text.strip()}"
            )
    
    return ' | '.join(specs) if specs else 'N/A'

The scrape_product_specifications function extracts product specifications from a BeautifulSoup object representing the HTML. It first locates the specification section via its data-testid; if it is not found, the function returns 'N/A'. If it is found, the function iterates through every item in the specification list, collecting the label and value of each specification from its dd and dt tags. These label-value pairs are formatted as "Label: Value" and gathered into a list, which is then joined into a single string separated by vertical bars (' | '), or 'N/A' is returned if no specifications were extracted. The result is a clear, structured representation of the product specifications.


Function to Scrape Size and Fit Information

def scrape_size_and_fit(soup):
    """
    Extract size and fit information from the given BeautifulSoup object.

    This function searches the product page's HTML, parsed by 
    BeautifulSoup, to find the "Size and Fit" section. It locates 
    all relevant details such as size and fit descriptions, 
    formatting them as a string. If no size and fit information 
    is found, 'N/A' is returned.

    The function searches for a <details> tag with a specific 
    'data-testid' attribute corresponding to the size and fit 
    section. Each size and fit attribute is extracted from <dt> 
    and <dd> elements within the section, which correspond to 
    labels and their respective values. The extracted information 
    is returned as a formatted string.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object containing 
        parsed HTML content.

    Returns:
        str: A formatted string with the size and fit details, 
        or 'N/A' if no information is found. Each size and fit 
        attribute is displayed as 'label: value'.
    """
    size_fit_details = []

    # Locate the size and fit section
    size_fit_element = soup.find(
        'details', {'data-testid': 'accordion:size-and-fit'})
    
    if size_fit_element:
        # Find all items in the size and fit list
        items = size_fit_element.find_all(
            'div', 
            class_='ProductSizeAndFitAccordion_listItem__yc2wV')
        
        for item in items:
            label_element = item.find(
                'dt', {'data-testid': 'product:size-and-fit:list:label'})
            
            value_element = item.find(
                'dd', {'data-testid': 'product:size-and-fit:list:value'})
            
            if label_element and value_element:
                label = label_element.text.strip().replace(" - ", "")  # Clean up the label text
                value = value_element.text.strip()
                size_fit_details.append(f"{label}: {value}")

    return '\n'.join(size_fit_details) if size_fit_details else 'N/A'

The scrape_size_and_fit function extracts size and fit information from a BeautifulSoup object containing the parsed HTML of a product page. It locates the details tag whose data-testid identifies the size and fit section. Within this section, it looks for the <dt> and <dd> elements representing the labels and their corresponding values, and formats the extracted information as 'label: value' strings joined by newlines. If no size and fit information is found, the function returns 'N/A'. This gives users a clear view of the product's sizing and fit, helping them make informed purchase decisions.


Function to Scrape Reviews and Ratings

def scrape_reviews_and_ratings(soup):
    """
    Extract the average rating and review count from the product page.

    Finds the rating and review count from the HTML using the provided BeautifulSoup object. 
    The rating is extracted from the 'title' attribute, while the review count is located 
    within a button element. Returns 'N/A' if either is not found.

    Args:
        soup (BeautifulSoup): Parsed HTML content of the product page.

    Returns:
        tuple: (average_rating, review_count) as strings, or 'N/A' if unavailable.
    """
    # Locate the rating element using the class attribute
    rating_element = soup.find('a', class_='PriceAndReviews_ratings__GmUan')
    
    if rating_element:
        # Extract the rating from the title attribute
        average_rating = rating_element['title'].split(" ")[0]  # "4.4 out of 5 stars" -> "4.4"
        
        # Locate the review count element (button text)
        review_count_element = rating_element.find('button')
        review_count = (review_count_element.text.strip() 
                        if review_count_element 
                        else 'N/A')
    else:
        average_rating = 'N/A'
        review_count = 'N/A'
    
    return average_rating, review_count

The scrape_reviews_and_ratings function extracts the average product rating and review count from a product page, accepting a BeautifulSoup object that represents the parsed HTML content. It locates the rating element by its class attribute and reads the average rating from the element's 'title' attribute, isolating the numerical value. It then looks for a button element inside the rating section to extract the review count. If the rating element is missing, both values are returned as 'N/A'; if only the review count button is missing, just the review count falls back to 'N/A'. This function retrieves key metrics about product feedback, helping users judge a product based on customer experiences and satisfaction.
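
The same kind of quick check works here; the fragment below is a hypothetical snippet shaped to match the class attribute and title format the function expects.

from bs4 import BeautifulSoup

# Hypothetical fragment matching the selectors in scrape_reviews_and_ratings.
sample_html = """
<a class="PriceAndReviews_ratings__GmUan" title="4.4 out of 5 stars">
  <button>128 Reviews</button>
</a>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
rating, count = scrape_reviews_and_ratings(soup)
print(rating, count)  # 4.4 128 Reviews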


Function to Save Scraped Product Data to SQLite Database

def save_to_database(data):
    """
    Save the scraped product data to the SQLite database.

    This function connects to the SQLite database specified by the `DB_PATH`, 
    and inserts the scraped product details into the `scraped_data` table. The 
    function includes product attributes such as the URL, brand, title, original 
    price, discounted price, product code, description, product specifications, 
    size and fit, average rating, review count, and a scraping status indicating 
    successful data capture.

    Args:
        data (dict): A dictionary containing the product data to be saved, 
                     with keys corresponding to the following fields in the 
                     database:
                     - 'url' (str): The product's URL.
                     - 'brand' (str): The product's brand.
                     - 'title' (str): The product's title.
                     - 'original_price' (str): The original price of the product.
                     - 'discount_price' (str): The discounted price of the product.
                     - 'product_code' (str): The product's unique identifier.
                     - 'description' (str): A brief description of the product.
                     - 'product_specification' (str): The product's specifications.
                     - 'size_fit' (str): Information on the product's size and fit.
                     - 'average_rating' (str): The average user rating.
                     - 'review_count' (str): The total number of reviews.
    
    """
    conn = sqlite3.connect(DB_PATH)
    c = conn.cursor()
    
    # Insert data into scraped_data table
    c.execute('''
        INSERT INTO scraped_data (url, brand, title, original_price, 
                                   discount_price, product_code, 
                                   description, product_specification, 
                                   size_fit, average_rating, review_count)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    ''', (data['url'], data['brand'], data['title'], 
          data['original_price'], data['discount_price'], 
          data['product_code'], data['description'], 
          data['product_specification'], data['size_fit'], 
          data['average_rating'], data['review_count']))
    
    conn.commit()
    conn.close()

The save_to_database function saves the scraped product data to the SQLite database. It opens a connection to the database specified by DB_PATH and inserts the product details into the scraped_data table. The dictionary passed in contains attributes such as the product URL, brand, title, original and discounted prices, product code, description, specifications, size and fit, average rating, and review count. The function runs a parameterized SQL INSERT statement to write this data, commits the transaction, and closes the connection. This ensures the scraped data is stored persistently in a structured format for later retrieval and analysis.
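
To show the shape of the dictionary the function expects, here is a hypothetical record; every value is invented sample data, and in the real run scrape_data() builds this dictionary from the parsed product page (create_tables() must have been run first so the scraped_data table exists).

# Hypothetical example record with the keys save_to_database expects.
sample_record = {
    'url': 'https://www.johnlewis.com/example-product/p0000000',
    'brand': 'Example Brand',
    'title': 'Example Midi Dress',
    'original_price': '£89.00',
    'discount_price': '£59.00',
    'product_code': '00000000',
    'description': 'A sample product description.',
    'product_specification': 'Material: 100% Cotton | Care: Machine washable',
    'size_fit': 'Model height: 175cm',
    'average_rating': '4.4',
    'review_count': '128 Reviews',
}

save_to_database(sample_record)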


Function to Mark URL as Scraped in the Database

def mark_url_scraped(url):
    """
    Mark a specific URL as scraped in the database.

    This function updates the `scraping_status` field of a given URL 
    in the `urls` table of the SQLite database. Setting the scraping 
    status to 1 indicates that the URL has been processed and the 
    associated data has been scraped.

    Args:
        url (str): The URL to mark as scraped, which must already 
                    exist in the `urls` table.

    Raises:
        sqlite3.Error: If an error occurs while connecting to the 
                        database or executing the update statement.

    """
    conn = sqlite3.connect(DB_PATH)
    c = conn.cursor()
    c.execute("UPDATE urls SET scraping_status = 1 WHERE url = ?", 
              (url,))
    conn.commit()
    conn.close()

The mark_url_scraped function updates the scraping status of a given URL in the SQLite database. It connects to the database at DB_PATH and sets the scraping_status field of the matching row in the urls table to 1, indicating that the URL has been processed and its data scraped. The URL passed as an argument is expected to already exist in the urls table. The change is committed and the connection is closed once the update completes. This function makes it possible to track which URLs have already been scraped and to resume the job without repeating work.
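
As an optional helper (not part of the original script), a quick query against the same table shows how many URLs are still pending, assuming unscraped rows default to a scraping_status of 0.

import sqlite3

# Count how many URLs in the urls table have not been scraped yet.
conn = sqlite3.connect(DB_PATH)
c = conn.cursor()
c.execute("SELECT COUNT(*) FROM urls WHERE scraping_status = 0")
remaining = c.fetchone()[0]
conn.close()
print(f"URLs left to scrape: {remaining}")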

Main Function to Orchestrate the Web Scraping Process

def main():
    """
    Main function to run the web scraping process.

    This function orchestrates the entire scraping workflow. It initializes
    the process by creating necessary database tables, retrieves a list of
    URLs that have not yet been scraped, and iterates over each URL to 
    perform the following steps:
    
    1. Scrapes data from the URL.
    2. Saves the scraped data to the SQLite database.
    3. Marks the URL as scraped in the database.

    The function prints relevant status messages to the console to 
    provide feedback on the scraping progress.

    Raises:
        Exception: If any error occurs during data scraping or database 
                    operations, it may halt the process. Specific error 
                    handling can be implemented as needed.

    """
    create_tables()
    urls = get_unscraped_urls()
    
    for url in urls:
        print(f"Scraping URL: {url}")
        data = scrape_data(url)
        save_to_database(data)
        mark_url_scraped(url)
        print(f"Data saved for URL: {url}")

The main function is the central controller for the entire web scraping process. It begins by calling create_tables(), which creates the database tables required for storing scraped data, and then calls get_unscraped_urls() to retrieve the list of URLs that have not yet been processed. For each URL in that list, it runs the same sequence: scrape_data(url) extracts the product details from the page, save_to_database(data) inserts that data into the SQLite database, and mark_url_scraped(url) updates the database to mark the URL as successfully scraped. The function also prints status messages to the console, making it easy to follow the scraping progress in real time. Any error raised during scraping or a database operation can halt the loop, so specific error handling can be added as needed. Overall, the function manages and automates the complete scraping cycle, keeping URL handling and data storage consistent.
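
The docstring notes that an unhandled exception will stop the run, so one simple variation (a sketch, not part of the original script) wraps each URL in its own try/except so that a single failing page is reported and skipped instead of halting the whole job.

def main_with_error_handling():
    """Variant of main() that skips and reports failing URLs."""
    create_tables()
    for url in get_unscraped_urls():
        try:
            print(f"Scraping URL: {url}")
            data = scrape_data(url)
            save_to_database(data)
            mark_url_scraped(url)
            print(f"Data saved for URL: {url}")
        except Exception as error:
            # Leave scraping_status untouched so the URL is retried next run.
            print(f"Failed to scrape {url}: {error}")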

Script Entry Point: Initializing Web Scraping

if __name__ == "__main__":
    """
    Entry point for the script.

    This block checks if the script is being run as the main program
    and, if so, calls the main function to initiate the web scraping
    process.
    """
    main()

The if __name__ == "__main__": block serves as the entry point for the script. It checks whether the script is being executed as the main program rather than being imported as a module into another script. If the script is run directly, this block calls the main() function, initiating the entire web scraping process. This structure is a common Python practice for distinguishing between running a script directly and importing its functions into other modules, and it guarantees that the scraping workflow is only triggered when the script is executed intentionally.


Libraries and Versions


This code utilizes several key libraries to perform web scraping and data processing. The versions used in this project are BeautifulSoup4 (v4.12.3) for parsing HTML content and Requests (v2.32.3) for making HTTP requests; SQLite3 ships with Python's standard library, so it needs no separate installation. These versions ensure smooth integration and functionality throughout the scraping workflow.
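
To reproduce the same environment, the two third-party packages can be pinned at install time, for example with pip install requests==2.32.3 beautifulsoup4==4.12.3 (assuming a pip-based setup).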

Connect with Datahut for top-notch web scraping services that bring you the information you need, hassle-free.


FAQ SECTION


1. What is web scraping, and how is it relevant to fashion trends?

Answer: Web scraping is the process of extracting data from websites to analyze and gain insights. In the fashion industry, web scraping helps gather data on pricing, promotions, product availability, and customer reviews. For John Lewis fashion, it can uncover key trends, track competitor strategies, and inform better inventory and marketing decisions.


2. Can I scrape John Lewis fashion data without technical expertise?

Answer: Yes! With our web scraping services, you don't need technical expertise. We handle the entire process—from data extraction to delivering clean, actionable datasets tailored to your requirements. Our team ensures data is gathered ethically and complies with legal standards.


3. What kind of insights can I gain from John Lewis fashion data?

Answer: By scraping John Lewis fashion data, you can gain insights into product pricing, seasonal promotions, top-selling categories, customer preferences, and emerging trends. This information can help optimize your pricing strategies, improve inventory planning, and stay ahead in the competitive fashion market.


4. How do you ensure data accuracy and compliance while scraping?

Answer: We prioritize data accuracy by using advanced tools and techniques to minimize errors during the scraping process. Our compliance practices include adhering to the website’s terms of service and applicable data protection regulations to ensure ethical and lawful data extraction.



