
How to Automate Ray-Ban Product Discovery: A Web Scraping Approach

  • Writer: Shahana farvin
  • Apr 23
  • 30 min read



Introduction


Ever been curious about how programmers extract large amounts of information from websites without manually copying and pasting? The answer lies in web scraping, a technique for automating web data extraction. Instead of copying product descriptions or prices by hand, programmers write code that issues HTTP requests, retrieves the HTML, and parses it to pull out particular data points. With the aid of specialised libraries and tools, a scraper can navigate web pages, handle HTTP responses, and parse HTML or XML, making it a practical way to harvest structured data. Whether the goal is tracking price movements, extracting product information, or collecting contact details, scraping saves hundreds of hours compared with doing the same work manually. However, web scraping is not only about programming: it also requires understanding a website's structure, its robots.txt file, and its terms of service so that scraping is done legally and ethically.


Ray-Ban is an iconic eyewear brand that sells timeless, high-quality sunglasses and optical frames. Founded in 1936, the company first designed anti-glare glasses for U.S. Army Air Corps pilots. Over the years, Ray-Ban has become a cultural phenomenon: models like the Wayfarer and the Aviator are recognised everywhere. Today, Ray-Ban is a global brand offering a wide range of sunglasses, eyeglasses, and prescription lenses for men, women, and children, with styles spanning classic to contemporary. Scraping Ray-Ban's website gives a clear picture of what the brand offers, how it prices its products, and how it positions itself in the market.


The scraping process is divided into two main phases, each implemented in a separate Python script:

  • URL Collection: The first script visits Ray-Ban's category pages (for example, sunglasses or eyeglasses) and collects product URLs. It uses Playwright to automate the browser interaction, handles pop-ups, and scrolls through each page to make sure every product is fetched. The collected URLs, together with their category and target gender, are persisted in a SQLite database.

  • Product Data Extraction: The second script retrieves the stored URLs from the database and visits each product page to extract more detailed information. It combines Playwright for browser automation with Beautiful Soup for parsing the HTML, extracting data such as product name, collection, color options, model code, frame description, pricing, and discounts. All of this detailed product information is saved back into the SQLite database.


Tools and Technologies


This Ray-Ban web scraping project relies on three key technologies: Playwright, Beautiful Soup, and SQLite. Each one plays an integral role in the data collection and storage process.


Playwright is a powerful web automation library that lets the scraper control web browsers programmatically. It is like an invisible hand that opens a browser, types in URLs, clicks buttons, and scrolls through pages, all much faster than any human could. In this project, Playwright is essential for navigating the Ray-Ban website: it handles dynamic content, interacts with elements such as pop-ups and "Load More" buttons, and can even simulate different devices or user agents to reduce the chance of being flagged as a bot. Its ability to wait for specific elements to load is especially handy when dealing with dynamic product listings like Ray-Ban's.
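As a flavour of what Playwright looks like in practice, here is a minimal sketch that opens a page, waits for content, and returns the HTML. The URL and the selector are placeholders for illustration, not the exact ones used in the scripts below.

import asyncio
from playwright.async_api import async_playwright

async def fetch_page_html(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Wait for the DOM, then for at least one link to appear (placeholder selector)
        await page.goto(url, wait_until='domcontentloaded')
        await page.wait_for_selector('a', timeout=10000)
        html = await page.content()
        await browser.close()
        return html

# Example usage with a placeholder URL:
# html = asyncio.run(fetch_page_html('https://example.com'))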


Once Playwright has loaded a page, Beautiful Soup goes to work. Beautiful Soup is a Python library for parsing HTML. If you think of a web page as a tree-like structure, Beautiful Soup helps you climb that tree to find exactly the information you need. In this Ray-Ban scraper, it is used to extract specific data elements, such as product names, prices, and color options, from the HTML of each product page. It's like having a smart assistant that can quickly read through a complex document and highlight all the important information for you.
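For example, given a small made-up HTML fragment (the class names and values below are illustrative, not Ray-Ban's actual markup), Beautiful Soup can pull out the interesting pieces with CSS selectors:

from bs4 import BeautifulSoup

# A made-up fragment standing in for a product page
html = '''
<div class="product">
  <h1 class="product-name">Aviator Classic</h1>
  <span class="price">$163.00</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
name = soup.select_one('h1.product-name').get_text(strip=True)
price = soup.select_one('span.price').get_text(strip=True)
print(name, price)  # Aviator Classic $163.00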


The third key technology in this project is SQLite, a lightweight, file-based database system. Think of it as a well-organised filing cabinet built into your program. In this scraper, SQLite serves two purposes. First, it stores the URLs of all products found during the initial scan, which lets the scraper remember which pages it still needs to visit even if the program is stopped and restarted. Second, it stores the detailed information extracted from each product page, building up a structured, queryable collection of data on every Ray-Ban product under consideration. What's nice about SQLite is that it is easy to set up and use, yet powerful enough to manage the large amounts of data a web scraper produces.
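A minimal sketch of that pattern, with illustrative file, table, and column names: create a table once, insert rows, and query them back.

import sqlite3

conn = sqlite3.connect('example.db')  # creates the file if it does not exist
cursor = conn.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS products (url TEXT, price TEXT)')
cursor.execute('INSERT INTO products VALUES (?, ?)', ('https://example.com/p/1', '$163.00'))
conn.commit()

for row in cursor.execute('SELECT url, price FROM products'):
    print(row)
conn.close()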


Together these three tools form the backbone of the Ray-Ban scraping project: Playwright handles interaction with the web, Beautiful Soup extracts the relevant data, and SQLite provides robust storage. Combined, they allow efficient, structured, and reliable collection of data from the Ray-Ban website, producing a valuable dataset for further analysis or use.


Data Cleaning and Refinement


After raw data has been extracted by web scraping, it usually needs to be cleaned and refined. This step ensures the collected data is reliable and consistent before it is analysed or used elsewhere. A tool such as OpenRefine works well for hands-on cleaning of messy data, while pandas is the usual choice when working programmatically in Python.


Typical tasks at this stage include standardising formats, handling missing values, removing duplicates, and checking the collected data for consistency.
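As a rough sketch of how the pandas route might look, here the scraped rows are loaded from the data table that the second script creates later in this project, duplicates are dropped, and the price columns (mrp and sale_price) are turned into numbers; the exact steps will depend on the data you actually collect.

import sqlite3
import pandas as pd

# Load the scraped table into a DataFrame (table and column names as used later in this project)
conn = sqlite3.connect('rayban_products.db')
df = pd.read_sql_query('SELECT * FROM data', conn)
conn.close()

# Remove exact duplicates and turn price strings like "$163.00" into numbers
df = df.drop_duplicates()
for col in ('mrp', 'sale_price'):
    df[col] = pd.to_numeric(
        df[col].astype(str).str.replace(r'[^\d.]', '', regex=True),
        errors='coerce',
    )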


The cleaned data can then be used for meaningful market analysis, product comparisons, or integration into other systems.


Combining effective web scraping with careful data cleaning results in a comprehensive and accurate dataset of Ray-Ban products, offering insight into the brand's eyewear offerings and pricing.


URL Collection


This Python script collects product URLs systematically from the Ray-Ban website, much like a digital shopping assistant visiting the various sections of Ray-Ban's online store. It uses Playwright to automate the browser and navigate through the product categories (sunglasses, eyeglasses, and smart glasses) for different demographics such as men, women, and children. It behaves like a normal visitor by choosing random user agents and browsing naturally. For each category it handles pop-up windows, scrolls down the page, clicks "Load More" buttons to reveal additional products, and extracts the product URLs. The results are compiled in a SQLite database, producing a comprehensive catalogue of Ray-Ban products that can be accessed and analysed later. The script is written with the website's behaviour in mind, including error handling for common issues such as timeouts and delays between actions to avoid overwhelming the server.


Import Section

import sqlite3
import asyncio
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
from bs4 import BeautifulSoup
import random
import sys

The import section brings in all the tools needed for this web scraping project. sqlite3 handles the database operations for storing the scraped data, asyncio enables the asynchronous programming that makes the scraping process more efficient, Playwright is the main web automation tool that controls the browser, and BeautifulSoup parses the HTML content. random is used to add delays and pick user agents, and sys takes care of system-level operations such as exiting the program when a required file is missing.


load_user_agents(file_path='user_agents.txt')

def load_user_agents(file_path='user_agents.txt'):
   """
   Loads user agents from a text file.

   Args:
       file_path (str): Path to the text file containing user agents. Defaults to 'user_agents.txt'.

   Returns:
       list: A list of user agent strings.

   Raises:
       SystemExit: If the file is not found or is empty.
   """
   try:
       with open(file_path, 'r') as file:
           user_agents = [line.strip() for line in file if line.strip()]
       if not user_agents:
           print(f"Error: The file {file_path} is empty.")
           sys.exit(1)
       return user_agents
   except FileNotFoundError:
       print(f"Error: User agent file not found: {file_path}")
       sys.exit(1)

This function opens and reads a text file full of different browser user agents, which are text strings telling websites what kind of browser and system is trying to access it. Imagine it like having a collection of different disguises for your scraper. The function uses a simple file reading operation to get these user agents and puts them into a list that the scraper can use later.


If something goes wrong, for instance if the file cannot be found or is empty, the function notifies you and halts the script. Without user agents, websites could easily tell that a scraper rather than a normal browser is making the requests, so this early failure point matters. The whole function is built to fail early, which is far better than running into issues in the middle of the actual scraping work.


initialize_database(db_name='rayban_products.db')

def initialize_database(db_name='rayban_products.db'):
   """
   Initializes the SQLite database by creating a connection and setting up the table structure.
  
   The function connects to the SQLite database with the name provided in the `db_name` argument.
   If the database does not exist, it will be created. The function then creates a table named
   'urls' with columns for storing product URLs, categories, and gender information.
   If the table already exists, it will not be recreated, ensuring that existing data is preserved.

   Args:
       db_name (str): The name of the SQLite database file. Defaults to 'rayban_products.db'.

   """
   conn = sqlite3.connect(db_name)
   cursor = conn.cursor()
  
   cursor.execute('''
       CREATE TABLE IF NOT EXISTS urls (
           url TEXT NOT NULL,
           category TEXT NOT NULL,
           gender TEXT NOT NULL
       )
   ''')
   conn.commit()
   conn.close()

Consider this function as setting up a filing cabinet for all the product information you're going to collect. It creates the SQLite database file (or simply opens it if it already exists) and ensures there is a table with the proper structure to store product URLs, categories, and gender information. The table is called 'urls' and works much like a spreadsheet with three columns: the product's web address, what kind of product it is (sunglasses or eyeglasses), and who it's made for (men, women, kids, etc.).


The nice thing about this function is that it uses "IF NOT EXISTS" in its SQL command, so running the script multiple times won't accidentally erase existing data, much like checking whether you already have a filing cabinet before building another one. Once everything is in place, it closes the database connection, like locking up the filing cabinet when you are done.


save_to_database(url,category,gender,db_name='rayban_products.db')

def save_to_database(url, category, gender, db_name='rayban_products.db'):
   """
   Saves the product URL, category, and gender information to the SQLite database.

   The function establishes a connection to the SQLite database specified by the `db_name` argument. It then inserts a new record into the 'urls' table, storing the provided `url`, `category`, and `gender`. The connection to the database is closed after the operation to ensure that resources are properly released.

   Args:
       url (str): The URL of the product to be saved.
       category (str): The category of the product (e.g., 'sunglasses', 'eyeglasses').
       gender (str): The target gender for the product (e.g., 'Men', 'Women').
       db_name (str): The name of the SQLite database file. Defaults to 'rayban_products.db'.

   """
   conn = sqlite3.connect(db_name)
   cursor = conn.cursor()
  
   cursor.execute('''
       INSERT INTO urls (url, category, gender)
       VALUES (?, ?, ?)
   ''', (url, category, gender))
  
   conn.commit()
   conn.close()

This function is what actually saves the information to your database. Every time the scraper encounters a product page, it opens a connection to the database, creates a new record with the product's URL, its category, and who it's designed for, and then safely closes the connection again. It's a bit like having a secretary who knows exactly where to file each piece of information.


The function is minimalist and task-oriented: it does one thing, save data, and it does it well. It runs a SQL INSERT command, commits the change, and then closes the database connection to make sure everything is cleaned up and in order.


close_popup(page)

async def close_popup(page):
   """
   Attempts to close any popup that appears when first accessing the site.
  
   Args:
       page: Playwright page object
   """
   try:
       # Wait a moment for any popups to appear
       await page.wait_for_timeout(2000)
      
       # Try multiple possible selectors for the close button
       close_button_selectors = [
           'button.close-button',           # Common class name for close buttons
           'button[aria-label="Close"]',    # Accessibility label
           '.modal-close',                  # Common modal close class
           '.popup-close',                  # Common popup close class
           'button.dismiss-button',         # Common dismiss button class
           '[data-testid="close-button"]',  # Test ID
           '//button[contains(@class, "close")]'  # XPath for buttons containing 'close' in class
       ]
      
       for selector in close_button_selectors:
           try:
               # Playwright treats selectors starting with '//' as XPath automatically,
               # so CSS and XPath selectors can be handled the same way
               await page.wait_for_selector(selector, timeout=2000, state='visible')
               await page.click(selector)
               print("Successfully closed popup")
               return
           except:
               continue

       # If no close button found, try pressing Escape key
       await page.keyboard.press('Escape')
       print("Attempted to close popup with Escape key")
      
   except Exception as e:
       print(f"Error handling popup: {e}")

This function is your popup-fighting hero. E-commerce websites often show pop-ups asking you to subscribe to a newsletter or offering a discount, and these can seriously interfere with a scraper. This function closes them, trying several different ways to dismiss the unwanted interruption, like a security guard who knows every polite way to show a gatecrasher the door.


The function waits a short time to let any pop-up appear, then tries several methods to locate and click close, dismiss, or X buttons. If it cannot find a button to click, it falls back to pressing the Escape key. It is a particularly resilient function because it doesn't give up when one method fails; it keeps trying different approaches until it either succeeds or has exhausted every option.


scrape_rayban_product_urls(category_url, category, gender)

async def scrape_rayban_product_urls(category_url, category, gender):
   """
   Asynchronously scrapes product URLs from the Ray-Ban website for a specific category and gender.

   This function performs several key operations:
   1. Launches a Chromium browser instance with a random user agent
   2. Navigates to the specified category URL
   3. Handles any popup dialogs that appear
   4. Scrolls through the page to load all products (handles lazy loading)
   5. Clicks "Load More" button if present to reveal additional products
   6. Extracts product URLs from the loaded page
   7. Saves the collected URLs to a SQLite database

   Args:
       category_url (str): The full URL of the Ray-Ban category page to scrape
           (e.g., 'https://www.ray-ban.com/usa/sunglasses/men-s')
       category (str): The product category identifier
           (e.g., 'sunglasses', 'eyeglasses', 'smart-glasses')
       gender (str): The target gender/age group for the products
           (e.g., 'Men', 'Women', 'Kids', 'Unisex')

   Raises:
       PlaywrightTimeoutError: If page loading or element selection timeouts occur
       Exception: For any other errors during the scraping process

   Notes:
       - Uses Playwright for browser automation
       - Implements scrolling to handle lazy-loaded content
       - Includes random delays between actions to mimic human behavior
       - Saves results directly to a SQLite database
       - Closes browser resources properly even if errors occur
   """
   base_url = "https://www.ray-ban.com"
   user_agents = load_user_agents()
   async with async_playwright() as p:
       browser = await p.chromium.launch(headless=False)
       user_agent = random.choice(user_agents)
       context = await browser.new_context(user_agent=user_agent)
       page = await context.new_page()

       try:
           # Navigate to the page
           await page.goto(category_url, wait_until="domcontentloaded", timeout=120000)
          
           # Try to close any popup that appears
           await close_popup(page)
          
           # Continue with the rest of the scraping process
           last_height = await page.evaluate('document.body.scrollHeight')
           while True:
               await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
               await page.wait_for_timeout(2000)
               new_height = await page.evaluate('document.body.scrollHeight')
               if new_height == last_height:
                   break
               last_height = new_height

           await page.wait_for_selector('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div > div.rb-load-more', timeout=60000)

           while True:
               try:
                   load_more_button = await page.query_selector('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div > div.rb-load-more > button')
                   if load_more_button:
                       await load_more_button.click()
                       await page.wait_for_timeout(random.randint(3000, 5000))
                   else:
                       break
               except PlaywrightTimeoutError:
                   print("Timeout error occurred while waiting for 'Load More Products' button.")
                   break

           html_content = await page.content()
           soup = BeautifulSoup(html_content, 'html.parser')
           product_elements = soup.select('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div > div.rb-products.grid > a')
           product_urls = [base_url + element['href'] for element in product_elements]
          
           for url in product_urls:
               save_to_database(url, category, gender)

           print(f"Scraped {len(product_urls)} products for {category} - {gender}")
      
       except PlaywrightTimeoutError:
           print(f"Page.goto timeout exceeded while trying to load {category_url}")
       except Exception as e:
           print(f"An error occurred: {e}")
       finally:
           await context.close()
           await browser.close()

This is the workhorse of the scraping operation: a professional shopper who knows exactly how to navigate Ray-Ban's website and gather product information. The function launches a browser through Playwright with a randomly selected user agent (one of the disguises discussed earlier), navigates to the specified category page, and then begins gathering product URLs.


Along the way the function performs several tasks: it attempts to close any pop-ups that appear, scrolls the full page so that all products load (many pages use lazy loading and only show the first products until you scroll), clicks "Load More" buttons where they exist, and extracts every product URL it finds. It also waits for the actual content to load and adds random delays between actions to make the browsing pattern look more human.


It includes error handling for common problems such as timeouts and network failures, and if something does go wrong, it still closes the browser and cleans up after itself. All the URLs it collects are saved to the database using the save_to_database function discussed earlier.


main()

async def main():
   """
   Main function to initialise the database and run the scraping tasks asynchronously.

   This function initialises the SQLite database by calling `initialize_database()`, ensuring that the necessary table structure is in place. It then sequentially runs the `scrape_rayban_product_urls()` function for various product categories and genders, scraping and storing product URLs from each category on the Ray-Ban website. The asynchronous nature of the function allows for efficient handling of multiple web scraping tasks.
   """
   initialize_database()

   # usage with different categories and genders
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/sunglasses/men-s', 'sunglasses', 'Men')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/sunglasses/women-s', 'sunglasses', 'Women')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/sunglasses/toddlers', 'sunglasses', 'Toddlers')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/sunglasses/little-kids', 'sunglasses', 'Little-Kids')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/sunglasses/kids', 'sunglasses', 'Kids')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/sunglasses/teenager', 'sunglasses', 'Teenagers')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/eyeglasses/men-s', 'eyeglasses', 'Men')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/eyeglasses/women-s', 'eyeglasses', 'Women')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/eyeglasses/toddlers', 'eyeglasses', 'Toddlers')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/eyeglasses/little-kids', 'eyeglasses', 'Little-Kids')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/eyeglasses/kids', 'eyeglasses', 'Kids')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/eyeglasses/teenager', 'eyeglasses', 'Teenagers')
   await scrape_rayban_product_urls('https://www.ray-ban.com/usa/ray-ban-meta-smart-glasses', 'smart-glasses', 'Unisex')
  
   print('Product URLs have been scraped and saved to the database.')

if __name__ == '__main__':
   asyncio.run(main())

The main function can be compared to an orchestra conductor: it coordinates all the other functions and makes them work together. It begins by calling initialize_database to make sure the database is ready, and then runs a series of scraping operations across the different categories and genders of Ray-Ban products.


It calls the scraping function for each combination of product category (sunglasses, eyeglasses, smart glasses) and target demographic (men, women, kids, and so on). Although the function uses async/await, each call is awaited before the next one starts, so the categories are scraped one after another rather than in parallel; the asynchronous structure mainly keeps the browser automation responsive and leaves room for concurrency later. When everything is done, it prints a message to let you know the scraping is complete.
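If true concurrency were ever wanted, the sequential awaits could be replaced with asyncio.gather so that several category scrapers run at once. The sketch below is a hypothetical variation, not part of the original script; it assumes the initialize_database and scrape_rayban_product_urls functions defined above, and running several browser instances in parallel puts more load on both your machine and the site, so it should be used cautiously.

async def main_concurrent():
    initialize_database()
    # Each coroutine scrapes one category/gender combination (list shortened for brevity)
    tasks = [
        scrape_rayban_product_urls('https://www.ray-ban.com/usa/sunglasses/men-s', 'sunglasses', 'Men'),
        scrape_rayban_product_urls('https://www.ray-ban.com/usa/sunglasses/women-s', 'sunglasses', 'Women'),
        # ... remaining category/gender combinations from main()
    ]
    # Run the scrapers concurrently instead of one after another
    await asyncio.gather(*tasks)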


Product Data Extraction


This Python script implements the second stage of the scraper: extracting detailed product information from the Ray-Ban website. Using asynchronous programming with Playwright for browser automation, BeautifulSoup for HTML parsing, and SQLite for data storage, it starts by initialising a database to hold URLs and scraped data, then systematically processes each unscraped URL. For every product page it handles possible roadblocks such as the country selection pop-up, scrolls to trigger lazy-loaded content, and extracts the name, collection, price, colors, and technical specifications of the product. The scraped data is saved to the database and the URL is marked as processed. The system includes error handling, logging, and randomised user agents to make it reliable and harder to detect. This approach allows efficient, scalable scraping of large numbers of product pages while keeping the information in a structured database.


Import Section

import sqlite3
import asyncio
import random
import os
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import logging
import json

This section imports all the libraries required for scraping the product data: sqlite3 for database operations, asyncio for asynchronous programming, random for choosing user agents, os for file operations, Playwright for web automation, BeautifulSoup for HTML parsing, logging for progress and error messages, and json for serialising structured fields. Together these imports lay the groundwork for a solid scraping script that can collect, process, and store the data.


load_user_agents Function

def load_user_agents(file_path):
   """
   Loads a list of user agents from a specified file, filtering out empty lines.
   This function is used to provide rotating user agents for web scraping to avoid detection and potential IP blocks.

   Args:
       file_path (str): Path to the file containing user agents, one per line

   Returns:
       list: A list of cleaned user agent strings

   Raises:
       FileNotFoundError: If the specified file_path does not exist
   """
   if os.path.exists(file_path):
       with open(file_path, 'r') as f:
           # Create list comprehension to strip whitespace and filter empty lines
           return [line.strip() for line in f if line.strip()]
   else:
       raise FileNotFoundError(f"The file {file_path} does not exist.")

The `load_user_agents` function is another important part of the scraper; it helps the script mimic different web browsers and avoid detection. It reads a list of user agent strings from a file, and those strings can be rotated during the scrape so that requests appear to come from different browsers or devices.


It begins by checking whether the specified file exists using `os.path.exists(file_path)`. If the file is found, it opens it and reads the contents, using a list comprehension to strip whitespace from each line and filter out empty lines. This ensures that only valid user agent strings end up in the final list.


The function raises a `FileNotFoundError` with a custom message if the file is not found. This kind of error handling matters because it alerts the user immediately if there is a problem with the user agent file, preventing the scraper from running without this critical component. Keeping the user agents in a separate file also makes the list easy to update and maintain without modifying the main script.


parse_product_name Function

def parse_product_name(soup):
   """
   Extracts the product name from the BeautifulSoup object of a product page.
   Targets the main product title element using a specific CSS selector path.

   Args:
       soup (BeautifulSoup): Parsed HTML content of the product page

   Returns:
       str: The product name if found, 'N/A' if the element is not present
   """
   # CSS selector for the product name heading
   product_name_elem = soup.select_one('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.sticky-sidebar > div > div > div.rb-product-information > div.rb-product-information__product-name-and-wishlist > h1.rb-product-name')
   return product_name_elem.get_text(strip=True) if product_name_elem else 'N/A'

`parse_product_name` extracts the name of the product from a product page's HTML. It takes a BeautifulSoup object representing the already-parsed HTML of the page. Accurately identifying product names is essential for categorising the scraped products later.


The function uses a particular CSS selector to locate the element containing the product name. The selector, `body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.sticky-sidebar > div > div > div.rb-product-information > div.rb-product-information__product-name-and-wishlist > h1.rb-product-name`, is long and very specific, reflecting how deeply the product name is nested within the complex HTML structure of the page.


If the element is found, the function calls `get_text(strip=True)` to extract its text content, removing leading and trailing whitespace. If the element is not found, perhaps because the page structure has changed or the page did not load correctly, the function returns 'N/A'. This fallback lets the scraping process continue even when some product names can't be extracted, keeping the data consistent.


parse_collection Function

def parse_collection(soup):
   """
   Extracts the collection name from the BeautifulSoup object of a product page.
   Collection name typically indicates the product line or series.

   Args:
       soup (BeautifulSoup): Parsed HTML content of the product page

   Returns:
       str: The collection name if found, 'N/A' if the element is not present
   """
   # CSS selector for the collection name element
   collection_elem = soup.select_one('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.sticky-sidebar > div > div > div.rb-product-information > div.rb-product-information__status-message-label > div > div > p')
   return collection_elem.get_text(strip=True) if collection_elem else 'N/A'

The `parse_collection` function follows the same pattern as `parse_product_name`, but extracts the product's collection name. In e-commerce, a collection usually denotes a product line or series, which is useful for categorising and analysing products.


It too uses a specific CSS selector to locate the element containing the collection name: `body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.sticky-sidebar > div > div > div.rb-product-information > div.rb-product-information__status-message-label > div > div > p`, which points to a paragraph in the product information section of the page.


Like `parse_product_name`, it returns the element's text content if it is found and 'N/A' otherwise. Keeping this uniform approach across the parsing functions helps maintain a consistent data structure even when some information is missing from certain product pages.


parse_colors_number Function

def parse_colors_number(soup):
   """
   Extracts the number of available colors from the BeautifulSoup object.
   This information is typically displayed in the color selection section.

   Args:
       soup (BeautifulSoup): Parsed HTML content of the product page

   Returns:
       str: The number of colors available if found, 'N/A' if the element is not present
   """
   # CSS selector for the colors count element
   colors_elem = soup.select_one('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.sticky-sidebar > div > div > div.rb-right-shoulder__info > div.rb-colours > div.rb-colours__title')
   return colors_elem.get_text(strip=True) if colors_elem else 'N/A'

The `parse_colors_number` function extracts the number of colours available for a given product. This can be useful for inventory analysis, studies of product variety, or as a feature in a product comparison tool.


The function targets the page element that typically holds the colour count. The CSS selector it uses (`'body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.sticky-sidebar > div > div > div.rb-right-shoulder__info > div.rb-colours > div.rb-colours__title'`) points to the title element of the colour selection section of the product page.


Like the other parsing functions, it returns 'N/A' if the element is not found; otherwise, it returns the element's text content. Note that its output may need further processing to isolate the numeric value, since the website may present the count as text (for example, "5 colours available" would need to be reduced to "5").
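If that extra step is needed, a small helper like the hypothetical one below could pull the number out of the string; the example text "5 colours available" is an assumption about how the count might be rendered, not taken from the site.

import re

def extract_color_count(text):
    """Return the first number found in a colour-count string, or 'N/A' if there is none."""
    match = re.search(r'\d+', text)
    return match.group(0) if match else 'N/A'

# extract_color_count('5 colours available')  ->  '5'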


parse_model_code Function

def parse_model_code(soup):
   """
   Extracts the model code from the BeautifulSoup object of a product page.
   Model code is a unique identifier for the product variant.

   Args:
       soup (BeautifulSoup): Parsed HTML content of the product page

   Returns:
       str: The model code if found, 'N/A' if the element is not present
   """
   # CSS selector for the model code element
   code_elem = soup.select_one('#-answer >div > div.rb-product-details__model-code')
   return code_elem.get_text(strip=True) if code_elem else 'N/A'

The `parse_model_code` function retrieves the model code, a unique identifier for a specific product variant. Model codes matter in e-commerce and inventory management because they uniquely identify different variants, which is especially relevant when the same product is available in multiple colours, sizes, or configurations.


This function selects its target element differently from the previous ones. Instead of a long nested CSS selector it uses an ID-based selector (`'#-answer >div > div.rb-product-details__model-code'`), which suggests the model code sits in a more regularly structured part of the page, probably the product details or specifications section.


The function follows the standard pattern, returning the text content if the element exists and 'N/A' otherwise. The extracted model code is valuable for cross-referencing products between systems or databases and for tracking specific variants in inventory or sales analyses.


parse_frame_description Function

def parse_frame_description(soup):
   """
   Extracts the detailed frame description including various features and specifications.
   Creates a dictionary mapping feature labels to their corresponding values.

   Args:
       soup (BeautifulSoup): Parsed HTML content of the product page

   Returns:
       dict: Dictionary containing feature labels as keys and their values as values
   """
   frame_description = {}
   # Select all feature elements
   features = soup.select('.rb-product-detail__features > div')
   for feature in features:
       # Extract label and value for each feature
       label = feature.select_one('.rb-product-detail__label').get_text(strip=True)
       value = feature.select_one('.rb-product-detail__feature > span').get_text(strip=True)
       frame_description[label] = value
   return frame_description

The `parse_frame_description` function is more complex than the previous parsing functions. It collects a detailed description of the product's frame along with its other features and specifications, producing a structured representation of the product that is valuable for detailed comparisons or for populating a comprehensive product database.


Unlike the other functions, which return a single string, this function returns a dictionary. It starts by initialising an empty dictionary called `frame_description`, then selects all elements matching `.rb-product-detail__features > div`, which correspond to the individual feature rows on the product page.


For each feature element, the function extracts two pieces of information: the label (which becomes the dictionary key) and the value (which becomes the dictionary value). It uses dedicated selectors to locate these within each element, `.rb-product-detail__label` for the label and `.rb-product-detail__feature > span` for the value. This is an effective way to capture multiple features and their details in an organised form.


The resulting dictionary gives a full picture of the product's specifications, ready to be processed or stored in a database for further analysis. This level of detail is indispensable for products where technical specifications matter to consumers, as is the case with eyewear and other specialist products.


parse_price_details Function

def parse_price_details(soup):
   """
   Extracts pricing information including MRP, sale price, and discount details.
   Handles both regular and sale pricing scenarios.

   Args:
       soup (BeautifulSoup): Parsed HTML content of the product page

   Returns:
       tuple: Contains (mrp, sale_price, discount)
           - mrp (str): Original/Maximum Retail Price
           - sale_price (str): Discounted price if available
           - discount (str): Discount percentage if available
   """
   # First try to find regular price
   price_elem = soup.select_one('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.rb-sticky-bar > div.rb-sticky-bar-left > div.rb-sticky-bar-left__title > div > span.rb-prices__normal')
   if price_elem:
       # If regular price found, no discount scenario
       mrp = price_elem.get_text(strip=True)
       return mrp, mrp, '0'
   else:
       # Handle sale price scenario
       price_elem = soup.select_one('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.rb-sticky-bar > div.rb-sticky-bar-left > div.rb-sticky-bar-left__title > div > span.rb-prices__list')
       sale_elem = soup.select_one('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.rb-sticky-bar > div.rb-sticky-bar-left > div.rb-sticky-bar-left__title > div > span.rb-prices__discounted')
       discount_elem = soup.select_one('body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-pdp__sidebar > div.rb-sticky-bar > div.rb-sticky-bar-left > div.rb-sticky-bar-left__title > div > span.rb-promo-badge.rb-promo-badge--small')
      
       # Extract values with fallbacks
       mrp = price_elem.get_text(strip=True) if price_elem else '0'
       sale_price = sale_elem.get_text(strip=True) if sale_elem else '0'
       discount = discount_elem.get_text(strip=True) if discount_elem else '0'
       return mrp, sale_price, discount

The `parse_price_details` function handles the sometimes tricky business of extracting pricing information from product pages. It copes both with simple cases where a single price is shown and with sale scenarios, covering the range of pricing schemes found in e-commerce.


It first tries to find a regular price element. When it does find one, it assumes that no discount has been applied and returns the same value for both MRP and sale price, with a discount of 0. This covers products that are not on sale.


If no regular price is found, it then looks for sale pricing elements. It searches for three separate pieces of information: the original price (MRP), the discounted price, and the discount percentage. Each of these is extracted using an appropriate CSS selector targeting a different element on the page.


The function uses a fallback mechanism in case any of these elements is missing, returning '0' by default. As a result, it always returns a consistent (mrp, sale_price, discount) tuple even when some information is absent from the page, which makes it robust against variations in page structure or missing price information.


parse_color_options Function

def parse_color_options(soup):
   """
   Extracts all available color options for the product.
   Processes the color variant buttons to get their alt text descriptions.

   Args:
       soup (BeautifulSoup): Parsed HTML content of the product page

   Returns:
       str: Comma-separated string of available color options, or 'N/A' if none found
   """
   # Find the container div for color options
   target_div = soup.find('div', class_='rb-colours-list')
   if target_div:
       # Find all color variant buttons and extract their image alt texts
       buttons = target_div.find_all('button', class_='rb-colour-variant')
       alt_texts = [img['alt'] for button in buttons for img in button.find_all('img') if 'alt' in img.attrs]
       return ', '.join(alt_texts)
   return 'N/A'

The `parse_color_options` function extracts information about all the colour options available for a product. It is especially handy for products offered in more than one colour, since it gathers every option in one place.


The function starts by searching for a div element with the class `rb-colours-list`, which contains the colour option elements. If this container exists, it looks inside it for all button elements with the class `rb-colour-variant`; these buttons represent the individual colour options.


For each button it finds, the function reads the `alt` text of any img elements inside it, which typically contains the name or description of the colour. Using the accessibility alt text is a nice trick: it tends to be more reliable and more descriptive than trying to parse colour codes or names from other attributes.


Finally, the function joins all the extracted colour names into a single comma-separated string. If no colour options are found, either because the container div is missing or because there are no colour variant buttons, it returns 'N/A'. This way the function always returns a string and the data structure stays consistent even when colour information is unavailable.


scrape_product_details Function

async def scrape_product_details(page, url, category, gender):
   """
   Asynchronously scrapes detailed product information from a Ray-Ban product page using Playwright.
   Handles various page interactions including country selection popups, scrolling, and dynamic content loading.
  
   Args:
       page (Page): Playwright page object for browser interaction
       url (str): The product URL to scrape
       category (str): Product category (e.g., 'sunglasses', 'eyeglasses')
       gender (str): Target gender for the product
      
   Returns:
       dict: A dictionary containing scraped product details including:
           - url: Product URL
           - category: Product category
           - gender: Target gender
           - name: Product name
           - collection: Collection name
           - number_of_colors: Available color count
           - model_code: Product model code
           - frame_description: Dictionary of frame details
           - mrp: Maximum Retail Price
           - sale_price: Current sale price
           - discount: Discount percentage
           - colors: Available color options
   """
   try:
       # URL encode spaces to ensure proper formatting
       encoded_url = url.replace(" ", "%20")
       # Navigate to the page with extended timeout for slow connections
       response = await page.goto(encoded_url, timeout=60000)
       logging.info(f"Scraping {url} - Response status: {response.status}")

       # Check for successful page load
       if response.status != 200:
           logging.error(f"Failed to load page, status code: {response.status}")
          
       # Handle country selection popup that appears on first visit
       try:
           # Complex selector for the country selection button (first country option)
           await page.locator('#rb-header-app > div.modal-wrapper.modal-wrapper--header.modal-wrapper--display > div.modal-content-wrapper > div > div.rb-modal-content > div > div > span > div.rb-country-overlay-modal__flag-container > a:nth-child(1)').click()
           logging.info("Country selection popup handled successfully.")
           # Wait for popup animation to complete
           await asyncio.sleep(2)
       except Exception as e:
           logging.warning(f"Country selection popup could not be handled: {e}")

       # Scroll the page to trigger lazy loading of content
       for _ in range(10):
           await page.mouse.wheel(0, 1000)  # Scroll down 1000 pixels
           await asyncio.sleep(1)  # Wait for content to load

       # Get initial page content and parse with BeautifulSoup
       content = await page.content()
       soup = BeautifulSoup(content, 'html.parser')

       # Selectors for product details accordion button
       button_selector = "body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-left-shoulder > div.rb-accordions > div.rb-accordion.rb-accordion--with-custom-icon.rb-accordions__product-details > button"
       open_button_selector = "body > div.rb-app__main.static-header.loaded.rb-app__header--static > div.rb-pdp-page > div > div.rb-pdp__scrollable-area > div.rb-left-shoulder > div.rb-accordions > div.rb-accordion.rb-accordion--with-custom-icon.rb-accordion--is-open.rb-accordions__product-details > button"

       # Check if product details section is already open
       is_open = await page.query_selector(open_button_selector)
       if not is_open:
           # Click to open product details section
           await page.click(button_selector)
           logging.info("Product details section opened.")
           # Wait for content to load after clicking
           await page.wait_for_timeout(2000)

       # Get updated page content after opening details section
       content2 = await page.content()
       soup2 = BeautifulSoup(content2, 'html.parser')

   except Exception as e:
       logging.error(f"Could not extract product details from {url}: {e}")
       # Re-raise so the caller's error handling can mark this URL as failed
       raise

   # Parse the price details once and reuse the three values
   mrp, sale_price, discount = parse_price_details(soup)

   # Return all scraped data in a structured dictionary
   return {
       'url': url,
       'category': category,
       'gender': gender,
       'name': parse_product_name(soup),
       'collection': parse_collection(soup),
       'number_of_colors': parse_colors_number(soup),
       'model_code': parse_model_code(soup2),
       'frame_description': parse_frame_description(soup2),
       'mrp': mrp,
       'sale_price': sale_price,
       'discount': discount,
       'colors': parse_color_options(soup)
   }

The `scrape_product_details` function is the heart of the page-level scraping. It is an asynchronous function that uses Playwright to step through an individual product page and extract detailed information, a "one-stop shop" that encapsulates the whole interaction with the page and the retrieval of every data point about the product.


The function first encodes spaces in the URL, navigates to the page with an extended timeout to accommodate slow connections, and logs the response status. Logging the status is important for monitoring the scraping run, and anything other than a 200 status is recorded as an error.


A good characteristic of this function is how it deals with dynamic page elements. It tries to handle the country selection popup that can appear on a first visit to the site, a typical real-world challenge in web scraping. It also scrolls the page to trigger lazy loading of content, ensuring all the necessary information has loaded before extraction begins.


The function then applies the parsing functions defined above (`parse_product_name`, `parse_collection`, and so on) to extract the different pieces of information from the page. It also opens the product details accordion if it is not already open, an example of interacting with the page to reach hidden information. Finally, it compiles all the extracted data into a structured dictionary, providing a comprehensive set of information about the product.


init_database Function

def init_database():
   """
   Initializes SQLite database and creates necessary tables for storing product data.
   Creates two main tables: 'urls' for tracking URLs to scrape and 'data' for storing product information.
   Also handles database schema updates by adding new columns if needed.
  
   Returns:
       sqlite3.Connection: Database connection object
      
   Tables Created:
       urls:
           - id: Primary key
           - url: Product URL
           - category: Product category
           - gender: Target gender
           - scraped: Flag indicating if URL has been processed
          
       data:
           - All product details fields corresponding to scrape_product_details output
   """
   # Create connection to SQLite database
   conn = sqlite3.connect('rayban_products.db')
   cursor = conn.cursor()

   # Create URLs table if it doesn't exist
   cursor.execute('''
       CREATE TABLE IF NOT EXISTS urls (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           url TEXT,
           category TEXT,
           gender TEXT,
           scraped INTEGER DEFAULT 0
       )
   ''')

   # Check if scraped column exists, add if missing
   cursor.execute("PRAGMA table_info(urls)")
   columns = [column[1] for column in cursor.fetchall()]
   if 'scraped' not in columns:
       cursor.execute("ALTER TABLE urls ADD COLUMN scraped INTEGER DEFAULT 0")

   # Create data table for storing product information
   cursor.execute('''
       CREATE TABLE IF NOT EXISTS data (
           url TEXT,
           category TEXT,
           gender TEXT,
           name TEXT,
           collection TEXT,
           number_of_colors TEXT,
           model_code TEXT,
           frame_description TEXT,
           mrp INTEGER,
           sale_price INTEGER,
           discount INTEGER,
           colors TEXT
       )
   ''')

   conn.commit()
   return conn

The `init_database` function is responsible for initializing the SQLite database that will hold all the scraped data. This function also shows good practices in the management of databases and design of schemas for web scraping projects.


The function creates two main tables: 'urls' and 'data'. The 'urls' table is designed to track URLs that need to be scraped, including metadata like category and gender, and a flag to indicate whether the URL has been processed. The 'data' table is structured to store all the detailed product information that will be scraped.


An interesting feature of this function is that it allows the database schema to be updated. It checks whether the 'scraped' column exists in the 'urls' table and adds it if it is missing. This forward-thinking approach allows for easy updates to the database structure without needing to recreate the entire database.


It returns a connection object, which is good practice since it lets the calling code manage the database connection lifecycle.


load_urls_from_db Function

def load_urls_from_db(conn):
   """
   Retrieves all unscraped URLs from the database along with their associated metadata.
  
   Args:
       conn (sqlite3.Connection): Database connection object
      
   Returns:
       list: List of dictionaries containing unscraped URLs with their metadata:
           - url: Product URL
           - category: Product category
           - gender: Target gender
          
   Note:
       Only returns URLs where scraped=0 in the database
   """
   cursor = conn.cursor()
   # Select only unscraped URLs
   cursor.execute("SELECT url, category, gender FROM urls WHERE scraped = 0")
   # Convert results to list of dictionaries for easier handling
   return [{'url': row[0], 'category': row[1], 'gender': row[2]} for row in cursor.fetchall()]

The `load_urls_from_db` function fetches unscraped URLs from the database. It is fundamental to the scraping workflow, providing the list of URLs that still need to be processed.


The function runs a SQL query to fetch the records whose 'scraped' status is 0, that is, URLs that have not yet been processed. It then converts the query results into a list of dictionaries, which makes it easier for the main scraping function to iterate over the URLs and access their associated metadata.


Because it retrieves only unscraped URLs, the function lets the process be paused and resumed effectively: each run picks up with the URLs that have not yet been processed.


update_url_status Function

def update_url_status(conn, url):
   """
   Updates the scraped status of a URL in the database to mark it as processed.
  
   Args:
       conn (sqlite3.Connection): Database connection object
       url (str): The URL to mark as scraped
      
   Note:
       Sets scraped=1 for the specified URL in the urls table
   """
   cursor = conn.cursor()
   # Mark URL as scraped
   cursor.execute("UPDATE urls SET scraped = 1 WHERE url = ?", (url,))
   conn.commit()

The `update_url_status` function is one of the simplest yet most important pieces of the scraping workflow: it marks a URL in the database as scraped once it has been processed.


It takes two parameters, the database connection and the URL, and runs a SQL UPDATE statement that sets the 'scraped' flag of the given URL to 1. The change is committed immediately, keeping the database consistent.


Updating each URL's status after scraping prevents the same page from being scraped repeatedly and makes it easy to track progress during long-running scraping operations.


save_to_db Function

def save_to_db(conn, product_data):
   """
   Saves scraped product information to the database.
   Handles conversion of complex data types (like dictionaries) to JSON for storage.
  
   Args:
       conn (sqlite3.Connection): Database connection object
       product_data (dict): Dictionary containing all scraped product information
           Must contain all fields corresponding to the data table schema
          
   Note:
       Converts frame_description dictionary to JSON string for storage
       Commits the transaction immediately after insertion
   """
   cursor = conn.cursor()
   # Convert frame description dictionary to JSON string
   frame_description_json = json.dumps(product_data['frame_description'])
  
   # Insert product data into database
   cursor.execute('''
       INSERT INTO data (url, category, gender, name, collection, number_of_colors,
                        model_code, frame_description, mrp, sale_price, discount, colors)
       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
   ''', (product_data['url'], product_data['category'], product_data['gender'],product_data['name'], product_data['collection'], product_data['number_of_colors'],product_data['model_code'], frame_description_json, product_data['mrp'],product_data['sale_price'], product_data['discount'], product_data['colors']))
   conn.commit()

The `save_to_db` function saves the scraped product data to the database and shows how to store complex data types in SQLite.


One interesting feature of this function is how it handles the 'frame_description' field. Since this field is a dictionary, which SQLite cannot store directly, it is converted to a JSON string before being inserted, keeping the structured data in a format that is easy to retrieve and parse later.


It uses parameterised SQL queries for the insert, which protects against SQL injection, and commits the transaction immediately afterwards so that each product's data is persisted as soon as it has been processed.
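When the data is read back later, the JSON string can be decoded into a dictionary again with json.loads. A minimal sketch, assuming the rayban_products.db file and the data table created by this script:

import sqlite3
import json

conn = sqlite3.connect('rayban_products.db')
cursor = conn.cursor()
cursor.execute('SELECT name, frame_description FROM data LIMIT 5')
for name, frame_json in cursor.fetchall():
    frame = json.loads(frame_json)  # back to a dictionary of feature labels and values
    print(name, frame)
conn.close()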


main Function

async def main():
   """
   Main execution function that orchestrates the entire scraping process.
   Manages database connections, browser instances, and the scraping workflow.
  
   Process Flow:
   1. Initializes database connection
   2. Loads unscraped URLs
   3. For each URL:
       - Creates new browser instance with random user agent
       - Attempts to scrape product details
       - Saves data to database
       - Updates URL status
       - Closes browser instance
   4. Closes database connection
  
   Note:
       Uses Playwright in non-headless mode for browser automation
       Implements error handling and logging for each step
       Runs asynchronously for better performance
   """
   # Initialize database and load unscraped URLs
   conn = init_database()
   urls_to_scrape = load_urls_from_db(conn)
   user_agents = load_user_agents('user_agents.txt')

   # Start Playwright context
   async with async_playwright() as playwright:
       for url_data in urls_to_scrape:
           url = url_data['url']
           category = url_data['category']
           gender = url_data['gender']
           print(f"Scraping: {url}")
          
           # Launch new browser instance for each URL
           browser = await playwright.chromium.launch(headless=False)
           # Create new context with random user agent
           context = await browser.new_context(user_agent=random.choice(user_agents))
           page = await context.new_page()

           try:
               # Scrape product details and save to database
               product_data = await scrape_product_details(page, url, category, gender)
               save_to_db(conn, product_data)
               update_url_status(conn, url)
               logging.info(f"Successfully saved data for {url}")
           except Exception as e:
               logging.error(f"Error scraping {url}: {e}")
           finally:
               # Clean up browser resources
               await page.close()
               await context.close()
               await browser.close()

   # Close database connection
   conn.close()

# Entry point of the script
if __name__ == "__main__":
   asyncio.run(main())

The `main` function is the orchestrator of the entire scraping process. It is an asynchronous function that ties all other components of the scraper together.

This function begins by initialising the database, loading the list of URLs to scrape, and loading the user agents used to randomise the browser's identity for each request.


The defining characteristic of this function is its use of Playwright to automate the browser interactions. For every URL it launches a fresh browser instance with a random user agent, creating a new context for each scraping operation, which helps avoid detection and potential IP blocks.


The function wraps each URL in a try-except block so that one failing URL does not stop the whole run. It logs successes and errors, which is useful for monitoring the operation, and after processing each URL it closes the page, context, and browser; this cleanup is crucial for memory management in long-running scrapes.


Lastly, the function closes the database connection so that resources are managed properly. The entry point of the script calls `asyncio.run(main())`, which runs the whole asynchronous scraping process; the URLs here are processed one at a time, but the asynchronous design keeps the browser automation responsive and leaves room for concurrency if needed.


Conclusion


Web scraping has revolutionized how we gather and analyze data, making it possible to automate product discovery and gain insights efficiently. This project demonstrated how web scraping can be used to extract detailed product information from Ray-Ban’s eyewear collections, streamlining the research process that would otherwise be tedious and time-consuming. By leveraging Playwright and Beautiful Soup, we navigated the complexities of dynamic web content, ensuring accurate and structured data collection.


While web scraping is a powerful tool, it is essential to adhere to ethical and legal considerations, respecting website policies and terms of service. Moving forward, the extracted data can be used for price tracking, trend analysis, or even building a recommendation system. This project not only highlights the potential of automation in e-commerce research but also opens doors to further innovations in data-driven decision-making.


Connect with Datahut for top-notch web scraping services that bring you the valuable insights you need hassle-free.

Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
