
How to Scrape Oakley's Eye-wear Data?

  • Writer: Shahana farvin
  • 13 hours ago
  • 24 min read

Think of having a super-fast assistant who can explore websites and pick out only the things you need. That’s what web scraping does - it saves you from the boring job of copying and pasting information one by one. It’s a handy skill, especially when you want to collect a lot of data quickly and without stress.


In this blog, we’ll walk through a fun and useful project where we collect data about glasses from Oakley’s official website. You’ll see how we set everything up, how we go step by step to collect the data, and how we clean it to make it ready for use. By the end, you’ll understand how a real web scraping project works from start to finish.


Now, Oakley’s website isn’t your typical web page where everything shows up right away. Instead, it loads much of its content on the fly using JavaScript running in the browser (we won’t dive deep into that here). Because of this, scraping Oakley’s website is a bit more challenging than usual.


Also, the website doesn’t really like bots or automated tools grabbing its data, so it has a few roadblocks to stop that from happening. But don’t worry - we’ll show you how to work around these challenges in a smart way using the right tools.


We’ll break the whole task into two clear steps:


  1. First, we’ll use a tool called Playwright to go through Oakley’s main pages and collect links to each product. Think of it like making a list of all the glasses available on the site.


  2. Next, we’ll visit each of those links. For every single product page, we’ll again use Playwright to open the page. Then, we’ll use another tool called Beautiful Soup to pull out all the important details—like the name, price, and other information about the glasses.


In the next sections, we’ll break down each part of the process of scraping Oakley’s eyewear data. You’ll see the code, how it works, and some tips to handle common issues.


Scrape Oakley's Eye-wear Product URLs


The Python script we’ve built is designed to collect product links from Oakley’s website across three categories: sunglasses, prescription sunglasses, and prescription eyeglasses. It goes through each category, grabs the links for all the products, and saves those links into a small local SQLite database (just think of it as a digital notebook where data is saved neatly).


Since Oakley’s website loads some content in the background and doesn’t show everything all at once, we’ve used a few smart tricks to handle that. The script behaves in a way that’s similar to how a real person would browse the site—slow and steady—so it doesn’t raise any red flags. It also has backup steps in case something goes wrong, helping it retry without crashing.


To keep things running smoothly, the script is written asynchronously, so it can wait for pages to load without freezing the whole program. It also uses a browser automation tool that lets it "click around" and "scroll" on web pages, just like you would.


The script follows a clear plan—it visits one category at a time, gathers the data, and moves on to the next. It takes short breaks between actions and changes the way it looks to the website (by switching user agents, which basically means it pretends to be a different browser or device) to avoid getting blocked. All of this helps make the scraping process smoother and safer.
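The "short breaks between actions" idea can be sketched as a small helper. This is only an illustration under assumptions: the name polite_pause and the delay bounds are made up for this example and are not part of the original script.

```python
import asyncio
import random


async def polite_pause(min_seconds=2.0, max_seconds=6.0):
    """Sleep for a random interval without blocking the event loop."""
    delay = random.uniform(min_seconds, max_seconds)
    await asyncio.sleep(delay)
    return delay


async def demo():
    # Tiny bounds so the demo finishes quickly; a real run would use seconds.
    return await polite_pause(0.01, 0.02)


waited = asyncio.run(demo())
```

Randomizing the pause avoids the perfectly regular timing that a fixed delay would produce.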


Imports


import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import random
import time
import sqlite3

Before our scraper can start collecting data, we need to bring in a few important tools—just like preparing everything before starting a task.


Here’s what each tool does:

  • asyncio helps the script run more smoothly by allowing it to handle many small tasks at once, instead of waiting for one to finish before starting another.

  • playwright is what lets our script open and control a web browser. It can click on things, scroll through pages, and even wait for content to load—just like a person would.

  • BeautifulSoup helps us read the web page’s content and find the exact details we want, like product names or prices.

  • random and time are used to make our scraper act more naturally. We use them to take small breaks and mix up its actions so it doesn’t look like a robot.

  • sqlite3 gives us a way to store the data we collect in a simple, organized format. It keeps everything saved and easy to use later.


When we put all these tools together, our scraper can explore websites, gather the data we need, and save it safely—without standing out or getting blocked.


User Agent Handling Functions

def load_user_agents(file_path):
   """
   Load user agents from a file.

   This function reads a text file containing user agent strings, one per line.
   It filters out any empty lines and returns a list of valid user agent strings.
   Using multiple user agents helps in mimicking different browsers and potentially
   avoiding detection as a bot, which is crucial for web scraping tasks.

   Args:
       file_path (str): Path to the file containing user agents.

   Returns:
       list: A list of user agent strings.
   """
   with open(file_path, 'r') as file:
       user_agents = [line.strip() for line in file.readlines() if line.strip()]
   return user_agents

Websites can sometimes tell when a tool or script—like our scraper—is visiting instead of a real person. One way they do this is by checking something called a user agent. This is a small piece of information your browser shares when it visits a site. It tells the website what kind of device and browser you're using.


To avoid being noticed, our scraper changes its user agent regularly. That’s where the load_user_agents function comes in. It opens a file that contains many different user agent strings—these are just text versions of different browser identities. The function reads them all and keeps them ready.

def get_random_user_agent(user_agents):
   """
   Select a random user agent from the provided list.

   This function is used to randomise the user agent for each request. By using
   different user agents, the script can simulate requests coming from various
   browsers and devices. This randomization helps in distributing the requests
   and potentially reducing the chances of being blocked by the website due to
   suspicious activity patterns.

   Args:
       user_agents (list): A list of user agent strings.

   Returns:
       str: A randomly selected user agent string.
   """
   return random.choice(user_agents)

The get_random_user_agent function is the one that actually picks a user agent from the list we loaded earlier. Every time our scraper visits the website, it chooses one of those user agents at random.


This means that each visit can look slightly different—sometimes like it's coming from a phone, other times from a different browser or device. This simple trick helps our scraper stay under the radar and avoid being blocked.


By using all these functions together, our scraper doesn't stand out. It behaves more like regular web traffic, making it harder for websites to tell that it's an automated tool collecting data.
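Here is a small, self-contained sketch of how the two helpers might be used together. The user agent strings are placeholders rather than real browser values, and the temporary file stands in for the useragents.txt file that the script expects.

```python
import os
import random
import tempfile


def load_user_agents(file_path):
    """Read one user agent per line, skipping blank lines."""
    with open(file_path, "r", encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]


# Placeholder strings standing in for real user agent values.
sample = "ExampleBrowser/1.0 (desktop)\n\nExampleBrowser/2.0 (mobile)\n"

with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    path = tmp.name

user_agents = load_user_agents(path)
os.remove(path)

# random.choice raises IndexError on an empty list, so fail early and clearly.
if not user_agents:
    raise ValueError("user agent file is empty")

chosen = random.choice(user_agents)
```

Checking for an empty list up front gives a clearer error than the IndexError that random.choice would otherwise raise.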


Main Scraping Function

async def scrape_product_urls(category_url, category_name, user_agents, retries=3):
   """
   Scrape product URLs from a given category page.

   This function is the core of the scraping process. It uses Playwright to navigate
   to the category URL and interact with the page. The function simulates scrolling
   and clicking the 'Load More' button to ensure all products are loaded. It then
   uses BeautifulSoup to parse the HTML and extract product URLs. The function
   includes error handling and retry logic to deal with potential network issues
   or anti-scraping measures. It's designed to be resilient and can retry the
   scraping process multiple times if errors occur.

   Args:
       category_url (str): The URL of the category page to scrape.
       category_name (str): The name of the category being scraped.
       user_agents (list): A list of user agent strings to use for requests.
       retries (int, optional): Number of retry attempts in case of failure. Defaults to 3.

   Returns:
       list: A list of dictionaries containing product URLs and their categories.
   """
   product_urls = []
  
   for attempt in range(retries):
       try:
           async with async_playwright() as p:
               browser = await p.chromium.launch(headless=False, args=["--disable-http2"])
               user_agent = get_random_user_agent(user_agents)
               context = await browser.new_context(user_agent=user_agent)
               page = await context.new_page()
              
               print(f"Navigating to {category_url} with User-Agent: {user_agent}")
               await page.goto(category_url, timeout=120000)
              
               while True:
                   await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
                   await page.wait_for_timeout(5000)
                  
                   load_more_button = await page.query_selector(
                       '#skipToMainContent > div.replaceProducts > div > div > div.lazy-load-pagination > div > a'
                   )
                   if load_more_button:
                       is_visible = await load_more_button.is_visible()
                       if is_visible:
                           await load_more_button.click()
                           print("Clicked 'Load More' button.")
                           await page.wait_for_timeout(10000)
                       else:
                           print("'Load More' button is not visible.")
                           break
                   else:
                       print("No 'Load More' button found.")
                       break
              
               html = await page.content()
               soup = BeautifulSoup(html, 'html.parser')
              
               base_url = "https://www.oakley.com/"
              
               footer_divs = soup.find_all("div", class_="prod-tile_footer")
               product_links = [
                   {"url": base_url + div['data-href'], "category": category_name}
                   for div in footer_divs if 'data-href' in div.attrs
               ]

               await browser.close()
              
           return product_links
      
       except Exception as e:
           print(f"Error on {category_url}: {e}")
           if "ERR_HTTP2_PROTOCOL_ERROR" in str(e):
               print("HTTP/2 Protocol Error: Retrying with random delay...")
               # asyncio.sleep pauses this coroutine without blocking the event loop
               await asyncio.sleep(random.uniform(5, 15))
           else:
               print(f"Retrying ({attempt+1}/{retries})...")
               await asyncio.sleep(5)
  
   return product_urls

This part of the scraper is where most of the action happens. Here's what it does:

First, it picks a random user agent to stay hidden and then visits a product category page on the website. Once there, it scrolls down the page—just like a real person looking through all the items.


While scrolling, it watches for a "Load More" button. If it sees one, it clicks it to load more products. It keeps doing this until the page shows everything and there’s nothing left to load.


Once the page is fully loaded, the scraper grabs all the content and uses BeautifulSoup to look through it and collect the links for each product.
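The script builds each link by adding the data-href value to the base URL. As a hedged alternative, urljoin from the standard library handles the join whether or not the path starts with a slash. The sample paths below are hypothetical and only illustrate the two cases.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.oakley.com/"

# Hypothetical data-href values, with and without a leading slash.
sample_hrefs = ["/en-us/product/EXAMPLE1", "en-us/product/EXAMPLE2"]

product_links = [
    {"url": urljoin(BASE_URL, href), "category": "Sunglasses"}
    for href in sample_hrefs
]
```

Both entries end up with a single slash after the domain, which plain string concatenation would not guarantee.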


If something goes wrong—maybe the page didn’t load properly or there was an error—it doesn’t quit right away. The function will try again, up to three times, to make sure it gets the job done.
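The retry idea can be looked at in isolation. The sketch below is not the script's code: with_retries and flaky_task are hypothetical names, and the delays are kept tiny so the example runs quickly.

```python
import asyncio
import random


async def with_retries(task, retries=3, base_delay=0.01):
    """Run an async callable, retrying with a randomized pause on failure."""
    for attempt in range(1, retries + 1):
        try:
            return await task()
        except Exception as error:
            print(f"Attempt {attempt}/{retries} failed: {error}")
            if attempt < retries:
                # asyncio.sleep yields control instead of blocking the loop.
                pause = base_delay * attempt + random.uniform(0, base_delay)
                await asyncio.sleep(pause)
    return []  # Like the scraper, fall back to an empty list.


calls = {"count": 0}


async def flaky_task():
    # Hypothetical stand-in for a page load that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated network error")
    return ["ok"]


result = asyncio.run(with_retries(flaky_task))
```

The pause grows a little with each attempt, which gives a struggling connection more time to recover before the next try.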


Category Scraping Coordinator

async def scrape_all_categories_sequentially(categories, user_agents):
   """
   Scrape product URLs from all specified categories sequentially.

   This function acts as a coordinator for the scraping process. It iterates through
   each category in the provided list and calls the scrape_product_urls function
   for each one. The sequential approach ensures that categories are scraped one
   at a time, which can help in managing resources and avoiding overwhelming the
   target website with simultaneous requests. This function aggregates the results
   from all categories into a single list, providing a comprehensive collection
   of all scraped product URLs across all specified categories.

   Args:
       categories (list): A list of dictionaries containing category URLs and names.
       user_agents (list): A list of user agent strings to use for requests.

   Returns:
       list: A list of dictionaries containing all scraped product URLs and their categories.
   """
   all_products = []
  
   for category in categories:
       print(f"Scraping category: {category['name']}")
       category_products = await scrape_product_urls(category['url'], category['name'], user_agents)
       all_products.extend(category_products)
  
   return all_products

This function helps the scraper visit different sections of the website—one at a time. Think of it like following a plan that lists all the product categories we want to check, such as sunglasses, prescription glasses, and more.


It goes through each category in order. For every category, it calls the main function that does the scraping, waits for it to finish, and then moves on to the next one. By doing this step-by-step, we make sure we don’t overload the website or appear suspicious.


Once all categories have been visited, it collects every product link found from each one and combines them into one big list. This gives us a full list of all the products we’re interested in, across all sections of the site.


Database Functions


These functions are all about organizing and saving the information we collect.

def init_database():
   """
   Initialise the SQLite database and create the products table if it doesn't exist.

   This function sets up the SQLite database for storing the scraped product data.
   It creates a new database file if it doesn't exist, or connects to an existing one.
   The function also ensures that the necessary table structure is in place by
   executing a CREATE TABLE IF NOT EXISTS query. This approach allows the script
   to be run multiple times without duplicating the table structure. The table
   is designed with an auto-incrementing primary key, a category field, and a
   unique URL field to prevent duplicate entries.

   Returns:
       sqlite3.Connection: A connection object to the SQLite database.
   """
   conn = sqlite3.connect('oakley_products.db')
   cursor = conn.cursor()
   cursor.execute('''
       CREATE TABLE IF NOT EXISTS products (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           category TEXT,
           url TEXT UNIQUE
       )
   ''')
   conn.commit()
   return conn

The init_database function gets everything ready for storing data. First, it checks if the database (where we’ll store everything) already exists. If it doesn’t, the function creates one.


Then it checks if the right kind of storage space—called a table—is available in the database. If it’s not there, the function creates it. This makes sure we have the right setup before we begin saving product information.

def save_to_database(data, conn):
   """
   Save the scraped product data to the SQLite database.

   This function is responsible for persisting the scraped data into the SQLite
   database. It iterates through the list of scraped products and attempts to
   insert each one into the database. The function uses parameterized queries
   to prevent SQL injection vulnerabilities. It also handles potential
   IntegrityError exceptions, which could occur if a duplicate URL is encountered
   (due to the UNIQUE constraint on the url field). This approach ensures that
   the database remains consistent and free of duplicates, even if the scraping
   process is run multiple times or encounters duplicate products.

   Args:
       data (list): A list of dictionaries containing product URLs and their categories.
       conn (sqlite3.Connection): A connection object to the SQLite database.
   """
   cursor = conn.cursor()
   for product in data:
       try:
           cursor.execute('''
               INSERT INTO products (category, url) VALUES (?, ?)
           ''', (product['category'], product['url']))
       except sqlite3.IntegrityError:
           print(f"Duplicate URL found: {product['url']}")
   conn.commit()
   print("Data saved successfully to the database.")

The save_to_database function handles the saving part. It takes all the product details we’ve collected and stores them neatly in the database.


It also guards against duplicates. The url column is marked UNIQUE, so if the same link is inserted again, SQLite raises an IntegrityError. The function catches that error, prints a message, and skips the item, so we don’t end up saving the same thing twice.


Together with the setup functions, this makes sure everything our scraper finds is stored safely, without any confusion or repetition. This way, we can easily come back and use the data whenever we need it.
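To see the duplicate protection in action, here is a minimal sketch using an in-memory database. Note that it swaps in SQLite's INSERT OR IGNORE instead of catching IntegrityError as the script does. The example.com URLs are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # In-memory database for the demo only.
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, category TEXT, url TEXT UNIQUE)"
)

rows = [
    ("Sunglasses", "https://example.com/p/1"),
    ("Sunglasses", "https://example.com/p/2"),
    ("Sunglasses", "https://example.com/p/1"),  # Duplicate URL.
]

# INSERT OR IGNORE silently skips rows that violate the UNIQUE constraint.
conn.executemany(
    "INSERT OR IGNORE INTO products (category, url) VALUES (?, ?)", rows
)
conn.commit()

saved = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
conn.close()
```

The trade-off is that skipped duplicates are no longer reported, so the try/except version in the script is the better fit when you want to log them.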


Main Function

async def main():
   """
   Main function to orchestrate the scraping process and database operations.

   This function serves as the entry point and coordinator for the entire scraping
   operation. It begins by loading the list of user agents, which will be used
   throughout the scraping process to vary the requests. It then defines the
   categories to be scraped, each with a specific URL and name. The function
   initiates the scraping process by calling scrape_all_categories_sequentially,
   which handles the actual data collection. Once the scraping is complete, the
   function initializes the database connection and saves the collected data.
   Finally, it ensures proper closure of the database connection. This structured
   approach allows for clear separation of concerns between data collection,
   storage, and overall process management.
   """
   user_agents = load_user_agents('useragents.txt')
  
   categories = [
       {"url": "https://www.oakley.com/en-us/category/sunglasses", "name": "Sunglasses"},
       {"url": "https://www.oakley.com/en-us/category/prescription/sunglasses", "name": "Prescription Sunglasses"},
       {"url": "https://www.oakley.com/en-us/category/prescription/eyeglasses", "name": "Prescription Eyeglasses"}
   ]
  
   all_products = await scrape_all_categories_sequentially(categories, user_agents)
  
   conn = init_database()
   save_to_database(all_products, conn)
   conn.close()

The main function is the part that brings everything together. It doesn’t do the scraping itself, but it tells all the other parts when to start and what to do.


First, it loads the list of user agents—these are used to help our scraper move around the website without being noticed. Then it sets up the list of categories we want to scrape.


Once everything is ready, it starts the scraping process by calling the function that goes through each category. While the scraping is happening, the main function just waits and watches.


After all the data is collected, the main function takes over again. It tells the program to save the data into the database. Finally, it makes sure everything is closed properly, like turning off the lights after the job is done.


Script Execution

if __name__ == "__main__":
   asyncio.run(main())

Now we come to the last and simplest part—but also one of the most important. This is the "on switch" that kicks off everything we've built so far.


The line if __name__ == "__main__": is just a smart check. It asks, "Is this script being run directly, rather than imported by another script?" If the answer is yes, it moves ahead and starts everything.


What comes next is asyncio.run(main()). This is like hitting the play button for the whole operation. It tells Python, "Let’s start our main function and keep things moving efficiently."


Since our scraper is built from async functions (visiting pages, scrolling, and waiting for content to load), asyncio.run() acts like a stage manager. It starts the event loop that drives those functions, runs main() until it finishes, and then shuts the loop down cleanly, so everything happens in the right order.


So with this one simple line, we turn on our entire scraping system. From disguising itself to gathering product info to saving everything neatly in a database—this is where it all begins.
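A stripped-down sketch shows the same entry-point pattern. The step function and its names are hypothetical stand-ins for the real scraping and saving work.

```python
import asyncio

events = []


async def step(name):
    # await hands control back to the event loop while "waiting".
    await asyncio.sleep(0)
    events.append(name)


async def main():
    # Each await finishes before the next line runs, so order is preserved.
    await step("scrape")
    await step("save")


if __name__ == "__main__":
    asyncio.run(main())
```

If this file were imported from another script, the guard would stop main() from running automatically.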


Scraping Products Data


Now that we've gathered all the product URLs from Oakley's sunglasses, prescription sunglasses, and eyeglasses sections, it's time to dive deeper. This next script is responsible for visiting each collected URL and extracting full product details such as name, price, description, features, and more.


Import Section

import asyncio
import sqlite3
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

As we've explained before, we start by importing all the essential tools for our web scraper: asyncio, sqlite3, playwright, and BeautifulSoup.


Database Setup Function

def setup_database(db_name):
   """
   Set up the SQLite database for storing product information.

   This function performs the following tasks:
   1. Connects to the specified SQLite database.
   2. Checks if the 'scraped' column exists in the 'products' table and adds it if not present.
   3. Creates a new 'scraped_products' table if it doesn't already exist.

   Args:
       db_name (str): The name of the SQLite database file.

   Returns:
       None
   """
   conn = sqlite3.connect(db_name)
   cursor = conn.cursor()

   # Check if 'scraped' column exists in products table
   cursor.execute("PRAGMA table_info(products)")
   columns = [column[1] for column in cursor.fetchall()]
  
   # Alter products table to add scraped column if it doesn't exist
   if 'scraped' not in columns:
       cursor.execute('''
       ALTER TABLE products ADD COLUMN scraped INTEGER DEFAULT 0
       ''')

   # Create scraped_products table
   cursor.execute('''
   CREATE TABLE IF NOT EXISTS scraped_products (
       id INTEGER PRIMARY KEY,
       url TEXT,
       collection TEXT,
       title TEXT,
       sale_price TEXT,
       original_price TEXT,
       discount TEXT,
       number_of_colors TEXT,
       size TEXT,
       fit TEXT,
       bridge TEXT,
       light_transmission TEXT,
       light_conditions TEXT,
       information_notice TEXT,
       product_code TEXT,
       size_ TEXT,
       description TEXT,
       content TEXT
   )
   ''')

   conn.commit()
   conn.close()

The setup_database function is the foundation for storing all the product data we scrape. Before any scraping begins, this function prepares our database to make sure everything is in place.


One smart thing it does is check whether the 'scraped' column already exists in the database. If it doesn’t, the function adds it. This makes the setup flexible and backward-compatible—meaning, if you’ve used the database before, it won’t break or lose data when the structure changes. Instead, it smoothly updates the schema while keeping existing data safe.


The most important part of this function is creating the scraped_products table. This table acts like a blueprint, organizing how we store product information. It defines a specific field (or column) for each attribute we plan to collect, such as URL, collection, title, prices, size, and fit. This structured setup allows us to scrape a wide range of product details and store them neatly in one place.


In the end, this careful setup makes it easy to search, analyze, or even visualize the data once the scraping is done.
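The "add the column only if it is missing" check can be tried out on an in-memory database. The helper name ensure_column is an assumption made for this sketch. It follows the same PRAGMA table_info approach as setup_database.

```python
import sqlite3


def ensure_column(conn, table, column, definition):
    """Add a column only if the table does not have it yet."""
    # Names are interpolated into the SQL, so pass trusted values only.
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")
        return True
    return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, url TEXT UNIQUE)"
)
conn.execute(
    "INSERT INTO products (category, url) "
    "VALUES ('Sunglasses', 'https://example.com/p/1')"
)

first_run = ensure_column(conn, "products", "scraped", "INTEGER DEFAULT 0")
second_run = ensure_column(conn, "products", "scraped", "INTEGER DEFAULT 0")

# Rows that existed before the change pick up the default value.
flag = conn.execute("SELECT scraped FROM products").fetchone()[0]
conn.close()
```

Running the check twice is safe: the second call finds the column and does nothing.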


Retrieve Unscraped URLs Function

def get_unscraped_urls(db_name):
   """
   Retrieve URLs of products that have not been scraped yet.

   Args:
       db_name (str): The name of the SQLite database file.

   Returns:
       list: A list of tuples containing (id, url) for unscraped products.
   """
   conn = sqlite3.connect(db_name)
   cursor = conn.cursor()
   cursor.execute("SELECT id, url FROM products WHERE scraped = 0")
   urls = cursor.fetchall()
   conn.close()
   return urls

The get_unscraped_urls function plays a key role in making our scraping process efficient. Its job is to find out which products still need to be scraped. It does this by checking the database for any products where the 'scraped' flag is set to 0, meaning they haven’t been processed yet.


This helps the scraper focus only on new or pending work, instead of going over the same products again and again. It saves time and system resources by skipping what’s already done.


The function returns a list of tuples, with each tuple containing two important things: the product's ID and its URL. The URL tells the scraper where to go, and the ID helps us update the database later to mark the product as "scraped" once we're done.


This setup is simple but very useful. It gives the scraper a clear checklist of what’s left to do, while also making it easy to keep track of progress and avoid duplication.


Update Scraped Status Function

def update_scraped_status(db_name, product_id):
   """
   Update the 'scraped' status of a product in the database.

   Args:
       db_name (str): The name of the SQLite database file.
       product_id (int): The ID of the product to update.

   Returns:
       None
   """
   conn = sqlite3.connect(db_name)
   cursor = conn.cursor()
   cursor.execute("UPDATE products SET scraped = 1 WHERE id = ?", (product_id,))
   conn.commit()
   conn.close()

The update_scraped_status function acts like an accountant for our scraper. After a product has been successfully scraped, this function updates the database to mark that product as "done."


It does this by setting the 'scraped' flag to 1 for that specific product ID. This small action plays a big role in keeping the scraping process clean and organized—ensuring we don’t scrape the same product more than once.


This update helps the scraper remember what’s already been covered, which is especially useful if the process is paused and resumed later. Even after multiple runs or interruptions, the scraper can pick up exactly where it left off without repeating any work.


In short, it’s a simple but essential function for building a reliable and efficient scraping system that can scale to handle lots of data without getting confused or duplicating effort.
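Here is a small sketch of the resume behaviour, again on an in-memory database with placeholder URLs. It simulates one product being scraped before the run stops, then checks what a later run would still see as pending.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, url TEXT UNIQUE, scraped INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO products (url) VALUES (?)",
    [("https://example.com/p/1",), ("https://example.com/p/2",)],
)

pending = conn.execute(
    "SELECT id, url FROM products WHERE scraped = 0 ORDER BY id"
).fetchall()

# Pretend the first product was scraped successfully, then the run stopped.
first_id, _ = pending[0]
conn.execute("UPDATE products SET scraped = 1 WHERE id = ?", (first_id,))
conn.commit()

# A later run only sees what is still pending.
remaining = conn.execute(
    "SELECT id, url FROM products WHERE scraped = 0 ORDER BY id"
).fetchall()
conn.close()
```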


Save Scraped Product Function

def save_scraped_product(db_name, product_data):
   """
   Save the scraped product details to the 'scraped_products' table.

   Args:
       db_name (str): The name of the SQLite database file.
       product_data (dict): A dictionary containing the scraped product information.

   Returns:
       None
   """
   conn = sqlite3.connect(db_name)
   cursor = conn.cursor()
   cursor.execute('''
   INSERT INTO scraped_products (
       url, collection, title, sale_price, original_price, discount,
       number_of_colors, size, fit, bridge, light_transmission,
       light_conditions, information_notice, product_code, size_,
       description, content
   )
   VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
   ''', (
       product_data['url'], product_data['collection'], product_data['title'],
       product_data['sale_price'], product_data['original_price'], product_data['discount'],
       product_data['number_of_colors'], product_data['size'], product_data['fit'],
       product_data['bridge'], product_data['light_transmission'], product_data['light_conditions'],
       product_data['information_notice'], product_data['product_code'], product_data['size_'],
       product_data['description'], product_data['content']
   ))
   conn.commit()
   conn.close()

The save_scraped_product function is like the librarian of our scraper. Once the scraper has gathered all the details about a product, this function makes sure everything is neatly stored in the database.


It uses something called a parameterized SQL query—which is a safe way to insert data. This prevents security issues like SQL injection and also makes sure each piece of data is stored in the right format.


The function handles a wide range of product details—from basic information like the product’s URL and title, to very specific measurements like bridge width or light transmission. This shows how thorough our scraper is: it collects not just surface-level info, but deep product specs too.


By storing all of this in a well-structured way, the function allows us to analyze, filter, or even visualize the data later with precision. It's an essential step that turns raw scraped data into organized, usable information.
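As a possible variation, sqlite3 also supports named placeholders, which pair naturally with a dictionary like product_data. The sketch below uses a cut-down table and a made-up product record. It is not the script's actual insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scraped_products ("
    "id INTEGER PRIMARY KEY, url TEXT, title TEXT, sale_price TEXT)"
)

# Hypothetical product record; the quotes in the title show why binding matters.
product_data = {
    "url": "https://example.com/p/1",
    "title": "Example Frame 'Pro'",
    "sale_price": "$100.00",
}

# Named placeholders (:key) are filled from the dict by sqlite3.
conn.execute(
    "INSERT INTO scraped_products (url, title, sale_price) "
    "VALUES (:url, :title, :sale_price)",
    product_data,
)
conn.commit()

stored_title = conn.execute("SELECT title FROM scraped_products").fetchone()[0]
conn.close()
```

With seventeen columns, named placeholders make it harder to mix up the order of values than a long row of question marks.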


parse_collection Function

def parse_collection(soup):
   """
   Parse the collection name from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The collection name if found, 'N/A' otherwise.
   """
   collection_tag = soup.select_one("#pdhero > div.wrapper > form > div > div.pdp-area.pdp-sidebar.oo-w-100 > div > div.items-backBadge.slider-gallery-items__backBadge > div > p")
   if collection_tag:
       return collection_tag.get_text(strip=True)
   return 'N/A'

The parse_collection function is like a label reader for our scraper. Its job is to find the name of the product’s collection—basically, the group or series the product belongs to.


It does this by looking for a specific part of the product page using a CSS selector. If it finds that part, it grabs the text, trims any extra spaces from the beginning or end, and returns it.


If the scraper doesn’t find the collection info on the page, it doesn’t panic—it just returns "N/A" to let us know that the collection name wasn’t available.


This makes the scraper flexible and robust, even when some product pages are missing expected details.
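Because the same find-or-fallback pattern is used for every field, it could be captured in one small helper. The function name parse_text and the sample markup are assumptions for illustration. The real product pages use much longer selectors.

```python
from bs4 import BeautifulSoup


def parse_text(soup, selector, default="N/A"):
    """Return the stripped text of the first match, or a default value."""
    tag = soup.select_one(selector)
    if tag is not None:
        text = tag.get_text(strip=True)
        if text:
            return text
    return default


# Simplified, made-up markup; the real product page is far more nested.
html = '<div><p class="collection">  Example Collection </p></div>'
soup = BeautifulSoup(html, "html.parser")

collection = parse_text(soup, "p.collection")
missing = parse_text(soup, "span.does-not-exist")
```

A shared helper like this keeps the fallback behaviour consistent, while separate functions (as in the original script) make each field easier to customize.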

Next, we use the same approach for the product title, sale price, original price, discount, number of colors, size, fit, bridge, light transmission, light conditions, information notice, product code, description, and content. Take a look at the functions:


parse_product_title Function

def parse_product_title(soup):
   """
   Parse the product title from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The product title if found, 'N/A' otherwise.
   """
   title_tag = soup.select_one('#pdhero > div.wrapper > form > div > div.pdp-sidebar-wrapper__mobile > div > h1 > span')
   if title_tag:
       return title_tag.get_text(strip=True)
   return "N/A"

parse_sale_price Function

def parse_sale_price(soup):
   """
   Parse the sale price from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The sale price if found, 'N/A' otherwise.
   """
   sale_price_tag = soup.select_one('[data-test="sale-price"]')
   if sale_price_tag:
       return sale_price_tag.get_text(strip=True)
   return "N/A"

parse_original_price Function

def parse_original_price(soup):
   """
   Parse the original price from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The original price if found, the sale price if original price is not found.
   """
   original_price_tag = soup.select_one('[data-test="original-price"]')
   if original_price_tag and original_price_tag.get_text(strip=True):
       return original_price_tag.get_text(strip=True)
   return parse_sale_price(soup)

parse_discount Function

def parse_discount(soup):
   """
   Parse the discount percentage from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The discount percentage if found, '0' otherwise.
   """
   discount_tag = soup.select_one('[data-test="percentage-off"]')
   if discount_tag and discount_tag.get_text(strip=True):
       return discount_tag.get_text(strip=True)
   return "0"

parse_number_of_colors Function

def parse_number_of_colors(soup):
   """
   Parse the number of available colours from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The number of colours if found, 'N/A' otherwise.
   """
   colors_tag = soup.select_one("#pdhero > div.wrapper > form > div > div.pdp-area.pdp-sidebar.oo-w-100 > div > div.pdp-filter-by-technology.pdp-filter-by-technology__abtest.nosize > div.oo-w-100.genericActionBox.o21_bg.thumbnails > div > div > div > h2 > span.oo-text.bold.o21_text-color2.colorLabel")
   if colors_tag:
       return colors_tag.get_text(strip=True)
   return "N/A"

parse_size Function

def parse_size(soup):
   """
   Parse the size information from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The size information if found, 'N/A' otherwise.
   """
   size_tag = soup.select_one("#pdhero > div.wrapper > form > div > div.pdp-area.pdp-sidebar.oo-w-100 > div > div.product-cart-wrapper > div.sizeSelectorWrapper__mobile > div > div > div:nth-child(1) > span")
   if size_tag:
       return size_tag.get_text(strip=True)
   return "N/A"

parse_fit Function

def parse_fit(soup):
   """
   Parse the fit information from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The fit information if found, 'N/A' otherwise.
   """
   fit_tag = soup.select_one("#pdhero > div.wrapper > form > div > div.pdp-area.pdp-sidebar.oo-w-100 > div > div.product-cart-wrapper > div.sizeSelectorWrapper__mobile > div > div > h2 > span.o21_text-normal")
   if fit_tag:
       return fit_tag.get_text(strip=True)
   return "N/A"

parse_bridge Function

def parse_bridge(soup):
   """
   Parse the bridge size information from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The bridge size if found, 'N/A' otherwise.
   """
   bridge_tag = soup.select_one("#size-guide-panel > div.size-selection-box.form-field.select-size.hidden-md-down > div.fit-description > span:nth-child(2)")
   if bridge_tag:
       return bridge_tag.get_text(strip=True)
   return 'N/A'

parse_light_transmission Function

def parse_light_transmission(soup):
   """
   Parse the light transmission percentage from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The light transmission percentage if found, 'N/A' otherwise.
   """
   transmission_tag = soup.select_one("#pdhero > div.pdpBottom > div.pdpBottom-featandtechAccordion > div > div > div.lensDetailsAccordion.oo-flex > div.lensDetailsAccordion-section > div > ul > li:nth-child(1) > span.o21_text-bold")
   if transmission_tag:
       return transmission_tag.get_text(strip=True)
   return 'N/A'

parse_light_conditions Function

def parse_light_conditions(soup):
   """
   Parse the recommended light conditions from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The recommended light conditions if found, 'N/A' otherwise.
   """
   conditions_tag = soup.select_one("#pdhero > div.pdpBottom > div.pdpBottom-featandtechAccordion > div > div > div.lensDetailsAccordion.oo-flex > div.lensDetailsAccordion-section > div > ul > li:nth-child(2) > span.o21_text-bold")
   if conditions_tag:
       return conditions_tag.get_text(strip=True)
   return "N/A"

parse_information_notice Function

def parse_information_notice(soup):
   """
   Parse the information notice from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The information notice if found, 'N/A' otherwise.
   """
   notice_tag = soup.select_one("#pdhero > div.pdpBottom > div.pdpBottom-featandtechAccordion > div > div > div.lensDetailsAccordion.oo-flex > div.lensDetailsAccordion-section > div > ul > li:nth-child(4) > span.o21_text-bold")
   if notice_tag:
       return notice_tag.get_text(strip=True)
   return "N/A"

parse_product_code Function

def parse_product_code(soup):
   """
   Parse the product code from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The product code if found, 'N/A' otherwise.
   """
   code_tag = soup.select_one("#pdhero > div.pdpBottom > div.pdpBottom-productInfoAccordion > div > div > div > h2.pdpBottom-productInfo-code.o21_text-color2.o21_text8.o21_text-medium.text-uppercase > span.o21_text-color1")
   if code_tag:
       return code_tag.get_text(strip=True)
   return "N/A"

parse_description Function

def parse_description(soup):
   """
   Parse the product description from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The product description if found, 'N/A' otherwise.
   """
   description_tag = soup.select_one("#pdhero > div.pdpBottom > div.pdpBottom-productInfoAccordion > div > div > div > div.pdpBottom-productInfo-description.o21_text-color2.o21_text7.o21_text-medium")
   if description_tag:
       return description_tag.get_text(strip=True)
   return "N/A"

parse_content Function

def parse_content(soup):
   """
   Parse additional content from the product page.
  
   Args:
   soup (BeautifulSoup): The BeautifulSoup object of the product page.
  
   Returns:
   str: The additional content if found, 'N/A' otherwise.
   """
   content_tag = soup.select_one("#pdhero > div.pdpBottom > div.pdpBottom-productInfoAccordion > div > div > div > div.accordionBox.oo-w-100.pdpAccordion.pdpBottom-productInfo-readmore.o21_bg > div > div")
   if content_tag:
       return content_tag.get_text(strip=True)
   return "N/A"

scrape_product_details Function

async def scrape_product_details(page, url):
   """
   Scrape product details from a given URL using Playwright and BeautifulSoup.

   This function navigates to the product page, waits for the content to load,
   and then extracts various product details using BeautifulSoup.

   Args:
       page (Page): A Playwright page object.
       url (str): The URL of the product page to scrape.

   Returns:
       dict: A dictionary containing the scraped product details.
   """
   await page.goto(url, timeout=60000)
   await page.wait_for_timeout(2000)  # Wait for 2 seconds to ensure page content is loaded

   content = await page.content()
   soup = BeautifulSoup(content, 'html.parser')

   return {
       'url': url,
       'collection': parse_collection(soup),
       'title': parse_product_title(soup),
       'sale_price': parse_sale_price(soup),
       'original_price': parse_original_price(soup),
       'discount': parse_discount(soup),
       'number_of_colors': parse_number_of_colors(soup),
       'size': parse_size(soup),
       'fit': parse_fit(soup),
       'bridge': parse_bridge(soup),
       'light_transmission': parse_light_transmission(soup),
       'light_conditions': parse_light_conditions(soup),
       'information_notice': parse_information_notice(soup),
       'product_code': parse_product_code(soup),
       'size_': parse_size_(soup),
       'description': parse_description(soup),
       'content': parse_content(soup)
   }

The scrape_product_details function is the core engine of the scraping process—it’s where the real work of collecting product information happens.


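This function is asynchronous: every Playwright call is awaited, so Python does not sit blocked while the browser is busy loading a page. In this script the products are still visited one at a time, but the same function could be run concurrently later. It uses Playwright to control a web browser and visit a specific product page.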


Once it receives a browser page object and the product URL, it:

  1. Navigates to the product page using page.goto(), with a generous timeout in case the page takes a while to load.

  2. Waits for 2 seconds to give JavaScript content enough time to fully appear on the screen.

  3. Extracts the page’s HTML content and passes it to BeautifulSoup, which helps break down the HTML into something Python can easily work with.
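
A fixed two-second pause can be too short on a slow connection and wasted time on a fast one. One alternative, sketched below, is to wait until a specific element is visible before reading the HTML. The helper name load_product_html and the default selector h1 are assumptions for illustration; page.goto, page.wait_for_selector, and page.content are standard Playwright page methods.

```python
async def load_product_html(page, url, ready_selector="h1", timeout_ms=60000):
    """Open `url` and return its HTML once `ready_selector` is visible.

    `page` is expected to be a Playwright async Page. The selector is a
    placeholder: pick an element that only exists after the product
    details have rendered.
    """
    await page.goto(url, timeout=timeout_ms)
    # Waits until the element is visible (Playwright's default state) and
    # raises a timeout error if it does not show up within `timeout_ms`.
    await page.wait_for_selector(ready_selector, timeout=timeout_ms)
    return await page.content()
```

If the element never appears, the timeout error propagates to the caller, where the try-except block in main can report it like any other failed product.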


Then, the function uses a set of parsing helper functions like parse_product_title, parse_collection, and others. Each one is in charge of pulling out a specific piece of data, like the product’s name, category, price, or size.


All this data is then packed into a dictionary, which makes it easy to store and access each product's details later on.


This structure keeps the scraping logic modular and clean: if one part of the webpage changes, you only need to update the relevant parsing function—not the entire scraping process.
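
One way to make that modularity explicit is to keep the selectors in a single mapping and loop over it. The sketch below is an alternative layout, not the code used in this project; FIELD_SELECTORS, extract_fields, and the sample HTML are assumptions for illustration, with only the two data-test selectors taken from the functions above.

```python
from bs4 import BeautifulSoup

# Field name -> (CSS selector, default value when the element is missing).
FIELD_SELECTORS = {
    "sale_price": ('[data-test="sale-price"]', "N/A"),
    "discount": ('[data-test="percentage-off"]', "0"),
}


def extract_fields(soup, selectors=FIELD_SELECTORS):
    """Return a dict with the stripped text of each selector, or its default."""
    result = {}
    for field, (selector, default) in selectors.items():
        tag = soup.select_one(selector)
        text = tag.get_text(strip=True) if tag else ""
        result[field] = text or default
    return result


# Hypothetical product markup with no discount badge.
sample = BeautifulSoup('<span data-test="sale-price"> $150.00 </span>', "html.parser")
fields = extract_fields(sample)
print(fields)
```

With this layout, a changed page element means editing one entry in the mapping. The trade-off is that fields with special logic, such as the original-price fallback, still need their own function.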


main Function

async def main():
   """
   Main entry point for the web scraping process.

   This function performs the following tasks:
   1. Sets up the database.
   2. Retrieves unscraped URLs from the database.
   3. Launches a Playwright browser instance.
   4. Iterates through unscraped URLs, scraping product details for each.
   5. Saves scraped data to the database and updates the scraped status.
   6. Handles any errors that occur during scraping.
   7. Closes the browser instance after scraping is complete.

   Returns:
       None
   """
   db_name = "oakley_products.db"

   # Set up the database
   setup_database(db_name)

   # Get unscraped URLs from the database
   urls_to_scrape = get_unscraped_urls(db_name)

   async with async_playwright() as playwright:
       browser = await playwright.chromium.launch(headless=False)
       page = await browser.new_page()

       for product_id, url in urls_to_scrape:
           try:
               # Scrape product details
               product_details = await scrape_product_details(page, url)

               # Save scraped product details
               save_scraped_product(db_name, product_details)

               # Update scraped status
               update_scraped_status(db_name, product_id)

               print(f"Successfully scraped and saved product {product_id}")
           except Exception as e:
               print(f"Error scraping product {product_id} ({url}): {e}")

       await browser.close()

The main function acts as the conductor of the entire scraping process—it brings everything together and makes sure each part of the scraper performs its role correctly.


Here’s what it does step-by-step:

  1. Sets up the database using a helper function. It ensures that everything is ready to store the scraped data.

  2. Gets a list of product URLs that haven’t been scraped yet. This smart step helps the scraper resume from where it left off—no wasting time on already-processed products.


Next, it moves into the core scraping loop:

  1. It launches a Playwright browser using a context manager (async with). This ensures that the browser and system resources are cleaned up properly when scraping is done.

  2. For each unscraped product URL:

    • It calls the scrape_product_details() function to extract information.

    • Then it saves that data to the database.

    • And finally, it marks the product as “scraped” to avoid reprocessing it later.


Each product is processed inside its own try-except block, so if one product fails to scrape (due to a timeout or a broken page), the script prints the error and moves on without stopping the whole program. Because a failed product is never marked as scraped, it is picked up again on the next run. That’s important when you're dealing with a large list of products: you want the script to be robust and resilient.
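
If transient failures such as timeouts are common, a small retry wrapper can give each product a few attempts before the error reaches the except branch. This is a sketch of one possible approach, not part of the original script; with_retries and its parameters are hypothetical names.

```python
import asyncio


async def with_retries(make_attempt, attempts=3, base_delay=1.0):
    """Await `make_attempt()` up to `attempts` times, doubling the pause each time.

    `make_attempt` is a zero-argument callable returning a fresh awaitable,
    e.g. lambda: scrape_product_details(page, url).
    """
    for attempt in range(1, attempts + 1):
        try:
            return await make_attempt()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: let the caller's except block report it.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

Inside the loop, the scraping call would then become product_details = await with_retries(lambda: scrape_product_details(page, url)).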


Finally, once everything is done, the function closes the browser and finishes the process.


Script Execution

# Entry point for running the script
if __name__ == "__main__":
   asyncio.run(main())

The final section of the script checks whether the file is being run as the main program, just as we did in the URL-scraping part above. If it is, asyncio.run(main()) starts the event loop and runs main, which is the entry point of the code.


Conclusion


Web scraping Oakley eyewear products highlights how automation can greatly simplify the extraction of structured information from online stores. By using Playwright, we were able to navigate dynamic web pages—ensuring that all product content was fully loaded—before passing the HTML to Beautiful Soup for data extraction. This automated approach eliminates the need for manual data collection and allows for efficient gathering of product names, prices, and specifications.


Such a technique proves highly valuable for tasks like price comparison, market research, and inventory tracking, where access to large volumes of product data in a usable format is essential.


However, web scraping also comes with important responsibilities. Since each website has its own structure and terms of service, it’s critical to respect robots.txt directives and adhere to ethical scraping practices. This includes setting appropriate delays between requests, handling retries gracefully, and avoiding excessive strain on servers.
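
As a small illustration of the robots.txt check, Python's standard library includes urllib.robotparser. The sketch below parses a made-up robots.txt held in memory so it runs offline; the rules, user agent name, and URLs are hypothetical and are not Oakley's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch each URL under these rules.
allowed_product = parser.can_fetch("my-scraper", "https://example.com/product/123")
allowed_checkout = parser.can_fetch("my-scraper", "https://example.com/checkout/cart")
# Crawl-delay, if declared, suggests a pause (in seconds) between requests.
delay_seconds = parser.crawl_delay("my-scraper")

print(allowed_product, allowed_checkout, delay_seconds)
```

In a real run, the parser would be pointed at the site's live robots.txt with set_url() and read(), and any crawl delay it reports could set the pause between page visits.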


This project demonstrates how Playwright and Beautiful Soup can be effectively combined to automate product data collection. It enables more accessible and scalable data-driven research, especially in the dynamic and competitive world of e-commerce.


AUTHOR


I’m Shahana, a Data Engineer at Datahut. I focus on designing intelligent data pipelines that transform complex, dynamic web data into clean, structured insights—empowering brands to make smarter decisions in fashion, eyewear, and e-commerce.


At Datahut, we’ve spent over a decade helping companies leverage automation to streamline product tracking, competitor monitoring, and market research. In this blog, I guide you through how we used Playwright and Beautiful Soup to scrape Oakley eyewear products—extracting specifications, pricing, and availability from dynamic web pages with precision.


If your team is looking to automate product data collection in the eyewear space or beyond, reach out to us through the chat widget on the right. We’d love to help you build a solution that fits your goals.


Frequently Asked Questions (FAQs)


1. Why did this project use both Playwright and Beautiful Soup instead of just one tool?

Playwright and Beautiful Soup serve different purposes. Playwright is used to render JavaScript-heavy pages and interact with dynamic content, such as scrolling and clicking "Load More" buttons. Once the page is fully loaded, Beautiful Soup efficiently parses the HTML and extracts structured data like product names, prices, descriptions, and specifications. Combining both tools provides a reliable and scalable scraping workflow.


2. Why can't Oakley's website be scraped using simple HTTP requests?

Oakley's website relies heavily on JavaScript to load product listings and details dynamically. A simple HTTP request only retrieves the initial HTML and often misses content generated after the page loads. Browser automation tools like Playwright can execute JavaScript, making them ideal for scraping modern e-commerce websites.


3. What product information can be extracted from Oakley's eyewear pages?

The scraper can collect a wide range of product details, including:

  • Product name and collection

  • Sale and original prices

  • Discount percentage

  • Available colors

  • Size and fit information

  • Bridge width

  • Lens light transmission details

  • Product descriptions

  • Product codes

  • Additional technical specifications

This data can be stored in a database for analysis, monitoring, or reporting.


4. How does the scraper avoid getting blocked while collecting data?

The scraper uses several techniques to mimic normal user behavior, including:

  • Rotating user agents

  • Adding delays between actions

  • Browsing categories sequentially

  • Handling retries when requests fail

  • Using a real browser through Playwright

These practices help reduce the chances of triggering anti-bot systems while maintaining responsible scraping behavior.


5. What are the practical uses of scraping Oakley eyewear data?

Scraping eyewear product data can support various business and research applications, such as:

  • Competitor price monitoring

  • Product catalog analysis

  • Market research and trend tracking

  • Inventory and assortment analysis

  • E-commerce intelligence dashboards

  • AI and data science projects requiring structured product datasets

By automating data collection, businesses can gain insights much faster than through manual research.

 
 
