  • Aparna Jayan V

How to Scrape Product Information from Costco using Python



In today's data-driven world, where the ability to collect and analyze information is crucial for success, web scraping has become an indispensable tool for businesses and individuals. It enables the extraction of data from websites, which can provide useful insights, a competitive edge, and support for informed decision-making.


In this blog post, we'll explore how to utilize Python for web scraping and extracting product information from Costco's website. Our focus will be on the "Electronics" category, with specific emphasis on the "Audio/Video" subcategory. Our goal is to extract key features such as product name, brand, color, item ID, category, connection type, price, model, and description for each electronic device.


From this category of products, the following features are extracted:

  • Product URL: URL of each electronic device

  • Product Name: Name of the electronic device as it appears on the website.

  • Brand: Brand name of the electronic device

  • Color: The color of the electronic device

  • Item Id: Unique identifier assigned to a specific electronic device

  • Category: Category type to which the product belongs from the 4 subcategories under Audio/Video

  • Connection Type: Method by which the device connects to other devices or systems

  • Price: Cost of the device

  • Model: Specific version or variant of a device

  • Description: Overall functionality and key features of the device


Getting started with scraping Costco product data


Before we dive into the code, we'll need to install a few libraries and dependencies. We'll be using Python for our scraping, along with two popular libraries for web scraping: Beautiful Soup and Selenium. Beautiful Soup allows us to parse HTML and XML documents, while Selenium automates web browsers for testing and scraping purposes.


Once we have our libraries installed, we'll inspect the website structure to identify the elements we need to extract. This will involve examining the HTML code for the website and identifying the specific tags and attributes that contain the data we're interested in.

With this information in hand, we'll begin writing our Python code to scrape the website.


We'll use Beautiful Soup to extract the data and Selenium to automate the browser actions needed to scrape the website. Once we have our script written, we'll run it and save the data to a CSV file for easy analysis.


Importing Required Libraries:


# Importing necessary libraries

import pandas as pd
from lxml import etree as et
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Suppress pandas' chained-assignment warning
pd.options.mode.chained_assignment = None

  • Pandas is a library for data manipulation and analysis. It is used to store and manipulate the data scraped from the website. We use ‘pandas’ here to convert the data from a dictionary into a DataFrame, which is more suitable for manipulation and analysis, and then to save the DataFrame as a CSV file so that it is easy to open in other software.

  • lxml is a library for processing XML and HTML documents. It is used to parse the HTML or XML content of a webpage. Here, ‘et’ is the alias for etree (Element Tree), a module in the ‘lxml’ library that provides a simple and efficient way to navigate and search the tree-like structure of XML and HTML documents.

  • BeautifulSoup is a library that makes it easy to scrape information from web pages. It parses the HTML or XML content of a webpage and lets you extract the data you're interested in. It is used here to parse the HTML content obtained from each page.

  • Selenium is a library that allows you to automate web browsers. It is used to automate the process of navigating to a webpage and interacting with it, like clicking buttons and filling out forms.

  • Webdriver is the Selenium module used to interact with web browsers. It allows you to control the browser and execute JavaScript commands. It is used here to create a web driver instance and navigate to a specific URL, which gives us the source code of the page to parse and analyze.


driver = webdriver.Firefox()

One of the most important things you'll need to do when using Selenium is to create an instance of a web driver. A web driver is a class that interacts with a specific web browser, such as Chrome, Firefox, or Edge. In this code snippet, we're creating an instance of the Firefox web driver by using webdriver.Firefox(). This line of code allows us to control the Firefox browser and interact with web pages just like a user would.


With the web driver, we can navigate between pages, interact with page elements, fill out forms, click buttons, and extract the necessary information. This lets us automate tasks and gather data far more efficiently than collecting it by hand. With Selenium and web drivers, you can automate your data collection process like a pro!



Understanding the web scraping functions


Now that we have a basic understanding of web scraping and the tools we'll be using, it's time to dive into the code. We'll take a closer look at the functions defined for the web scraping process. Defining functions keeps the code organized, reusable, and maintainable, making it easier to understand, debug, and update.


We'll explain the purpose of each function defined and how it contributes to the overall process.


Function to extract content:



# Function to extract content from page

def extract_content(url):
    driver.get(url)
    page_content = driver.page_source
    soup = BeautifulSoup(page_content, 'html.parser')
    return soup
    

A function called extract_content is created, which takes a single argument, url. It uses Selenium to navigate to that URL, retrieves the page source, and parses it into a BeautifulSoup object using Python's built-in 'html.parser'. We can use the returned soup object to navigate and search the HTML document's tree-like structure and extract the information we need from the page.
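
To see what the returned object offers without opening a browser, here is a minimal sketch that parses a small, made-up HTML snippet. The markup and its values are assumptions for illustration only; the conversion to an lxml tree is optional and is only needed if you prefer XPath queries.

```python
from bs4 import BeautifulSoup
from lxml import etree as et

# Hypothetical markup standing in for driver.page_source
sample_html = """
<html><body>
  <h1 automation-id="productName">Example Soundbar</h1>
  <div itemprop="brand">ExampleBrand</div>
</body></html>
"""

# Same parsing step as in extract_content()
soup = BeautifulSoup(sample_html, 'html.parser')
name = soup.find('h1', {'automation-id': 'productName'}).text

# Optional: build an lxml tree from the soup to run XPath queries
dom = et.HTML(str(soup))
brand = dom.xpath('//div[@itemprop="brand"]/text()')[0]

print(name, '|', brand)
```

The soup object supports find() and find_all(), while the lxml tree supports xpath(); both views describe the same document.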



Function to click on a URL:



This function uses the find_element() method with By.XPATH to locate the “Audio/Video” category link on the Costco electronics page and the click() method to navigate to that page. This lets us open the specific link on the website by clicking on it and then extract the contents of that page.



# Function to click the electronics category 'Audio/Video' and extract content from the page

def click_url(driver):
    # Locate the third link inside the "navpills-sizing" element and click it
    driver.find_element(By.XPATH, '//*[@id="navpills-sizing"]/a[3]').click()
    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup
    

Function to extract category links:



Upon navigating to the Audio/Video category, this function extracts the links of the 4 subcategories displayed, allowing for further scraping on those specific pages. The find_all() method of the soup object finds every div element with the class "col-xs-12 col-lg-6 col-xl-3", and the “href” attribute of each “a” element inside those divs is collected. Only the first four links are kept, one for each subcategory.



# Function to get the urls of sub categories under Audio/Video

def category_links(soup):
    category_link = []
    for div in soup.find_all('div', attrs={"class": "col-xs-12 col-lg-6 col-xl-3"}):
        for links in div.find_all('a'):
            category_link.append(links['href'])
    # Keep only the first four links, one per subcategory
    category_link = category_link[:4]
    return category_link
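
The function above uses each href value exactly as it appears in the page. If the site were to return relative paths instead (an assumption, not something observed in this post), they could be resolved against the page URL first. The helper name absolute_links and the sample paths below are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.costco.com/electronics.html'

def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url; absolute URLs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]

# Made-up hrefs: one relative, one already absolute
links = absolute_links([
    '/example-category.html',
    'https://www.costco.com/other-example.html',
])
print(links)
```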
    

Function to extract product links:


With the 4 subcategory links obtained, we will now proceed to scrape all the links of the products present under these categories.



# Function to extract urls of products and adding it to the dataframe

def product_links(soup):
    product_urls = []
    for links in category_links(soup):
        content = extract_content(links)
        for product_section in content.find_all('div', {'automation-id': 'productList'}):
            for product_link in product_section.find_all('a'):
                product_urls.append(product_link['href'])
    # Remove duplicates, then keep only links that point to product pages
    product_urls = list(set(product_urls))
    valid_urls = [url for url in product_urls if url.endswith('.html')]
    data['product_url'] = valid_urls

This function uses the previously defined category_links() and extract_content() functions to navigate to each subcategory page and extract the links of all the products listed there. It uses the find_all() method of the content object to locate the elements with the automation-id "productList", collects the “href” attribute of every “a” element inside them, removes duplicates, and keeps only the links that end with ".html".
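
One side effect of list(set(...)) is that the original order of the links is lost. If the order on the page matters to you, a small variation keeps the first occurrence of each link. This is a sketch with made-up URLs, and unique_product_urls is a hypothetical helper name, not part of the script above.

```python
def unique_product_urls(urls):
    """Drop duplicates (keeping first-seen order) and keep only .html links."""
    deduped = list(dict.fromkeys(urls))  # dicts preserve insertion order
    return [url for url in deduped if url.endswith('.html')]

sample = [
    'https://www.example.com/speaker-a.product.100.html',
    'https://www.example.com/compare',
    'https://www.example.com/speaker-a.product.100.html',
    'https://www.example.com/tv-b.product.200.html',
]
result = unique_product_urls(sample)
print(result)
```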


Function to extract product name:


With the links of all the products obtained, we will now proceed to scrape the necessary features of each product. The function uses a try-except block to handle any errors that may occur while extracting the features.



# Function to extract product name

def get_product_name(soup):
    try:
        name = soup.find('h1', {'automation-id': 'productName'}).text
    except Exception:
        # find() returned None, so the element is missing from the page
        name = "Product name is not available"
    # 'data' and 'product' are the global dataframe and the current row index
    data.loc[product, 'product_name'] = name
    return name

Inside the try block, the function uses the find() method of the soup object to select the text of the h1 element that has the automation-id "productName". If the product name is not available, the function assigns the value "Product name is not available" to the 'product_name' column in the dataframe “data” at the position of the current product.
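
The same "find the element or fall back to a default" pattern repeats in most of the functions that follow. As a sketch of how it could be factored out, here is a hypothetical helper, text_or_default, applied to a made-up snippet. It checks for a missing element explicitly instead of relying on an exception.

```python
from bs4 import BeautifulSoup

def text_or_default(soup, tag, attrs, default):
    """Return the stripped text of the first matching element, or default."""
    element = soup.find(tag, attrs)
    return element.get_text(strip=True) if element is not None else default

# Hypothetical product page fragment with a name but no brand
page = BeautifulSoup('<h1 automation-id="productName"> Example Speaker </h1>', 'html.parser')

found = text_or_default(page, 'h1', {'automation-id': 'productName'},
                        'Product name is not available')
missing = text_or_default(page, 'div', {'itemprop': 'brand'},
                          'Brand is not available')
print(found, '|', missing)
```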


Function to extract brand of the product:



# Function to extract brand of the product

def get_brand(soup):
    try:
        product_brand = soup.find('div', {'itemprop': 'brand'}).text
    except Exception:
        product_brand = "Brand is not available"
    data.loc[product, 'brand'] = product_brand
    return product_brand

The function uses the find() method of the soup object to select the text of the div element that has the itemprop "brand". If the brand name is not available, the function assigns the value "Brand is not available" to the column “brand”.


Function to extract the price of the product:



# Function to extract price of the product

def get_price(soup):
    try:
        product_price = soup.find('span', {'automation-id': 'productPriceOutput'}).text
        # Remove stray leading/trailing "-" and "." characters
        product_price = product_price.strip("-.")
        if product_price == '':
            product_price = "Price is not available"
    except Exception:
        # The price element is missing from the page
        product_price = "Price is not available"
    data.loc[product, 'price'] = product_price
    return product_price
        

The function uses the find() method of the soup object to select the text of the span element that has the automation-id "productPriceOutput". If the price is not available, the function assigns the value "Price is not available" to the column “price”.
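
The price is stored as text. If you later want to sort or compare prices, the text has to be converted to a number. The sketch below assumes US-style price strings such as "$1,299.99"; that format, and the helper name parse_price, are assumptions rather than something the script above guarantees.

```python
from decimal import Decimal, InvalidOperation

def parse_price(text):
    """Convert a price string such as '$1,299.99' to Decimal, or return None."""
    cleaned = text.strip().strip('-.').replace('$', '').replace(',', '')
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Nothing numeric was left after cleaning
        return None

regular = parse_price('$1,299.99')
placeholder = parse_price('--')
print(regular, placeholder)
```

Decimal is used here instead of float so that currency amounts are not affected by binary rounding.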


Function to extract item Id of the product:



# Function to extract item id of the product

def get_item_id(soup):
    try:
        product_id = soup.find('input', {'name': 'addedItem'})['value']
    except Exception:
        # The input element or its 'value' attribute is missing
        product_id = "Item Id is not available"
    data.loc[product, 'item_id'] = product_id
    return product_id

This function uses the find() method of the soup object to read the “value” attribute of the input element that has the name "addedItem". If the product id is not available, the function assigns the value "Item Id is not available" to the column “item_id”.


Function to extract description of the product:



# Function to extract description of the product

def get_description(soup):
    try:
        product_description = soup.find('div', {'itemprop': 'description'}).text
        # Trim surrounding newlines and spaces
        product_description = product_description.strip('\n ')
        if product_description == '':
            product_description = "Description is not available"
    except Exception:
        # The description element is missing from the page
        product_description = "Description is not available"
    data.loc[product, 'description'] = product_description
    return product_description
        

The function uses the find() method of the soup object to select the text of the div element that has the itemprop "description". If the product description is not available, the function assigns the value "Description is not available" to the “description” column.


Function to extract model of the product:


This function uses the find_all() method of the soup object to collect the specification names and their values from the product page, then picks the value whose name is "Model". If the product model is not available, the function assigns the value "Model is not available" to the “model” column.



# Function to extract model of the product

def get_model(soup):
    product_model = "Model is not available"
    try:
        # Specification names and values are stored in parallel div elements
        keys = soup.find_all('div', {'class': 'spec-name col-xs-6 col-md-5 col-lg-4'})
        values = soup.find_all('div', {'class': 'col-xs-6 col-md-7 col-lg-8'})
        for key, value in zip(keys, values):
            if key.text == 'Model':
                product_model = value.text
    except Exception:
        pass
    data.loc[product, 'model'] = product_model
    return product_model
        

Function to extract connection type of the product:



# Function to extract connection type of the product

def get_connection_type(soup):
    product_connection = "Connection type is not available"
    try:
        # Specification names and values are stored in parallel div elements
        keys = soup.find_all('div', {'class': 'spec-name col-xs-6 col-md-5 col-lg-4'})
        values = soup.find_all('div', {'class': 'col-xs-6 col-md-7 col-lg-8'})
        for key, value in zip(keys, values):
            if key.text == 'Connection Type':
                product_connection = value.text
    except Exception:
        pass
    data.loc[product, 'connection_type'] = product_connection
    return product_connection
        

The function uses the find_all() method of the soup object to collect the specification names and their values, then picks the value whose name is "Connection Type". If the product connection type is not available, the function assigns the value "Connection type is not available" to the 'connection_type' column.
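
Since the model and the connection type come from the same specification list, the names and values can also be paired once into a dictionary and looked up by name. The sketch below runs on made-up markup that reuses the class names from the functions above; spec_table is a hypothetical helper, and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical specification markup
sample_specs = """
<div class="row">
  <div class="spec-name col-xs-6 col-md-5 col-lg-4">Model</div>
  <div class="col-xs-6 col-md-7 col-lg-8">EX-100</div>
</div>
<div class="row">
  <div class="spec-name col-xs-6 col-md-5 col-lg-4">Connection Type</div>
  <div class="col-xs-6 col-md-7 col-lg-8">Bluetooth</div>
</div>
"""

def spec_table(soup):
    """Pair each specification name with its value in a dictionary."""
    keys = soup.find_all('div', {'class': 'spec-name col-xs-6 col-md-5 col-lg-4'})
    values = soup.find_all('div', {'class': 'col-xs-6 col-md-7 col-lg-8'})
    return {k.get_text(strip=True): v.get_text(strip=True)
            for k, v in zip(keys, values)}

specs = spec_table(BeautifulSoup(sample_specs, 'html.parser'))
model = specs.get('Model', 'Model is not available')
weight = specs.get('Weight', 'Weight is not available')
print(specs)
```

With dict.get(), a missing specification falls back to the default text without needing a try-except block.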


Function to extract category type of the product:



# Function to extract category type of the product

def get_category(soup):
    try:
        # BeautifulSoup has no xpath(), so build an lxml tree from the soup first
        dom = et.HTML(str(soup))
        product_category = str(dom.xpath('(//*[@itemprop="name"]/text())[10]')[0]).strip()
        if product_category == '':
            product_category = "Category is not available"
    except Exception:
        # Fewer than 10 matching elements were found
        product_category = "Category is not available"
    data.loc[product, 'category'] = product_category
    return product_category
        

The function uses the xpath() method of the dom object to select the text of the 10th element that has the itemprop "name". If the product category is not available, the function assigns the value "Category is not available" to the 'category' column.


Function to extract colour of the product:



# Function to extract colour of the product

def get_colour(soup):
    try:
        # BeautifulSoup has no xpath(), so build an lxml tree from the soup first
        dom = et.HTML(str(soup))
        product_colour = str(dom.xpath('//*[text()="Color"]/following::div[1]/text()')[0])
    except Exception:
        product_colour = "Colour is not available"
    data.loc[product, 'colour'] = product_colour
    return product_colour
        

This function uses the xpath() method to select the text of the first div element that follows, in document order, the element containing the text "Color". If the product color is not available, the function assigns the value "Colour is not available" to the 'colour' column.
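
To make the XPath expression easier to follow, here is a minimal sketch that runs it against a made-up fragment with lxml. The markup is an assumption for illustration; the expression is the one used in get_colour().

```python
from lxml import etree as et

# Hypothetical specification fragment
sample = """
<div class="specs">
  <div>Color</div>
  <div>Black</div>
  <div>Model</div>
  <div>EX-100</div>
</div>
"""

dom = et.HTML(sample)

# following::div[1] picks the first div after the "Color" label
colour = dom.xpath('//*[text()="Color"]/following::div[1]/text()')[0]

# A label that is not on the page yields an empty list
missing = dom.xpath('//*[text()="Weight"]/following::div[1]/text()')
print(colour, missing)
```

Because xpath() returns a list, indexing it with [0] raises an IndexError when nothing matches, which is what triggers the fallback value in the function above.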


Starting the Scraping Process: Bringing it all together


With all the required functions defined, we can now begin the scraping process by calling each of them in turn to retrieve the desired data.



# Costco electronics categories link
url = 'https://www.costco.com/electronics.html'

driver.get(url)

url_content=click_url(driver)

The first step is to navigate to the Costco electronic categories page using the webdriver and the specified URL. We will then use the click_url() function to click on the Audio/Video category and extract the HTML content of the page.



# Creating a dictionary with required columns
data_dic = {'product_url': [], 'item_id': [], 'brand': [], 'product_name': [],
            'category': [], 'model': [], 'price': [], 'colour': [],
            'connection_type': [], 'description': []}

# Creating a dataframe
data = pd.DataFrame(data_dic)

To store the scraped data, we will create a dictionary with the required columns such as 'product_url', 'item_id', 'brand', 'product_name', 'colour', 'model', 'price', 'connection_type', 'category', 'description'. We will then create a dataframe using this dictionary, named 'data', which will be used to store all the scraped data.
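
As an alternative sketch (not the approach used in this post), the records can be collected as a list of dictionaries and turned into a dataframe in one step at the end. The rows below are made up for illustration.

```python
import pandas as pd

columns = ['product_url', 'item_id', 'brand', 'product_name', 'category',
           'model', 'price', 'colour', 'connection_type', 'description']

# Made-up records standing in for scraped products
rows = [
    {'product_url': 'https://www.example.com/a.html', 'brand': 'ExampleBrand', 'price': '$99.99'},
    {'product_url': 'https://www.example.com/b.html', 'brand': 'OtherBrand'},
]

# Fields missing from a record become NaN; columns keep the declared order
frame = pd.DataFrame(rows, columns=columns)
print(frame[['product_url', 'brand', 'price']])
```

Building the dataframe once from complete records avoids writing into individual cells while the scrape is still running.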



# Scraping product links and adding them to the dataframe column 'product_url'

product_links(url_content)

The script is now calling the product_links(url_content) function, which extracts the links of all the products present under the 4 subcategories of the Audio/Video category. These links are then added to the 'product_url' column of the dataframe 'data'.



# Scraping all the required features of each product

for product in range(len(data)):
    product_url = data['product_url'].iloc[product]
    product_content = extract_content(product_url)

    #model
    get_model(product_content)
    
    #brand
    get_brand(product_content)
    
    #connection type
    get_connection_type(product_content)
    
    #price
    get_price(product_content)
    
    #colour
    get_colour(product_content)
    
    #item id
    get_item_id(product_content)
    
    #category type 
    get_category(product_content)
    
    #description
    get_description(product_content)
    
    #product name
    get_product_name(product_content)
    

This code iterates through each product in the 'data' dataframe, extracting the product URL from the 'product_url' column and using the extract_content() function to retrieve the HTML content of the product page. It then calls the previously defined functions to extract specific features such as the model, brand, connection type, price, color, item id, category, description, and product name, and assigns these values to the respective columns of the dataframe at the specified index, effectively scraping all necessary information for each product.



data.to_csv('costco_data.csv')

With this final line of code, the dataframe 'data' containing all the scraped information for each product is exported to a CSV file named 'costco_data.csv'. The file can then be opened in other tools for further analysis or use.
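
By default, to_csv() also writes the dataframe's row index as an extra first column. If you don't want that column in the file, index=False leaves it out. The sketch below uses a small made-up dataframe and an in-memory buffer instead of a file, so it can be run without touching the disk.

```python
import io
import pandas as pd

# Made-up data standing in for the scraped dataframe
frame = pd.DataFrame({'item_id': ['100', '200'], 'price': ['$99.99', '$149.99']})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False omits the row-number column
csv_text = buffer.getvalue()

# Read the text back, keeping every column as a string
restored = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(csv_text)
```

Passing dtype=str when reading keeps identifiers such as the item id from being converted to numbers.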


Conclusion


We have learned how to use Python and its web scraping libraries to extract product information from Costco's website, specifically focusing on the "Audio/Video" subcategory of the "Electronics" category. We walked through the process of inspecting the website structure, identifying the elements to extract, and writing Python code to automate the scraping process.


By mastering the basics of web scraping, you can unlock a world of valuable data that can be used for a wide range of applications, from market research to data analysis and beyond. With the ability to extract and analyze data from any website, the possibilities are endless.


We hope this blog post has provided you with a solid foundation in web scraping techniques and inspired you to explore the many possibilities that web scraping has to offer. So, what are you waiting for? Start exploring, and see what insights you can uncover with the power of web scraping.


Ready to discover the power of web scraping for your business? Contact Datahut to learn more.
