In today's data-driven world, the ability to collect and analyze information is crucial, and web scraping has become an indispensable tool for businesses and individuals. It enables the extraction of data from websites, which can provide useful insights, a competitive edge, and a basis for informed decisions.
In this blog post, we'll explore how to utilize Python for web scraping and extracting product information from Costco's website. Our focus will be on the "Electronics" category, with specific emphasis on the "Audio/Video" subcategory. Our goal is to extract key features such as product name, brand, color, item ID, category, connection type, price, model, and description for each electronic device.
From this category of products, the following features are extracted:
Product URL: URL of each electronic device
Product Name: Name of the electronic device as it appears on the website.
Brand: Brand name of the electronic device
Color: The color of the electronic device
Item Id: Unique identifier assigned to a specific electronic device
Category: Category type to which the product belongs from the 4 subcategories under Audio/Video
Connection Type: Method by which the device connects to other devices or systems
Price: Cost of the device
Model: Specific version or variant of a device
Description: Overall functionality and key features of the device
Getting started with scraping Costco product data
Before we dive into the code, we'll need to install a few libraries and dependencies. We'll be using Python for our scraping, along with two popular libraries for web scraping: Beautiful Soup and Selenium. Beautiful Soup allows us to parse HTML and XML documents, while Selenium automates web browsers for testing and scraping purposes.
Once we have our libraries installed, we'll inspect the website structure to identify the elements we need to extract. This will involve examining the HTML code for the website and identifying the specific tags and attributes that contain the data we're interested in.
With this information in hand, we'll begin writing our Python code to scrape the website.
We'll use Beautiful Soup to extract the data and Selenium to automate the browser actions needed to scrape the website. Once we have our script written, we'll run it and save the data to a CSV file for easy analysis.
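Installing Required Packages:
The packages can be installed with pip (pandas, lxml, beautifulsoup4, and selenium) and are then imported as follows.
# Importing necessary libraries
import pandas as pd
from lxml import etree as et
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Silence pandas' chained-assignment warning for the column updates used later
pd.options.mode.chained_assignment = None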
Pandas is a library for data manipulation and analysis. Here it stores the scraped data: we convert a dictionary to a DataFrame, which is better suited to manipulation and analysis, and then save the DataFrame as a CSV file so it is easy to open in other software.
lxml is a library for processing XML and HTML documents. We import its etree module as "et" (short for Element Tree), which provides a simple and efficient way to navigate and search the tree-like structure of an HTML document, including with XPath expressions.
BeautifulSoup is a library that makes it easy to scrape information from web pages. It parses the HTML or XML content of a page and lets you extract the data you're interested in. It is used here to parse the HTML obtained from each web page.
Selenium is a library that allows you to automate web browsers. It is used to navigate to a webpage and interact with it, for example by clicking buttons and filling out forms.
Webdriver is the Selenium module that interacts with web browsers. It allows you to control the browser and execute JavaScript commands. Here we create an instance of a web driver, navigate to a specific URL, and read the source code of the page, which can then be parsed and analyzed.
driver = webdriver.Firefox()
One of the most important things you'll need to do when using Selenium is to create an instance of a web driver. A web driver is a class that interacts with a specific web browser, such as Chrome, Firefox, or Edge. In this code snippet, we're creating an instance of the Firefox web driver by using webdriver.Firefox(). This line of code allows us to control the Firefox browser and interact with web pages just like a user would.
With the web driver, we can navigate between pages, interact with a page's elements, fill out forms, click buttons, and extract the necessary information. This lets us automate data collection that would otherwise have to be done by hand.
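A browser started by Selenium keeps running until it is closed explicitly, so it helps to guarantee that quit() is called even when the scraping code raises an error. Here is a minimal sketch of one way to do that. The helper name managed_driver and the FakeDriver stand-in are assumptions made for illustration and are not part of the original script.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when an exception is raised
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as fake:
    pass  # scraping calls would go here
```

With Selenium installed, the same helper could be used as "with managed_driver(webdriver.Firefox) as driver:", since webdriver.Firefox can be called without arguments and the resulting driver has a quit() method.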
Understanding the web scraping functions
Now that we have a basic understanding of web scraping and the tools we'll be using, it's time to dive into the code. We'll take a closer look at the functions defined for the web scraping process. Splitting the work into functions keeps the code organized and reusable, and makes it easier to understand, debug, and update.
We'll explain the purpose of each function and how it contributes to the overall process.
Function to extract content:
# Function to extract content from page
def extract_content(url):
    driver.get(url)
    page_content = driver.page_source
    soup = BeautifulSoup(page_content, 'html.parser')
    return soup
The extract_content function takes a single argument, url. It uses Selenium to navigate to that URL, retrieves the page source, and parses it into a BeautifulSoup object with Python's built-in html.parser. We can use the returned soup object to navigate and search the HTML document's tree-like structure and extract the information we need from the page. For the functions further below that rely on XPath, the same content can be converted to an lxml Element Tree with et.HTML(str(soup)).
Function to click on a URL:
This function uses the find_element() method with By.XPATH to locate the "Audio/Video" category link on the Costco electronics page and the click() method to navigate to it. It then parses the resulting page so that its contents can be extracted.
# Function to click the electronics category 'Audio/Video' and extract content from the page
def click_url(driver):
    # Third link inside the element with id "navpills-sizing"
    driver.find_element(By.XPATH, '//*[@id="navpills-sizing"]/a[3]').click()
    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')
    # Return the parsed page so the caller can pass it to product_links()
    return soup
Function to extract category links:
Upon navigating to the Audio/Video category, this function extracts the links of the 4 subcategories displayed, allowing for further scraping on those specific pages. The find_all() method of the soup object finds every div element with the class "col-xs-12 col-lg-6 col-xl-3". The "href" attribute of each "a" element inside those divs is collected, and only the first 4 links are kept.
# Function to get the urls of sub categories under Audio/Video
def category_links(soup):
    category_link = []
    for div in soup.find_all('div', attrs={"class": "col-xs-12 col-lg-6 col-xl-3"}):
        for links in div.find_all('a'):
            category_link.append(links['href'])
    # Keep only the first 4 links (the Audio/Video subcategories)
    category_link = category_link[:4]
    return category_link
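Depending on how a page is built, the collected href values may be relative paths rather than full URLs, and driver.get() needs a full URL. This is a minimal sketch of normalising links with the standard library's urljoin. The base URL constant, the helper name absolute_links, and the example paths are assumptions for illustration, not real Costco links.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.costco.com/"  # assumed base; adjust to the site being scraped


def absolute_links(hrefs, base=BASE_URL):
    """Return each href as an absolute URL; already-absolute links are unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Made-up example paths used only to show the behaviour
links = absolute_links(["/example-category.html", "https://www.costco.com/other.html"])
```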
Function to extract product links:
With the 4 subcategory links obtained, we will now proceed to scrape all the links of the products present under these categories.
# Function to extract urls of products and add them to the dataframe
def product_links(soup):
    product_urls = []
    for links in category_links(soup):
        content = extract_content(links)
        for product_section in content.find_all('div', {'automation-id': 'productList'}):
            # href=True skips anchors that have no href attribute
            for product_link in product_section.find_all('a', href=True):
                product_urls.append(product_link['href'])
    # Remove duplicates, then keep only links to product pages
    product_urls = list(set(product_urls))
    valid_urls = [url for url in product_urls if url.endswith('.html')]
    data['product_url'] = valid_urls
    return
This function makes use of the category_links() and extract_content() functions defined earlier to navigate to each subcategory page and extract the links of all the products listed there. It uses the find_all() method of the content object to select the div elements whose automation-id is "productList", collects the "href" attribute of every "a" element inside them, removes duplicates, and keeps only the links that end with ".html".
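Converting a list to a set removes duplicates but does not preserve the order in which the links appeared on the page. If a stable order matters, for example to get reproducible CSV output, the same filtering can be sketched as follows. The helper name unique_product_urls and the sample paths are illustrative assumptions.

```python
def unique_product_urls(urls):
    """Drop duplicates while keeping first-seen order, then keep only '.html' links."""
    deduplicated = list(dict.fromkeys(urls))  # dicts preserve insertion order
    return [url for url in deduplicated if url.endswith(".html")]


# Made-up sample links
sample = ["/a.html", "/b.html", "/a.html", "/help"]
result = unique_product_urls(sample)
```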
Function to extract product name:
With the links of all the products obtained, we will now proceed to scrape the necessary features of each product. The function uses a try-except block to handle any errors that may occur while extracting the features.
# Function to extract product name
def get_product_name(soup):
    try:
        name = soup.find('h1', {'automation-id': 'productName'}).text
    except AttributeError:
        # find() returned None because the element is missing
        name = "Product name is not available"
    # Write the value into the row of the current product
    data.loc[product, 'product_name'] = name
    return name
Inside the try block, the function uses the find() method of the soup object to select the text of the h1 element whose automation-id attribute is "productName". If the product name is not available, the function assigns the value "Product name is not available" to the 'product_name' column of the dataframe "data" at the position of the current product.
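The find-and-fallback pattern used for the product name can be tried on a static snippet without opening a browser. This is a minimal sketch in which the sample HTML and the helper name product_name_from are made up for illustration. Only the automation-id value comes from the function above.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the attribute the scraper looks for
SAMPLE_HTML = '<html><body><h1 automation-id="productName">Example Speaker</h1></body></html>'


def product_name_from(html):
    """Return the product name, or a placeholder when the h1 element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h1", {"automation-id": "productName"})
    if tag is None:
        return "Product name is not available"
    return tag.get_text(strip=True)


found = product_name_from(SAMPLE_HTML)
missing = product_name_from("<html><body></body></html>")
```

Checking for None explicitly is an alternative to catching the exception raised when .text is read from a missing element.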
Function to extract brand of the product:
# Function to extract brand of the product
def get_brand(soup):
    try:
        product_brand = soup.find('div', {'itemprop': 'brand'}).text
    except AttributeError:
        # find() returned None because the element is missing
        product_brand = "Brand is not available"
    data.loc[product, 'brand'] = product_brand
    return product_brand
The function uses the find() method of the soup object to select the text of the div element whose itemprop attribute is "brand". If the brand name is not available, the function assigns the value "Brand is not available" to the "brand" column.
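Since every extraction function repeats the same try-except pattern, the fallback logic could be factored into one helper. This is a minimal sketch assuming a hypothetical helper named safe_extract, which is not part of the original script. A plain dictionary stands in for a parsed page so the example runs on its own.

```python
def safe_extract(extractor, source, default):
    """Run extractor(source) and return default if it fails or yields nothing."""
    try:
        value = extractor(source)
    except (AttributeError, KeyError, IndexError, TypeError):
        # Typical errors when an element or attribute is missing
        return default
    return value if value else default


# A dictionary stands in for a parsed page in this illustration
page = {"brand": "ExampleBrand"}
brand = safe_extract(lambda p: p["brand"], page, "Brand is not available")
model = safe_extract(lambda p: p["model"], page, "Model is not available")
```

Catching only the expected exception types, rather than using a bare except, keeps unrelated bugs visible.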
Function to extract the price of the product:
# Function to extract price of the product
def get_price(soup):
    try:
        product_price = soup.find('span', {'automation-id': 'productPriceOutput'}).text
        # Strip the '-' and '.' placeholder characters around the price
        product_price = product_price.strip("-.")
        if product_price == '':
            product_price = "Price is not available"
    except AttributeError:
        # find() returned None because the element is missing
        product_price = "Price is not available"
    data.loc[product, 'price'] = product_price
    return product_price
The function uses the find() method of the soup object to select the text of the span element whose automation-id is "productPriceOutput". Leading and trailing "-" and "." characters are stripped, and if nothing remains, the function assigns the value "Price is not available" to the "price" column.
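The price is stored as text, which is fine for a CSV file but awkward for sorting or arithmetic. Here is a minimal sketch of converting it to a number, assuming prices are formatted like "$1,299.99". That format and the helper name parse_price are assumptions, not something confirmed by the page.

```python
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert text such as '$1,299.99' to a Decimal, or return None if it is not a number."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Covers placeholders and messages such as 'Price is not available'
        return None


price = parse_price("$1,299.99")
unavailable = parse_price("Price is not available")
```

Decimal avoids the rounding surprises of binary floating point, which matters when prices are summed or compared.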
Function to extract item Id of the product:
# Function to extract item id of the product
def get_item_id(soup):
    try:
        product_id = soup.find('input', {'name': 'addedItem'})['value']
    except (TypeError, KeyError):
        # The input element or its value attribute is missing
        product_id = "Item Id is not available"
    data.loc[product, 'item_id'] = product_id
    return product_id
This function uses the find() method of the soup object to select the input element whose name is "addedItem" and reads its "value" attribute. If the item id is not available, the function assigns the value "Item Id is not available" to the "item_id" column.
Function to extract description of the product:
# Function to extract description of the product
def get_description(soup):
    try:
        product_description = soup.find('div', {'itemprop': 'description'}).text
        # Remove surrounding newlines and spaces
        product_description = product_description.strip('\n ')
        if product_description == '':
            product_description = "Description is not available"
    except AttributeError:
        # find() returned None because the element is missing
        product_description = "Description is not available"
    data.loc[product, 'description'] = product_description
    return product_description
The function uses the find() method of the soup object to select the text of the div element whose itemprop attribute is "description", then strips surrounding newlines and spaces. If the product description is not available, the function assigns the value "Description is not available" to the "description" column.
Function to extract model of the product:
This function uses find_all() to collect the specification names and the specification values from the product page as two parallel lists, then takes the value whose name is "Model". If the product model is not available, the function assigns the value "Model is not available" to the "model" column.
# Function to extract model of the product
def get_model(soup):
    # Default used when no 'Model' row exists in the specifications
    product_model = "Model is not available"
    keys = soup.find_all('div', {'class': 'spec-name col-xs-6 col-md-5 col-lg-4'})
    values = soup.find_all('div', {'class': 'col-xs-6 col-md-7 col-lg-8'})
    # keys and values are parallel lists: one pair per specification row
    for key, value in zip(keys, values):
        if key.text == 'Model':
            product_model = value.text
    data.loc[product, 'model'] = product_model
    return product_model
Function to extract connection type of the product:
# Function to extract connection type of the product
def get_connection_type(soup):
    # Default used when no 'Connection Type' row exists in the specifications
    product_connection = "Connection type is not available"
    keys = soup.find_all('div', {'class': 'spec-name col-xs-6 col-md-5 col-lg-4'})
    values = soup.find_all('div', {'class': 'col-xs-6 col-md-7 col-lg-8'})
    # keys and values are parallel lists: one pair per specification row
    for key, value in zip(keys, values):
        if key.text == 'Connection Type':
            product_connection = value.text
    data.loc[product, 'connection_type'] = product_connection
    return product_connection
The function works the same way as the model function: it collects the specification names and values with find_all() and takes the value whose name is "Connection Type". If the product connection type is not available, the function assigns the value "Connection type is not available" to the 'connection_type' column.
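The model and connection type functions both walk the same two parallel lists of specification names and values. Building a dictionary once makes each lookup a one-liner. In this minimal sketch, plain strings stand in for the text of the div elements, and the helper name spec_lookup and the sample values are illustrative assumptions.

```python
def spec_lookup(names, values):
    """Pair specification names with their values; whitespace around both is removed."""
    return {name.strip(): value.strip() for name, value in zip(names, values)}


# Made-up specification rows
specs = spec_lookup(
    ["Brand", " Model ", "Connection Type"],
    ["ExampleBrand", "EX-100", "Bluetooth\n"],
)
model = specs.get("Model", "Model is not available")
connection = specs.get("Connection Type", "Connection type is not available")
weight = specs.get("Weight", "Weight is not available")
```

dict.get() supplies the fallback text directly, so no try-except block is needed for a missing row.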
Function to extract category type of the product:
# Function to extract category type of the product
def get_category(soup):
    # XPath needs an lxml tree, so convert the BeautifulSoup object first
    dom = et.HTML(str(soup))
    try:
        product_category = dom.xpath('(//*[@itemprop="name"]/text())[10]')[0]
        if product_category.strip() == '':
            product_category = "Category is not available"
    except IndexError:
        # The XPath expression matched nothing
        product_category = "Category is not available"
    data.loc[product, 'category'] = product_category
    return product_category
The function uses the xpath() method of the dom object to select the text of the 10th element that has the itemprop "name". If the product category is not available, the function assigns the value "Category is not available" to the 'category' column.
Function to extract colour of the product:
# Function to extract colour of the product
def get_colour(soup):
    # XPath needs an lxml tree, so convert the BeautifulSoup object first
    dom = et.HTML(str(soup))
    try:
        product_colour = dom.xpath('//*[text()="Color"]/following::div[1]/text()')[0]
    except IndexError:
        # The XPath expression matched nothing
        product_colour = "Colour is not available"
    data.loc[product, 'colour'] = product_colour
    return product_colour
This function uses the xpath() method to select the text of the first div element that follows the element containing the text "Color". If the product color is not available, the function assigns the value "Colour is not available" to the 'colour' column.
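The XPath expression for the colour relies on the following:: axis, which selects elements that come later in the document, not only siblings of the matched element. The difference can be seen on a small snippet. The markup below is made up for illustration and is not copied from the Costco page.

```python
from lxml import etree as et

# Made-up markup: the label sits inside a span, so the value div is not its sibling
SAMPLE_HTML = (
    '<div class="row">'
    '<div class="spec-name"><span>Color</span></div>'
    '<div class="spec-value">Black</div>'
    '</div>'
)

dom = et.HTML(SAMPLE_HTML)

# following-sibling:: only looks at elements sharing the span's parent
sibling = dom.xpath('//*[text()="Color"]/following-sibling::div[1]/text()')

# following:: looks at every element after the span in document order
following = dom.xpath('//*[text()="Color"]/following::div[1]/text()')
```

Here sibling is an empty list while following contains the colour, which is why the scraper indexes the result with [0] inside a try block.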
Starting the Scraping Process: Bringing it all together
With all the required functions defined, we can now begin the scraping process by calling each of them in turn to retrieve the desired data.
# Costco electronics categories link
url = 'https://www.costco.com/electronics.html'
driver.get(url)
url_content=click_url(driver)
The first step is to navigate to the Costco electronic categories page using the webdriver and the specified URL. We will then use the click_url() function to click on the Audio/Video category and extract the HTML content of the page.
# Creating a dictionary with required columns
data_dic = {'product_url': [], 'item_id': [], 'brand': [], 'product_name': [], 'category': [], 'model': [], 'price': [], 'colour': [], 'connection_type': [], 'description': []}
# Creating a dataframe
data = pd.DataFrame(data_dic)
To store the scraped data, we will create a dictionary with the required columns such as 'product_url', 'item_id', 'brand', 'product_name', 'colour', 'model', 'price', 'connection_type', 'category', 'description'. We will then create a dataframe using this dictionary, named 'data', which will be used to store all the scraped data.
# Scraping product links and adding it to the dataframe column 'product_url'
product_links(url_content)
The script is now calling the product_links(url_content) function, which extracts the links of all the products present under the 4 subcategories of the Audio/Video category. These links are then added to the 'product_url' column of the dataframe 'data'.
# Scraping all the required features of each product
for product in range(len(data)):
    product_url = data['product_url'].iloc[product]
    product_content = extract_content(product_url)
    # model
    get_model(product_content)
    # brand
    get_brand(product_content)
    # connection type
    get_connection_type(product_content)
    # price
    get_price(product_content)
    # colour
    get_colour(product_content)
    # item id
    get_item_id(product_content)
    # category type
    get_category(product_content)
    # description
    get_description(product_content)
    # product name
    get_product_name(product_content)
This code iterates through each product in the 'data' dataframe, extracting the product URL from the 'product_url' column and using the extract_content() function to retrieve the HTML content of the product page. It then calls the previously defined functions to extract specific features such as the model, brand, connection type, price, color, item id, category, description, and product name, and assigns these values to the respective columns of the dataframe at the specified index, effectively scraping all necessary information for each product.
data.to_csv('costco_data.csv')
With this final line of code, the dataframe 'data' containing all the scraped information for each product is exported to a CSV file named 'costco_data.csv'. This allows for easy access and manipulation of the scraped data for further analysis or use.
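By default, to_csv() also writes the dataframe index as an unnamed first column. If that column is not wanted, index=False leaves it out. This is a minimal sketch with made-up rows, written to an in-memory buffer instead of a file so it can be run anywhere.

```python
import io

import pandas as pd

# Made-up rows standing in for scraped products
rows = [
    {"product_url": "/a.html", "price": "$10.00"},
    {"product_url": "/b.html", "price": "Price is not available"},
]
frame = pd.DataFrame(rows)

# index=False omits the row index from the output
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
```

Passing a file name such as 'costco_data.csv' instead of the buffer writes the same content to disk.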
Conclusion
We have learned how to use Python and its web scraping libraries to extract product information from Costco's website, specifically focusing on the "Audio/Video" subcategory of the "Electronics" category. We walked through the process of inspecting the website structure, identifying the elements to extract, and writing Python code to automate the scraping process.
By mastering the basics of web scraping, you can collect data for a wide range of applications, from market research to data analysis and beyond.
We hope this blog post has provided you with a solid foundation in web scraping techniques and inspired you to explore the many possibilities that web scraping has to offer. So, what are you waiting for? Start exploring, and see what insights you can uncover with the power of web scraping.
Ready to discover the power of web scraping for your business? Contact Datahut to learn more.