Web Scraping Zara: Extracting Product Data using Python & Selenium
In the ever-evolving world of fashion, staying updated with the latest trends is not just a passion—it's a necessity for many. And when we speak of trendsetting, Zara inevitably enters the conversation. A Spanish multinational retail clothing chain, this globally recognized brand has consistently kept fashionistas on their toes, eagerly awaiting its next collection.
But what if there was a way to analyze these trends systematically, ensuring that we're not just catching up, but also forecasting the next big thing? This is precisely what web scraping helps to do.
Zara's website is a treasure trove of data on evolving fashion trends, consumer preferences, and market dynamics. This kind of information is vital for making smart decisions.
In this blog, we will learn how to scrape Zara product data. We'll look into customer preferences, popular product choices, and price ranges for a particular category in Zara Women: Jackets.
We'll be extracting the following attributes from Zara's product pages:
product_url: It is the unique address of a jacket on the Zara website.
product_name: It specifies the name and model of the jacket.
mrp: It is the selling price of the jacket.
color: It is the color of the jacket.
description: It is a short description of the jacket.
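Put together, a single scraped product maps onto a record like the one below (the values are hypothetical placeholders, not real Zara data):

```python
# One hypothetical record holding the five attributes we will scrape
record = {
    "product_url": "https://www.zara.com/us/en/example-jacket-p00000000.html",  # placeholder URL
    "product_name": "CROPPED BOMBER JACKET",
    "mrp": "$ 69.90",
    "color": "Black",
    "description": "Bomber jacket with a high collar and long sleeves.",
}

print(sorted(record))  # the column names our CSV file will use
```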
Step 1: Importing the Required Libraries
After identifying the attributes to be scraped, we need to import the required libraries. Here, we will be using Selenium, a browser-automation tool, to scrape the Zara website. The libraries to be imported are:
The Selenium WebDriver, a tool for web automation. It allows a user to automate browser actions such as clicking a button, filling in fields, and navigating between pages.
The By class from selenium.webdriver.common.by, which is used to locate elements on a web page using different strategies such as ID, class name, and XPath.
The writer class from the csv library, which is used to write tabular data in CSV format.
The sleep function from the time library, which pauses the execution of a program for a specified number of seconds.
# Importing the required libraries
from selenium import webdriver
from time import sleep
from csv import writer
from selenium.webdriver.common.by import By
Step 2: Initialization Process
After importing the required libraries, we need to initialize a few things before the actual scraping can begin. First, we create an instance of the Chrome WebDriver, pointing it at the ChromeDriver executable; this establishes a connection with the web browser, in this case Google Chrome. Once initialized, a Chrome window opens and Zara's website is loaded using the get() function so that Selenium can interact with it. Finally, the window is maximized using the maximize_window() function.
# Specify the full path to the ChromeDriver executable
# (Selenium 4 takes the driver path via a Service object instead of executable_path)
from selenium.webdriver.chrome.service import Service

chrome_driver_path = r"C:\Users\Dell\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(service=Service(chrome_driver_path))
driver.get('https://www.zara.com/us/en/search?searchTerm=women%20jackets&section=WOMAN')
driver.maximize_window()
Step 3: Getting the Product Links
Zara's search results page loads more products as you scroll, so we repeatedly scroll to the bottom and wait; once the page height stops growing, all products have loaded.

# Scrolling the web page until all products are loaded
height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    sleep(5)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if height == new_height:
        break
    height = new_height
After all the products have loaded, we create an empty list to store the product links. The product elements are located on the web page using XPath, and the find_elements() function returns them as a list. To get the actual product link from each of these elements, we call the get_attribute() method to extract its 'href' property and append the result to the list we created earlier.
product_links = []

# Getting the product elements
page_product_links = driver.find_elements(By.XPATH, '//div[@class="product-grid-product__figure"]/a')

# Getting the product links
for product in page_product_links:
    product_link = product.get_attribute('href')
    product_links.append(product_link)
Step 4: Defining Functions
We will now define functions to extract each attribute.
# Extracting product name
def get_product_name():
    try:
        product_name = driver.find_element(By.XPATH, '//h1[@class="product-detail-info__header-name"]').text
    except Exception:
        product_name = "Not available"
    return product_name

# Extracting product mrp
def get_mrp():
    try:
        mrp = driver.find_element(By.XPATH, '//span[@class="money-amount__main"]').text
    except Exception:
        mrp = "Not available"
    return mrp

# Extracting product color
def get_color():
    try:
        color = driver.find_element(By.XPATH, '//p[@class="product-color-extended-name product-detail-info__color"]').text
    except Exception:
        color = "Not available"
    return color

# Extracting product description
def get_desc():
    try:
        desc = driver.find_element(By.XPATH, '//div[@class="expandable-text__inner-content"]/p').text
    except Exception:
        desc = "Not available"
    return desc
Step 5: Writing to a CSV File
The extracted data needs to be stored so that it can be used later for purposes such as analysis. Now we will see how to store the extracted data in a CSV file.
First, we open a file named "women_jacket_data.csv" in write mode and initialize an object of the writer class named theWriter. The column headings of the CSV file are first defined as a list and then written to the file using the writerow() function.
Now we will extract the information about each product. For this, we iterate through each link in product_links, open it with the get() function, and then call the functions defined earlier to extract the required attributes. The returned attribute values are first stored as a list and then written to the CSV file using the writerow() function. Once the process is complete, the quit() command is called, which closes the web browser that the Selenium WebDriver opened.
Note that the sleep() function is called between the different function calls. These pauses help avoid getting blocked by the website.
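Fixed pauses work, but requests that arrive at perfectly regular intervals are easy for anti-bot systems to spot. A common refinement is to randomize the delay; here is a minimal sketch (the bounds and the helper name polite_sleep are my own, not part of the original script):

```python
from random import uniform
from time import sleep


def polite_sleep(low=2.0, high=6.0):
    """Pause for a random duration between low and high seconds."""
    delay = uniform(low, high)
    sleep(delay)
    return delay

# Usage: swap fixed sleep(3) calls for polite_sleep() in the scraping loop
```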
# Writing to a CSV file
with open('women_jacket_data.csv', 'w', newline='', encoding='utf-8') as f:
    theWriter = writer(f)
    heading = ['product_url', 'product_name', 'mrp', 'color', 'description']
    theWriter.writerow(heading)
    for product in product_links:
        driver.get(product)
        sleep(5)
        product_name = get_product_name()
        sleep(3)
        mrp = get_mrp()
        sleep(3)
        color = get_color()
        sleep(3)
        desc = get_desc()
        sleep(3)
        record = [product, product_name, mrp, color, desc]
        theWriter.writerow(record)

driver.quit()
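Before analyzing price ranges, note that the mrp column stores display strings (for example "$ 49.90"), not numbers. A small helper can convert them for analysis; this is a sketch of one approach, and the sample strings below are illustrative:

```python
import re


def parse_price(mrp):
    """Pull the first numeric value out of a price string; None if no digits found."""
    match = re.search(r'\d+(?:\.\d+)?', mrp.replace(',', ''))
    return float(match.group()) if match else None


print(parse_price('$ 49.90'))        # 49.9
print(parse_price('Not available'))  # None
```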
In the dynamic landscape of the fashion industry, understanding consumer preferences and emerging trends can provide a competitive edge to brands looking to build a strong footing in the industry. Through this guide, we've not only shown the technique of scraping Zara using Python and Selenium but also highlighted its potential to be adapted for various other product categories and e-commerce platforms.
While the techniques described above are excellent for data enthusiasts looking to perform small-scale extractions, larger projects demand dedicated solutions. That’s where Datahut web scraping services step in. At Datahut, we specialize in providing comprehensive web scraping services, assisting retailers in acquiring essential information seamlessly. By partnering with us, businesses can focus on interpretation and strategy, leaving the heavy lifting of data extraction to our experts.
Dive into the world of informed decision-making with Datahut and unlock the true potential of data in retail! Contact us today!