Sandra P

Web Scraping A Dynamic eCommerce website using Python

Updated: Jan 3, 2023


Dynamic websites generate content on the fly by executing server-side scripts or making API calls to a database. Therefore, scraping a dynamic website can be more challenging than scraping a static website, as the content may not be present in the HTML source code and must be generated by the server in response to user actions.


Sephora is a leading beauty retailer offering a wide selection of makeup, skincare, fragrance, and hair care products. With a diverse range of brands, including its private label, Sephora has something for every beauty enthusiast. In addition, the company's online site offers a convenient way for customers to browse and purchase products from the comfort of their own homes.


This blog deals with how to scrape data from the Sephora website, specifically perfumes for women. Sephora has a dynamic website: when a user interacts with the site, the content can change based on what the user wants to see, giving personalized content to each visitor.


Web scraping is an excellent tool if you're a competitor to Sephora or want to understand more about its product catalog. Having all the necessary details of each product organized in a structured format, such as in a spreadsheet or database, can make it easier to compare and analyze products and track changes over time. It can also help track trends and identify areas for improvement.


One way to scrape a dynamic website using Python is to use a headless browser, such as Selenium, which allows you to programmatically control a web browser and interact with a website as a user would. For example, with Selenium, you can execute JavaScript code, fill out forms, and click buttons, just as you would manually.


To get started with web scraping using Selenium and Python, you must install the Selenium package and a web driver. A web driver is software that allows Selenium to interact with a web browser. There are different web drivers for different browsers, such as Chrome, Firefox, and Safari.


How to scrape data from Sephora using Python


We will be scraping the data of perfumes for women on the Sephora website using Python Selenium and BeautifulSoup. The details of the process are described below.


Attributes

  • Product URL: The link to a specific product

  • Brand: The name of the brand

  • Product Name: The unique name of the product

  • Number of Reviews: Number of reviews the product has got

  • Love Count: Count of loves the product has got

  • Star Rating: The star rating the product has got

  • Price: The price of the product

  • Fragrance Family: The fragrance family which the perfume belongs to

  • Scent Type: The type of the scent of the perfume

  • Key Notes: The key notes of the perfume

  • Fragrance Description: The description of the fragrance of the perfume

  • Composition: The composition of the perfume

  • Ingredients: Ingredients used in the perfume


Importing required libraries


Let's begin by importing all the required libraries that will help us interact with the Sephora website and parse the data we are interested in. These libraries include:

  • The time library, which we can use to add delays to our script to avoid overwhelming the website with requests.

  • The random library, which we can use to generate random numbers to add variety to our requests.

  • The pandas library, which we can use to store and manipulate the data we scrape.

  • The BeautifulSoup module from the bs4 library, which we can use to parse HTML and extract data.

  • The Selenium library, which we can use to control a web browser and interact with the Sephora website.

  • The webdriver module from the Selenium library, which we can use to specify which browser we want to use with Selenium.

  • Various extensions of the webdriver module, such as Keys and By, can be used to access additional functionality in Selenium.

If any of these libraries are not installed on your system, you can use the pip command to install them. Here is the code to import all of these libraries:


import time
import random
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

pd.options.mode.chained_assignment = None

Now we have to create a Selenium browser instance. The code below does that:


driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))


Defining all the required functions


In order to make our code more readable and maintainable, we can define reusable pieces of code as functions. These functions, known as user-defined functions, enable us to encapsulate specific tasks and reuse them throughout our script without having to rewrite the same code multiple times. By defining functions, we can make our code more organized and easier to understand.


Function delay:


We will have to delay some processes in between. We can use a function to do this by suspending the execution of the next piece of code for a random number of seconds between 3 and 10. This function can be called whenever we need to add a delay to our script.


def delay():
    time.sleep(random.randint(3, 10))

Function lazy_loading:


When scraping data from dynamic websites, we may encounter the problem of lazy loading, where additional content is not loaded until it is needed. To ensure that we can access all of the data on the page, we can write code to scroll down the page and load the additional content. To do this, we can use the Keys class from the webdriver module to send page-down keys to the browser whenever we detect the body tag. We can also add delays to allow sufficient time for the products to load properly.


def lazy_loading():
    element = driver.find_element(By.TAG_NAME, 'body')
    count = 0
    while count < 20:
        element.send_keys(Keys.PAGE_DOWN)
        delay()
        count += 1

Function pagination:


The Sephora perfume homepage initially displays only 72 out of around 1000 available products. To access all of the products, we will need to continuously click the "Show More Products" button until everything is loaded. Each click will load an additional 60 products. To determine how many times we need to click the button, we can first locate the element containing the total number of available products using Selenium's webdriver and the XPath of the element. We can then divide this number by 60 and click the "Show More Products" button that number of times.


The XPath "//div[@class='css-unii66']/p" indicates that the required element is inside a division with class name 'css-unii66' and within the paragraph(p) tag in it.


This is also where we use the lazy_loading function to ensure all products load properly.


def pagination():
    total_number_of_products = driver.find_element(By.XPATH, "//div[@class='css-unii66']/p").text
    total_number_of_products = total_number_of_products.split(' ')
    total_number_of_products = int(total_number_of_products[2])    # Finding the total number of products in the site
    click_count = 0
    while click_count < (total_number_of_products / 60) + 1:  # Since number of products loaded per click is 60
        try:
            lazy_loading()
            driver.find_element(By.XPATH, "//div[@class='css-unii66']/button").click()  # Clicking the button saying
                                                                                        # 'load more products'
            delay()
            click_count += 1
        except:
            click_count += 1
            pass
    lazy_loading()

Function fetch_product_links:


When we inspect the Sephora perfumes home page, we can see that the products that do not have lazy loading are enclosed in a division with class name ‘css-foh208’, and the links lie inside the anchor tags' href attributes.



To get the links to products that do not have lazy loading, we can define a function that uses Selenium to locate elements with a specific class name and extract the href attribute from the anchor tag within those elements. We can do this by searching for the 'css-foh208' class and extracting the href attribute from the anchor tag within that division. This function can then return the extracted links for further processing.


def fetch_product_links(all_products_section):
    for product_section in all_products_section.find_all('div', {'class': 'css-foh208'}):
        for product_link in product_section.find_all('a'):
            if product_link['href'].startswith('https:'):
                product_links.append(product_link['href'])
            else:
                product_links.append('https://www.sephora.com' + product_link['href'])

Function fetch_lazy_loading_product_links:


For products with lazy loading, the class name is different from above. For them, it is 'css-1qe8tjm'. The rest is the same.



So the function will be as follows:


def fetch_lazy_loading_product_links(all_products_section):
    for product_section in all_products_section.find_all('div', {'class': 'css-1qe8tjm'}):
        for product_link in product_section.find_all('a'):
            if product_link['href'].startswith('https:'):
                product_links.append(product_link['href'])
            else:
                product_links.append('https://www.sephora.com' + product_link['href'])

Function extract_content:


This is the function to extract the source code of the entire web page we are currently accessing. Selenium is used to obtain the source code and BeautifulSoup is used to parse it. The HTML code is parsed using the module ‘html.parser’.


def extract_content(url):
    driver.get(url)
    page_content = driver.page_source
    product_soup = BeautifulSoup(page_content, 'html.parser')
    return product_soup

Function brand_data:


Here is the function to extract the brand name of the products. This uses BeautifulSoup to find the required element with a specific attribute, such as "data-at='brand_name'". This text is taken into the corresponding row of the ‘brand’ column. If the function cannot find the required element, it can populate the column with a default value, such as "Brand name not available."


def brand_data(soup):
    try:
        brand = soup.find('a', attrs={"data-at": "brand_name"}).text
        data['brand'].iloc[product] = brand
    except:
        brand = 'Brand name not available'
        data['brand'].iloc[product] = brand

Function product_name:


The product name lies in the span tag with the value of the ‘data-at’ attribute as ‘product_name’. The text within this is taken as the value in the corresponding row of the column ‘product_name’. If such a text is not found, the value taken will be "Product name not available".


def product_name(soup):
    try:
        name_of_product = soup.find('span', attrs={"data-at": "product_name"}).text
        data['product_name'].iloc[product] = name_of_product
    except:
        name_of_product = 'Product name not available'
        data['product_name'].iloc[product] = name_of_product

Function reviews_data:


This takes the count of reviews for each product. This detail lies in the span tag with the ‘data-at’ attribute as ‘number_of_reviews’. The text within this will be the number of reviews. If it is not found, we will take the value as "Number of reviews not available".


def reviews_data(soup):
    try:
        reviews = soup.find('span', attrs={"data-at": "number_of_reviews"}).text
        data['number_of_reviews'].iloc[product] = reviews
    except:
        reviews = 'Number of reviews not available'
        data['number_of_reviews'].iloc[product] = reviews

Function love_data:


This is the function to extract the love count obtained for each product. This lies in the span tag with class name ‘css-jk94q9’. This text is taken as the value in the column ‘love_count’. If it is not found, we provide the value "Love count not available".


def love_data(soup):
    try:
        love = soup.find('span', attrs={"class": "css-jk94q9"}).text
        data['love_count'].iloc[product] = love
    except:
        love = 'Love count not available'
        data['love_count'].iloc[product] = love

Function star_data:


This function is used to extract the star rating obtained for each product. This lies inside a span tag with the class name 'css-1tbjoxk', as the value of the attribute ‘aria-label’. If this is not found, we will take the value to be "Star rating not available".


def star_data(soup):
    if data['love_count'].iloc[product] != 'Love count not available':            # Since love count and star rating usually coexist
        try:
            star = soup.find('span', attrs={"class": "css-1tbjoxk"})['aria-label']
            data['star_rating'].iloc[product] = star
        except:
            star = 'Star rating not available'
            data['star_rating'].iloc[product] = star

Function price_data:


The price data is available inside the bold tag whose class name is ‘css-0’. The text inside this will be the price of the product. If it is not available, we will use the value "Price data not available".

def price_data(soup):
    try:
        price = soup.find('b', attrs={"class": "css-0"}).text
        data['price'].iloc[product] = price
    except:
        price = 'Price data not available'
        data['price'].iloc[product] = price

Function ingredients_data:


This function is used to extract the ingredients of the perfume. This differs slightly in different products. So we have to consider all those cases, all of which are included in this function. The data basically lies in a division with class name ‘css-1ue8dmw eanm77i0’.

For some products, there will be just one element in this position. In that case, there are two sub-cases: in some products the ingredients are provided as plain text, and in others they are not. Both are considered here.


Now, if there is more than one element in this position, we will take the second element. And if none of this is found, we will use the text "ingredients data not available".


def ingredients_data(soup):
    try:
        for ingredient in soup.find('div', attrs={"class": "css-1ue8dmw eanm77i0"}):
            if len(ingredient.contents) == 1:
                try:
                    data['Ingredients'].iloc[product] = ingredient.contents[0].text
                except:
                    data['Ingredients'].iloc[product] = ingredient.contents[0]
            else:
                data['Ingredients'].iloc[product] = ingredient.contents[1]
    except:
        data['Ingredients'].iloc[product] = 'Ingredients data not available'

Function find_element_by_xpath:


This function finds the element of interest by using XPath with the Selenium webdriver. The XPath "//div[@class='css-32uy52 eanm77i0']" indicates a division tag with class name 'css-32uy52 eanm77i0'.

The whole text from this division is extracted and split at newline characters ('\n'). The result is a list (split_sections) of feature lines. This list will now be used to find the remaining features: fragrance family, scent type, key notes, fragrance description and composition.


def find_element_by_xpath():
    try:
        section = driver.find_element(By.XPATH, "//div[@class='css-32uy52 eanm77i0']").text
        split_sections = section.split('\n')
        return split_sections
    except:
        pass

Function fragrance_family:


This function extracts the fragrance family of the perfume. Each feature in the previously obtained list is split at the colon (:). The word before the colon is the feature name and the word after the colon is the feature value. So each word before the colon is checked against 'Fragrance Family'. If it matches, the word after the colon is our element of interest. If such a feature is not found, we take the value as "Fragrance family data not available".


def fragrance_family():
    split_section = find_element_by_xpath()
    try:
        for feature in split_section:
            key_and_value = feature.split(':')
            try:
                if key_and_value[0] == 'Fragrance Family':
                    data['Fragrance Family'].iloc[product] = key_and_value[1]
            except:
                data['Fragrance Family'].iloc[product] = 'Fragrance family data not available'
    except:
        pass

Function scent_data:


Here we use this function to find the scent type of the perfume. The process is same as with fragrance family with the difference being in checking the value to be 'Scent Type' instead of 'Fragrance Family'. If it is not available, then we use "Scent type data not available".


def scent_data():
    split_section = find_element_by_xpath()
    try:
        for feature in split_section:
            key_and_value = feature.split(':')
            try:
                if key_and_value[0] == 'Scent Type':
                    data['Scent Type'].iloc[product] = key_and_value[1]
            except:
                data['Scent Type'].iloc[product] = 'Scent type data not available'
    except:
        pass

Function key_notes:


This is the function to extract the key notes from the product description. Here also, the process is the same, and we check whether the word before the colon is 'Key Notes'. If it is not available, we use "Key notes data not available".


def key_notes():
    split_section = find_element_by_xpath()
    try:
        for feature in split_section:
            key_and_value = feature.split(':')
            try:
                if key_and_value[0] == 'Key Notes':
                    data['Key Notes'].iloc[product] = key_and_value[1]
            except:
                data['Key Notes'].iloc[product] = 'Key notes data not available'
    except:
        pass

Function fragrance_description:


This is also the same as the previous process; here we check whether the word before the colon is 'Fragrance Description'. If it is, we take the word after it, and if not, we take "Fragrance description not available".


def fragrance_description():
    split_section = find_element_by_xpath()
    try:
        for feature in split_section:
            key_and_value = feature.split(':')
            try:
                if key_and_value[0] == 'Fragrance Description':
                    data['Fragrance Description'].iloc[product] = key_and_value[1]
            except:
                data['Fragrance Description'].iloc[product] = 'Fragrance description not available'
    except:
        pass

Function composition_data:


This function checks whether the lowercase form of the word before the colon is 'composition'. Unlike the other features, the composition value appears on the line after the heading, so the function takes the next element of the list. If no such word is found, we take the value to be "Composition not available".


def composition_data():
    split_section = find_element_by_xpath()
    try:
        for feature in split_section:
            key_and_value = feature.split(':')
            try:
                if key_and_value[0].lower() == 'composition':
                    index = split_section.index(feature)
                    data['COMPOSITION'].iloc[product] = split_section[index + 1]
            except:
                data['COMPOSITION'].iloc[product] = 'Composition not available'
    except:
        pass
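The four feature functions above all share the same split-at-colon pattern, so they could optionally be folded into one generic helper. Here is a minimal sketch; the `extract_feature` name and its fallback arguments are illustrative, not part of the original code, and the composition field would still need its own handling since its value sits on the line after its heading:

```python
def extract_feature(split_section, feature_name, fallback):
    """Return the value following 'feature_name:' in a list of 'key: value' lines.

    Falls back to the given message when the list is missing or the
    feature is not present.
    """
    if not split_section:
        return fallback
    for feature in split_section:
        key_and_value = feature.split(':')
        if key_and_value[0].strip().lower() == feature_name.lower():
            return key_and_value[1].strip()
    return fallback

# Example with a mocked-up split_sections list:
sections = ['Fragrance Family: Floral', 'Scent Type: Classic Floral']
print(extract_feature(sections, 'Fragrance Family',
                      'Fragrance family data not available'))  # Floral
```

Each of the four functions would then reduce to a single call with its own feature name and fallback message.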


Fetching all product URLs


Now that we have defined all the required functions, it's time to put them into action one by one.


To begin the scraping process, we will start with the homepage link, also known as the start link. We will use Selenium's webdriver to access this link and load all of the available products by calling the pagination function, which includes the lazy_loading function to handle lazy loading. Next, we will extract the source code of the currently loaded webpage, which now includes all of the available products, using BeautifulSoup and the html.parser module. From this source code, we will call the fetch_product_links and fetch_lazy_loading_product_links functions to retrieve links to all of the products. These links will be stored in a list called product_links.


# Sephora website link
start_url = 'https://www.sephora.com/shop/fragrances-for-women'

driver.get(start_url)

# Continuously clicking the button to show more products till everything is loaded
pagination()

# Converting the content of the page to BeautifulSoup object
content = driver.page_source
homepage_soup = BeautifulSoup(content, 'html.parser')

# Fetching the product links of all items
product_links = []
all_products = homepage_soup.find_all('div', attrs={"class": "css-1322gsb"})[0]
fetch_product_links(all_products)               # Fetching the product links that do not have lazy loading
fetch_lazy_loading_product_links(all_products)  # Fetching the product links that have lazy loading

Initializing the Dataframe


To store the data we are collecting, we can create a dictionary with the required columns as keys and blank lists as values. We can then convert this dictionary into a Pandas dataframe called data. Finally, we can assign the product_links list to the product_url column in the dataframe. This will allow us to easily store and organize the data we are collecting.


# Creating a dictionary of the required columns
data_dic = {'product_url': [], 'brand': [], 'product_name': [],
            'number_of_reviews': [], 'love_count': [], 'star_rating': [], 'price': [], 'Fragrance Family': [],
            'Scent Type': [], 'Key Notes': [], 'Fragrance Description': [], 'COMPOSITION': [], 'Ingredients': []}

# Creating a dataframe with those columns
data = pd.DataFrame(data_dic)

# Assigning the scraped links to the column 'product_url'
data['product_url'] = product_links


Extracting all the required features


For each link in the column 'product_url', the content is extracted by calling the function extract_content. Then the functions for extracting the required details are called one by one.


# Scraping data of all required features
for product in range(len(data)):
    product_url = data['product_url'].iloc[product]
    product_content = extract_content(product_url)

    # brands
    brand_data(product_content)

    # product_name
    product_name(product_content)

    # number_of_reviews
    reviews_data(product_content)

    # love_count
    love_data(product_content)

    # star_rating
    star_data(product_content)

    # price
    price_data(product_content)

    # ingredients
    ingredients_data(product_content)

    # Fragrance Family
    fragrance_family()

    # Scent Type
    scent_data()

    # Key Notes
    key_notes()

    # Fragrance Description
    fragrance_description()

    # COMPOSITION
    composition_data()

Saving into a CSV file


The dataframe we have obtained is now saved into a CSV file for future use.


# Saving the dataframe into a csv file
data.to_csv('sephora_scraped_data.csv')
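Once saved, the CSV can be loaded back with pandas for the kind of comparison and analysis mentioned earlier. A small sketch of what that might look like; the rows below are made up for illustration rather than taken from an actual scrape:

```python
import pandas as pd

# Illustrative rows standing in for the scraped CSV
data = pd.DataFrame({
    'brand': ['CHANEL', 'Dior', 'CHANEL'],
    'product_name': ['No. 5', "J'adore", 'Chance'],
    'price': ['$146.00', '$130.00', '$135.00'],
})

# Strip the currency symbol so prices can be compared numerically
data['price_usd'] = data['price'].str.lstrip('$').astype(float)

# Average price per brand
print(data.groupby('brand')['price_usd'].mean())
```

In practice you would replace the inline DataFrame with `pd.read_csv('sephora_scraped_data.csv')` and may need extra cleaning, since scraped price strings can contain ranges or sale prices.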


Best practices for web scraping dynamic websites


If you are using web scraping for commercial purposes and at scale, here are some best practices to follow.

  1. Respect the website's terms of use and robots.txt file: Make sure that you are allowed to scrape the website, and respect any rules or restrictions that the website has set.

  2. Use a headless browser: A headless browser, such as Selenium, allows you to programmatically control a web browser and interact with a website as a user would. This can be helpful for scraping websites that are generated on the fly or require user interactions.

  3. Use caching: To reduce the load on the website's servers and improve scraping speed, you can use caching to store the data you have already retrieved. This way, you can avoid making unnecessary requests and save time.

  4. Be polite: Don't scrape a website too frequently or send too many requests at once, as this can cause performance issues for the website and may even result in a ban.

  5. Handle errors gracefully: Websites can change over time, so be prepared to handle errors and adapt your code accordingly. Use try-except blocks and implement retry logic to handle temporary errors and timeouts.

  6. Use a proxy: If you are doing a large amount of scraping, you may want to use a proxy to hide your IP address and avoid being detected as a scraper. This can also help you bypass any IP-based rate limiting that the website may have in place.

  7. Use a well-behaved user agent: When making requests to a website, you should always set a user agent that identifies your scraper as a bot. This helps the website identify and differentiate your requests from those of a human user.
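The caching advice in point 3 can be as simple as a dictionary keyed by URL, so each page is fetched at most once per run. A minimal sketch, with a stub standing in for a real fetch (in this script the real fetch would be `driver.get` plus `driver.page_source`):

```python
page_cache = {}
fetch_count = 0

def fetch_page(url):
    """Stub for a real request; counts calls so the cache effect is visible."""
    global fetch_count
    fetch_count += 1
    return f'<html>content of {url}</html>'

def fetch_with_cache(url):
    # Serve from the cache when possible to avoid repeat requests
    if url not in page_cache:
        page_cache[url] = fetch_page(url)
    return page_cache[url]

fetch_with_cache('https://www.sephora.com/product/a')
fetch_with_cache('https://www.sephora.com/product/a')  # cache hit, no new request
print(fetch_count)  # 1
```

For longer-running jobs, the cache could be persisted to disk so that a restarted script does not re-download pages it has already seen.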



Conclusion

In conclusion, web scraping can be useful for collecting data from websites. However, it is important to be mindful of any potential legal and ethical considerations, such as respecting website terms of use and not overloading the server with excessive requests.


In this blog, we have demonstrated how to use Python to automate the process of collecting relevant details for Sephora's perfumes for women. We have achieved this by combining the capabilities of the Selenium and BeautifulSoup libraries. While BeautifulSoup is generally faster, Selenium may be necessary for extracting data that require HTML rendering. Once the data has been collected, it can be used in various ways, such as for comparison between products and brands.


Thanks for reading, and I hope you found this blog helpful.

