  • Thasni M A

Scraping Dollar General: Scrape Mother's Day Special Products using Selenium

Dollar General, a renowned chain of discount stores in the United States, stands out for its diverse selection of merchandise, encompassing essential household supplies, consumables, and seasonal items. Notably, Dollar General often presents captivating offers and deals on products in line with Mother's Day festivities.

In this tutorial, we use the Selenium web automation tool to scrape data on Mother's Day special products from Dollar General's website. We will extract the following attributes from the individual product pages:

  • Product URL - The URL of the resulting products.

  • Product Name - The name of the products.

  • Image - The image of the products.

  • Price - The price of the products.

  • Number of Reviews - The number of reviews of the products.

  • Ratings - The ratings of the products.

  • Description - The description of the products.

  • Product Details - Additional details of the products, such as brand, unit size, etc.

  • Stock Status - The stock status of the products.

Here's a step-by-step guide for using the Selenium web automation tool to scrape Mother's Day special product data from Dollar General's website.

Importing Required Libraries

Selenium is a tool designed to automate web browsers. It is well suited to web scraping because of automation capabilities such as clicking specific form buttons, entering information in text fields, and extracting DOM elements from the browser's HTML. To start, we import the libraries that will interact with the website and extract the information we need. These are the packages required to extract data from an HTML page.

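import time
import warnings
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Silence warning messages emitted during the scraping run
warnings.filterwarnings('ignore')

# Selenium 4 receives the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))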
  • time - This module is used for working with time-related tasks such as waiting for elements to load on a webpage.

  • warnings - This module is used to filter out warning messages that may be generated during the scraping process.

  • pandas - This is a powerful, widely used open-source Python library for manipulating and analyzing data in tabular form.

  • selenium - This package is a web testing framework used for automating web browsers to perform various tasks such as scraping data from websites.

  • webdriver - This module from the Selenium library interacts with a web browser and allows you to automate browser actions such as clicking buttons, filling out forms, and navigating pages.

  • ChromeDriverManager - This class from the webdriver-manager package automatically downloads and installs the Chrome web driver and sets it up for use with Selenium.

Finally, we create a Chrome driver instance with webdriver.Chrome(), using the driver binary installed by ChromeDriverManager, and store it in the driver variable.

Also Read: Scraping Decathlon using Playwright in Python

Extraction of Product URLs

The second step is extracting the resultant product URLs. Extracting product URLs is a critical step in web scraping. With the URLs of individual products, we can extract detailed product information, reviews, ratings, and other relevant data. In our case, the desired products are spread across 21 pages, so we need a function that visits each results page in turn, collects the product URLs, and stops when the "next page" button is disabled. The product URL extraction code is provided below.

def get_product_links(driver, url):
    product_links = []
    # page_number starts at -1, so the base URL is loaded on the first two passes
    page_number = -1
    prev_links_count = -1
    skipped_pages = []
    while True:
        if page_number <= 0:
            page_url = url
        else:
            # Later pages are reached by appending a page parameter
            page_url = f"{url}&page={page_number+1}"
        perform_request_with_retry(driver, page_url)
        paths = driver.find_elements("xpath", '//div[@class="dg-product-card row"]')
        for path in paths:
            link = path.get_attribute('data-product-detail-page-path')
            # Keep each product link only once
            if link and link not in product_links:
                product_links.append(link)
        links_count = len(product_links)
        print(f"Scraped {links_count} links from page {page_number+1}")
        # A disabled "next" arrow marks the last page
        next_page_button = driver.find_elements("xpath", '//button[@class="splide__arrow splide__arrow--next"][@data-target="pagination-right-arrow"][@disabled=""]')
        if len(next_page_button) > 0:
            break
        if links_count == prev_links_count:
            # No new links: remember the page and retry it after the loop
            skipped_page_url = page_url
            skipped_pages.append(skipped_page_url)
            print(f"No new links found on page {page_number+1}. Saving URL: {skipped_page_url}")
        else:
            prev_links_count = links_count
        page_number += 1
    for skipped_page in skipped_pages:
        perform_request_with_retry(driver, skipped_page)
        paths = driver.find_elements("xpath", '//div[@class="dg-product-card row"]')
        for path in paths:
            link = path.get_attribute('data-product-detail-page-path')
            if link and link not in product_links:
                product_links.append(link)
        print(f"Scraped {len(paths)} links from skipped page {skipped_page}")
    print(f"Scraped a total of {len(product_links)} product links")
    return product_links

The function extracts all resultant product URLs from Dollar General's dynamic webpages using XPath expressions and stores them in a list called product_links. Instead of clicking the next button to navigate through pages, the function builds the URL of each page from the base URL by appending a page parameter, because clicking through the dynamic page can be unreliable. It stops once the "next page" arrow is disabled. After each page, the function checks whether the total number of collected product URLs has grown. If it has not, the URL of the current page is added to a list called skipped_pages. After the loop completes, the function reloads each skipped page and extracts product URLs from it. In this way the function collects all product URLs and handles any pages that were missed.
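The URL-building step can be sketched on its own. This is a minimal illustration, not the script's actual code: the helper name build_page_urls and the example.com listing URL are hypothetical, and it assumes the page count (21) is known in advance, whereas the function above stops when the next arrow is disabled.

```python
def build_page_urls(base_url, total_pages):
    """Page 1 is the base URL; later pages append &page=N."""
    urls = [base_url]
    for page in range(2, total_pages + 1):
        urls.append(f"{base_url}&page={page}")
    return urls


# Placeholder listing URL, used only for illustration.
base = "https://example.com/list?q=gifts"
urls = build_page_urls(base, 21)
print(len(urls))
print(urls[1])
```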

Extraction of Product Name

The next step is the extraction of the names of the products from the web pages. The name of the product plays a crucial role in defining its identity and might reveal information about the kind of goods being offered.

def get_product_name():
    try:
        product_name = driver.find_element("xpath",'//h1[@class="productPickupFullDetail__productName d-none d-md-block"]')
        product_name = product_name.text
    except Exception:
        # Fall back to a placeholder when the element is missing
        product_name = 'Product name is not available'
    return product_name

The function will extract the name of a product from the Dollar General website using Selenium web driver. The function uses a try-except block to handle any errors that may occur during the web scraping process. The function attempts to find the product name element using an XPath locator and stores the text value of that element in the product_name variable. In case the element is not found for any reason, the function sets the product_name variable to the string 'Product name is not available'.
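The same try-except pattern recurs in every extractor in this tutorial. As a minimal sketch, it could be factored into one helper that takes any zero-argument callable. The name extract_or_default is hypothetical and not part of the original script, and the two stand-in functions below only simulate a successful and a failed element lookup.

```python
def extract_or_default(getter, default):
    """Return getter(), or default if the lookup raises any exception."""
    try:
        return getter()
    except Exception:
        return default


# Stand-ins for a Selenium lookup: one succeeds, one fails like a missing element.
def found():
    return "Sample product"


def missing():
    raise LookupError("element not found")


print(extract_or_default(found, "Product name is not available"))
print(extract_or_default(missing, "Product name is not available"))
```

With Selenium, the getter would typically be a lambda wrapping the driver.find_element call.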

Extraction of Image of the Products

The product images are a crucial part of the user experience while purchasing online. High-quality images can enhance a product's appeal, assist buyers in making knowledgeable judgments about their purchases, and set a product apart from its rivals.

def get_image_url():
    try:
        image_url = driver.find_element("xpath","//figure[@class='carousel__currentImage']/img").get_attribute('src')
    except Exception:
        image_url = 'Image URL is not available'
    return image_url

The function will extract the URL of the current image displayed on a product carousel from the Dollar General website using the Selenium web driver. The function uses a try-except block to handle any errors that may occur during the web scraping process. The function attempts to find the image element using an XPath locator and extract the URL of the image. If the image URL is not available, the except block sets the value of image_url to a string indicating that the URL is not available.

Similarly, we can extract other attributes such as the Number of Reviews, Ratings, Price, and Stock Status. We can apply the same technique to extract all of them.

Extraction of the Number of Reviews for the Products

def get_number_of_reviews():
    try:
        number_of_reviews = driver.find_element("xpath",'//a[@class="pr-snippet-review-count"]')
        number_of_reviews = number_of_reviews.text
        # Drop the "Reviews"/"Review" label and keep the number
        number_of_reviews = number_of_reviews.replace("Reviews", "")
        number_of_reviews = number_of_reviews.replace('Review', '').strip()
    except Exception:
        number_of_reviews = 'Number of Reviews is not available'
    return number_of_reviews
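If the review count is needed as a number for later analysis, the label text can be parsed with a regular expression. This is a sketch under the assumption that the label looks like "12 Reviews" or "1 Review", as the replace calls above suggest. The helper name parse_review_count is hypothetical.

```python
import re


def parse_review_count(text):
    """Return the first integer found in a review label, or None if there is none."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Remove thousands separators before converting
    return int(match.group().replace(",", ""))


print(parse_review_count("12 Reviews"))
print(parse_review_count("1 Review"))
```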

Extraction of Ratings of the Products

def get_star_rating():
    try:
        rating_string = driver.find_element("xpath", "//div[contains(@class,'pr-snippet-stars') and @role='img']")
        rating_string = rating_string.get_attribute("aria-label")
        # The rating is the second word of the aria-label
        rating = float(rating_string.split()[1])
    except Exception:
        rating = 'Product rating is not available'
    return rating
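Taking the second word of the aria-label depends on the exact wording of the label. A slightly more tolerant sketch pulls out the first number instead. The label format "Rated 4.5 out of 5 stars" is an assumption inferred from the split()[1] call, and parse_rating is a hypothetical helper name.

```python
import re


def parse_rating(aria_label):
    """Return the first number in the label as a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", aria_label or "")
    return float(match.group()) if match else None


print(parse_rating("Rated 4.5 out of 5 stars"))
```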

Extraction of Price of the Products

def get_product_price():
    try:
        price_element = driver.find_element("xpath","//div[@class='productPriceQuantity']//span[@class='product-price']")
        # Remove the currency symbol and surrounding whitespace
        product_price = price_element.text.replace('$', '').strip()
    except Exception:
        product_price = 'Product price is not available'
    return product_price
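The function above returns the price as a string. For price analysis it usually helps to convert it to a numeric type. The following sketch uses Decimal to avoid floating-point rounding. The helper name parse_price is hypothetical, and the sample strings are illustrative rather than scraped values.

```python
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert a price string such as '$5.00' to Decimal, or None if it is not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_price("$5.00"))
print(parse_price("Product price is not available"))
```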

Extraction of Stock Status of the Products

The stock status of a product refers to its availability in a particular store or online marketplace.

def get_stock_status():
    try:
        stock_info = driver.find_element("xpath","//p[@class='product__stock-alert' and @data-target='stock-alert-pickup']").text.replace('in stock', '').replace('at', '').strip()
    except Exception:
        try:
            # Fall back to the red stock alert
            stock_info = driver.find_element("xpath","//p[contains(@class,'product__stock-alert') and contains(@class,'product__text-red')]").text.replace('in stock', '').replace('at', '').strip()
        except Exception:
            stock_info = 'Stock information is not available'
    return stock_info
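One caveat with the chained replace calls: replace('at', '') also removes the letters "at" inside other words. A possible alternative, sketched below, strips only whole words using regular-expression word boundaries. The helper name clean_stock_text and the sample alert text are hypothetical, since the exact wording of the site's stock alert is not shown in this tutorial.

```python
import re


def clean_stock_text(text):
    """Remove the phrase 'in stock' and the standalone word 'at', then tidy whitespace."""
    text = re.sub(r"\bin stock\b", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\bat\b", " ", text, flags=re.IGNORECASE)
    return " ".join(text.split())


sample = "5 in stock at Manhattan"  # hypothetical alert text
print(clean_stock_text(sample))
# The chained str.replace version would also cut "at" out of "Manhattan":
print(sample.replace("in stock", "").replace("at", "").strip())
```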

Extraction of Product Description

Next, we are going to extract the product description and product details using Selenium.

From the Product Details section, we will extract the text of the first section, which describes the product, and store it as "Product Description". Additionally, we will extract other relevant information from the second section, such as "Available" and "Brand Description", and store it as "Product Details".

def get_product_description_and_features():
    try:
        details_section = driver.find_element("xpath", "//div[@id='product__details-section']")
        # Collect every paragraph and list item inside the section
        details_list = details_section.find_elements("xpath", ".//p | .//li")
        product_details = [detail.text for detail in details_list]
    except Exception:
        product_details = ['Product description is not available']
    return product_details

The function extracts the product description and features using the Selenium web driver. It locates the product details section, finds all paragraph and list-item elements within it using XPath expressions, and extracts the text content of each element. The resulting texts are returned as a list. If the section cannot be found on the webpage, the function returns a one-item list stating that the description is not available.

Also Read: Scraping Amazon Product Category Without Getting Blocked

Extraction of Product Details

def get_product_details():
    try:
        details_dict = {}
        # Expand the section so the hidden details are rendered
        show_more_button = driver.find_element("xpath", "//button[@class='product__details-button' and @data-target='show-more-button']")
        show_more_button.click()
        details_list = driver.find_elements("xpath", '//div[@class="product__details-data"]/div')
        for detail in details_list:
            detail_name = detail.find_element("xpath", 'p').text
            detail_value = detail.find_element("xpath", 'span').text
            if detail_name != '':
                details_dict[detail_name] = detail_value
    except Exception:
        details_dict = {'Product details': 'Not available'}
    brand_description = details_dict.get('Brand Description', 'Brand Description not available')
    unit_size = details_dict.get('Unit Size', 'Unit Size not available')
    sku = details_dict.get('SKU', 'SKU not available')
    return details_dict, brand_description, unit_size, sku

The function extracts product details from the product page using the Selenium web driver. It first clicks the "Show More" button to reveal additional details that are hidden by default. Then it uses XPath to locate all the elements on the page that contain product details and stores them in a list called details_list. The function loops through details_list, extracts the name and value of each detail, and stores them in a dictionary called details_dict. If any error occurs during this process, such as when the product details are not available or the webpage does not contain a "Show More" button, the function sets details_dict to a default value of 'Not available'. Finally, the function looks up three specific details in details_dict (brand description, unit size, and SKU) and returns them together with the dictionary as a tuple. These values can then be used to create separate columns in a data frame.
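The dictionary-building and default-lookup logic can be tried without a browser. The sketch below applies it to a plain list of (name, value) pairs. The helper name build_details and the sample pairs are hypothetical stand-ins for the text read from the page.

```python
def build_details(pairs):
    """Build the details dict from (name, value) pairs and pull out three fields."""
    details = {name: value for name, value in pairs if name != ""}
    if not details:
        details = {"Product details": "Not available"}
    # dict.get supplies a readable default when a field is absent
    brand = details.get("Brand Description", "Brand Description not available")
    unit_size = details.get("Unit Size", "Unit Size not available")
    sku = details.get("SKU", "SKU not available")
    return details, brand, unit_size, sku


# Hypothetical pairs, for illustration only.
pairs = [("Brand Description", "ExampleBrand"), ("Unit Size", "1 ct"), ("", "ignored")]
details, brand, unit_size, sku = build_details(pairs)
print(brand, unit_size, sku)
```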

Request Retry with Maximum Retry Limit

Request retry is a crucial aspect of web scraping as it helps to handle temporary network errors or unexpected responses from the website. The aim is to send the request again if it fails the first time to increase the chances of success.

Before navigating to a URL, the script applies a retry mechanism in case the request times out. A while loop keeps trying to load the URL until either the request succeeds or the maximum number of retries is reached, at which point the script raises an exception. This is useful when scraping web pages, as requests sometimes time out or fail due to network issues.

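MAX_RETRIES = 5  # maximum number of attempts per URL
def perform_request_with_retry(driver, url):
    retry_count = 0
    while retry_count < MAX_RETRIES:
        try:
            driver.get(url)
            break
        except Exception:
            retry_count += 1
            if retry_count == MAX_RETRIES:
                raise Exception("Request timed out")
            time.sleep(60)  # wait before the next attempt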

This function loads a given URL with the Selenium web driver, retrying on failure. Inside the while loop, the function attempts to load the page by calling driver.get(url); if the call succeeds, the loop exits. MAX_RETRIES is set to 5, so the function attempts to load the page at most 5 times. The retry_count variable starts at 0 and is incremented each time an exception occurs during the page load. If the maximum number of retries has been reached, the function raises an exception with the message "Request timed out"; otherwise, it sleeps for 60 seconds before trying again. Overall, this retry mechanism lets the scraping script ride out transient network or server errors instead of failing on the first one.
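The retry logic does not depend on Selenium, so it can be exercised with a simulated failure. The sketch below is a generic variant, not the script's own function: call_with_retry is a hypothetical name, the sleep function is passed in so the example runs without waiting, and on the final attempt it re-raises the original exception rather than a generic "Request timed out" error.

```python
import time


def call_with_retry(action, max_retries=5, delay_seconds=60, sleep=time.sleep):
    """Call action() until it succeeds or max_retries attempts have failed."""
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)


# Simulated page load that fails twice before succeeding.
attempts = {"count": 0}


def flaky_load():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "loaded"


waits = []  # records the requested delays instead of actually sleeping
result = call_with_retry(flaky_load, sleep=waits.append)
print(result, attempts["count"], waits)
```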

Extracting and Saving the Product Data

In the next step, we call the functions and save the data to an empty list.

def main():
    # Listing URL of the Mother's Day products (left empty in the source)
    url = ""
    product_links = get_product_links(driver, url)

    data = []
    for i, link in enumerate(product_links):
        perform_request_with_retry(driver, link)
        product_name = get_product_name()
        image = get_image_url()
        rating = get_star_rating()
        review_count = get_number_of_reviews()
        product_price = get_product_price()
        stock_status = get_stock_status()
        product_description = get_product_description_and_features()
        product_details, brand_description, unit_size, sku = get_product_details()

        data.append({'Product Link': link, 'Product Name': product_name, 'image': image, 'Star Rating': rating,
                     'review_count': review_count, 'Price': product_price, 'Stock Status': stock_status,
                     'Brand': brand_description, 'Unit_Size': unit_size, 'Sku': sku,
                     'Description': product_description, 'Details': product_details })

        if i % 10 == 0 and i > 0:
            print(f"Processed {i} links.")

        if i == len(product_links) - 1:
            print(f"All information for {i + 1} links has been scraped.")
    # Convert the collected rows to a data frame and save them
    df = pd.DataFrame(data)
    df.to_csv('product_data.csv')
    print('CSV file has been written successfully.')
    # Close the browser once scraping is finished
    driver.quit()

if __name__ == '__main__':
    main()

The function will extract the product details from Dollar General's website for Mother's Day special products using the functions defined earlier. The main function starts by extracting the URLs of each product from the web pages. Then it loops through each URL, using various functions to extract the desired product details for each product. For each product, the function extracts the product name, image URL, star rating, number of reviews, product price, stock status, product description and features, and product details such as brand description, unit size, and SKU.

These details are then stored in a dictionary, which is appended to a list of all products called data. Finally, the code converts the data list into a pandas dataframe and saves it to a CSV file named "product_data.csv". The web driver is then closed to end the script.
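Note that 'Description' holds a list and 'Details' holds a dictionary, so their cells in the CSV contain the text form of those Python objects. One option, sketched here with the standard library's csv and json modules rather than pandas, is to serialise such values as JSON before writing so they can be parsed back later. The helper name flatten_row and the sample row are hypothetical.

```python
import csv
import io
import json


def flatten_row(row):
    """Serialise list and dict values as JSON so each CSV cell holds one string."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, (list, dict)):
            flat[key] = json.dumps(value)
        else:
            flat[key] = value
    return flat


# Hypothetical row with the same shape as the entries appended to data.
rows = [{"Product Name": "Sample product",
         "Description": ["Line one", "Line two"],
         "Details": {"Unit Size": "1 ct"}}]

buffer = io.StringIO()  # stands in for an open CSV file
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
for row in rows:
    writer.writerow(flatten_row(row))
print(buffer.getvalue())
```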

Insights from the scraped data

Having successfully scraped the requisite data, we can now leverage it to derive critical insights that provide a deeper understanding of Dollar General's Mother's Day special products. Here are some of the key insights that can be inferred from the scraped data:

  1. Dollar General's extensive range of Mother's Day special products comprises 241 items, featuring 46 renowned brands such as Artskills, Scent Happy, Maybelline, Believe Beauty, and Clover Valley, among others.

2. Despite the abundant selection, certain products went out of stock due to high demand or other factors, with 51 out of 241 products currently out of stock.

3. Dollar General's focus on providing affordable products is evident in the pricing of the majority of the products, with a considerable portion priced below $10. This pricing strategy provides customers with a vast selection of budget-friendly options while maintaining the quality of the products.

4. An analysis of the review counts revealed that out of the total 241 products, 210 products received reviews within the range of 0-50. This suggests that a significant number of products garnered a relatively lower number of reviews.

5. An assessment of the ratings distribution indicated that nearly half of the products, 115 out of 241, were rated either 0-1.0 or 4.0-5.0. Only a few received ratings between 1.0-2.0 or 2.0-3.0, which suggests polarized customer sentiment: customers tended to either love or hate the products.
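The counts quoted in these insights can be turned into percentages with a few lines of arithmetic. The sketch below uses only the figures stated above (241 products in total); the dictionary keys and the share helper are illustrative names.

```python
TOTAL_PRODUCTS = 241

# Counts taken from the insights listed above.
counts = {
    "out of stock": 51,
    "0-50 reviews": 210,
    "rated 0-1.0 or 4.0-5.0": 115,
}


def share(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {label: share(count, TOTAL_PRODUCTS) for label, count in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```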

Also Read: How to Build an Amazon Price Tracker using Python

Ready to discover the power of web scraping for your brand?

Web scraping has proven to be an effective tool for extracting valuable data from e-commerce websites like Dollar General. By leveraging the power of web automation tools like Selenium, we were able to scrape essential product attributes, such as URLs, prices, images, descriptions, and stock status, which we analyzed to extract meaningful insights.

Whether you're a business owner looking to monitor your competitor's pricing strategy, a market researcher seeking to analyze market trends, or an individual looking to explore data-driven opportunities, web scraping is a game-changer.

If you're interested in leveraging web scraping to enhance your business operations, consider partnering with a reliable web scraping service provider like Datahut. With years of experience in web data extraction, Datahut can help you extract, clean, and deliver high-quality web data that meets your business needs.

Learn more about our web scraping services and how we can help you achieve your data goals. Take the first step towards data-driven success by contacting Datahut today!
