Srishti Saha

How to Scrape Product Information from Walmart using Python beautifulsoup



Walmart is a leading retailer with both an online store and brick-and-mortar stores across the world. With a vast product portfolio and 519.93 billion USD in net sales, Walmart not only dominates the retail market but also generates a wealth of data that can be used to gain insights into customer behavior, product portfolios, and even market trends.

In this article, we will scrape product data from Walmart.com and store it in a SQL database. We use Python for the scraping: BeautifulSoup parses the HTML, and Selenium allows us to drive the Google Chrome browser.


Scraping Walmart Product Data


The first step is to import all the necessary libraries. Once the packages are imported, we set up the flow of the scraper. To modularize the code, we first investigated the structure of the product page URLs on Walmart.com. A URL, or Uniform Resource Locator, is the address of the web page a user is referring to and can be used to uniquely identify that page.



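The snippet below is a shortened sketch of that step, taken from the complete script at the end of this post:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time

# URLs of subcategory pages within the electronics department
url_sets = ["https://www.walmart.com/browse/tv-video/all-tvs/3944_1060825_447913",
            "https://www.walmart.com/browse/computers/desktop-computers/3944_3951_132982",
            "https://www.walmart.com/browse/electronics/all-laptop-computers/3944_3951_1089430_132960",
            "https://www.walmart.com/browse/prepaid-phones/1105910_4527935_1072335"]

# names of these subcategories, used later to name our datasets or tables
categories = ["TVs", "Desktops", "Laptops", "Prepaid_phones"]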

In the example above, we have created a list of URLs of pages within the electronics department on Walmart.com. We have also created a list of the names of these product categories, which we will use later to name our datasets or tables.


You can add or remove as many subcategories as you want for each major product category. All you need to do is go to the subcategory page and copy its URL; this address is common to all the products listed on that page. You can do this for as many product categories as you want. In the snippet below, we have added categories like Food and Toys as an example for this demo.


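A sketch of how the lists could be extended; the Food and Toys URLs below are placeholders, so copy the real browse URLs from the subcategory pages you want to scrape:

# placeholders for illustration only, not real category IDs
url_sets = url_sets + ["https://www.walmart.com/browse/food/<food-subcategory-id>",
                       "https://www.walmart.com/browse/toys/<toys-subcategory-id>"]
categories = categories + ["Food", "Toys"]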


Also, please note that we have stored the URLs in a list, as this makes processing the data in Python easier. Once these lists are ready, we can move on to writing the scraper.

We have created a loop to automate the scraping exercise. However, we can always run it for just one category and subcategory as well. Let us assume we want to scrape information for just one subcategory, i.e. TVs in the ‘Electronics’ category. We will later show how to scale the code up to all subcategories.

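A sketch of that setup, keeping the same variable names as the complete script below:

# pick only the first subcategory (TVs) from url_sets; Python lists are zero-indexed
pg = 0
url_category = url_sets[pg]
print("Category:", categories[pg])

# product-listing pages to visit within this subcategory
top_n = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

final_results = []   # will collect the product IDs found on every page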

Here, setting the variable pg = 0 ensures we scrape data for only the first URL in the list ‘url_sets’, i.e. only the first subcategory in the main category (Python lists are zero-indexed). Once you do that, the next step is to define the number of product pages you want to visit and scrape information from. For this demo, we are scraping information from the top 10 pages (as defined in the top_n list).

We then loop through the complete length of the top_n list, i.e. 10 times, to open the product-listing pages and extract the complete web page structure in the form of HTML code. This is similar to inspecting the elements of a web page (by right-clicking on the page and selecting ‘Inspect Element’) and copying the resulting HTML. However, we have added a constraint so that only the part of the HTML structure that lies within the ‘body’ tag is extracted and stored as an object, because the relevant product information sits only in the HTML body of the page.


This object can now be used to pull relevant product information for all the products listed on the active page. To do so, we identified that the tag containing product information is a ‘div’ tag with the class ‘search-result-gridview-item-wrapper’. So, in the next step, we used the find_all function to extract all occurrences of this class, collected the ‘data-id’ of each one, and stored the de-duplicated set in a temporary object called ‘codelist’.


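The corresponding excerpt from the complete script looks like this (the chromedriver path is whatever it is on your machine):

for i_1 in range(len(top_n)):
    # URL of the i-th listing page within the subcategory
    url_cat = url_category + "?page=" + top_n[i_1]
    driver = webdriver.Chrome(executable_path='C:/Drivers/chromedriver.exe')
    driver.get(url_cat)
    # keep only the HTML inside the <body> tag
    body_cat = driver.find_element_by_tag_name("body").get_attribute("innerHTML")
    driver.quit()
    soupBody_cat = BeautifulSoup(body_cat, "html.parser")

    # every product tile sits in a div of this class; its data-id is the product code
    for tmp in soupBody_cat.find_all('div', {'class': 'search-result-gridview-item-wrapper'}):
        final_results.append(tmp['data-id'])

# de-duplicate the collected product codes
codelist = list(set(final_results))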

Next, we constructed the URL of each individual product. To do so, we observed that all product pages start with the base string ‘https://walmart.com/ip’. A unique identifier is simply appended to this string, and that identifier is the ‘data-id’ value extracted from the ‘search-result-gridview-item-wrapper’ items stored above. So, in the next step, we looped through the temporary object codelist to construct the complete URL of each product’s page.


Using this URL, we can extract specific product-level information. For this demo, we are pulling details such as the unique product code, product name, product description, URL of the product page, name of the parent category in which the product is placed (parent breadcrumb), name of the active subcategory in which the product is placed on the website (active breadcrumb), product price, rating (in terms of number of stars), number of ratings or reviews for the product, and other products that Walmart recommends as similar or related to it. You can customize this list as per your requirements.


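An excerpt of that step from the complete script, shown for a single product code codelist[i]:

url1 = "https://walmart.com/ip"    # base string of every product page
url2 = url1 + "/" + codelist[i]    # complete URL of this product's page

driver = webdriver.Chrome(executable_path='C:/Drivers/chromedriver.exe')
driver.get(url2)
time.sleep(3)
body = driver.find_element_by_tag_name("body").get_attribute("innerHTML")
driver.quit()
soupBody = BeautifulSoup(body, "html.parser")

# product-level attributes, identified by their tags and classes on the page
h1ProductName = soupBody.find("h1", {"class": "prod-ProductTitle prod-productTitle-buyBox font-bold"})
divProductDesc = soupBody.find("div", {"class": "about-desc about-product-description xs-margin-top"})
liProductBreadcrumb_parent = soupBody.find("li", {"data-automation-id": "breadcrumb-item-0"})
liProductBreadcrumb_active = soupBody.find("li", {"class": "breadcrumb active"})
spanProductPrice = soupBody.find("span", {"class": "price-group"})
spanProductRating = soupBody.find("span", {"itemprop": "ratingValue"})
spanProductRating_count = soupBody.find("span", {"class": "stars-reviews-count-node"})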

The code above opens the individual product page based on the constructed URL and extracts the product attributes mentioned in the list above. Once you are happy with the list of attributes being pulled, the final step for the scraper is to append all this information for all the products in a subcategory into a single data frame. The code below does that for you.

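In the complete script this is done by appending each product’s attributes to a list of rows, headed by the column names, and turning it into a pandas data frame:

WLMTData = [["Product_code", "Product_name", "Product_description", "Product_URL",
             "Breadcrumb_parent", "Breadcrumb_active", "Product_price",
             "Rating_Value", "Rating_Count", "Recommended_Prods"]]

# inside the product loop, after the attributes have been extracted;
# reco_prods holds the IDs of recommended products collected from the same page
WLMTData.append([codelist[i], h1ProductName.text, divProductDesc, url2,
                 liProductBreadcrumb_parent, liProductBreadcrumb_active,
                 spanProductPrice, spanProductRating,
                 spanProductRating_count, reco_prods])

# the first row becomes the column names of the data frame
df = pd.DataFrame(WLMTData)
df.columns = df.iloc[0]
df = df.drop(df.index[0])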

The data frame ‘df’ will have all the information for the products on the top 10 pages of the selected subcategory. You can either write the data out to a CSV file or export it to a SQL database. If you want to export it to a MySQL database into a table called ‘product_info’, you can use the code below:

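A sketch of that export, assuming a MySQL database reachable through the mysql-connector driver and SQLAlchemy installed:

import sqlalchemy

database_username = 'ENTER USERNAME'
database_password = 'ENTER PASSWORD'
database_ip       = 'ENTER DATABASE IP'
database_name     = 'ENTER DATABASE NAME'

# build a SQLAlchemy engine for the MySQL database
database_connection = sqlalchemy.create_engine(
    'mysql+mysqlconnector://{0}:{1}@{2}/{3}'.format(
        database_username, database_password, database_ip, database_name))

# write df to a table called product_info, replacing it if it already exists
df.to_sql(con=database_connection, name='product_info', if_exists='replace')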

You will need to enter the credentials of your SQL database; once you do, Python lets you connect your working environment directly to the database and push the dataset as a SQL table. In the code above, if a table by that name already exists, it will be replaced. You can always change the script to avoid doing so: pandas gives you the option to either 'fail', 'replace', or 'append' here.


This is the basic structure of the code, and it can be modified to add exception handling for missing data or slow-loading pages. If you opt to loop this code over multiple subcategories, the complete code looks like this:



import os
import selenium.webdriver
import csv
import time
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

url_sets=["https://www.walmart.com/browse/tv-video/all-tvs/3944_1060825_447913",
    "https://www.walmart.com/browse/computers/desktop-computers/3944_3951_132982",
         "https://www.walmart.com/browse/electronics/all-laptop-computers/3944_3951_1089430_132960",
         "https://www.walmart.com/browse/prepaid-phones/1105910_4527935_1072335",
         "https://www.walmart.com/browse/electronics/portable-audio/3944_96469",
         "https://www.walmart.com/browse/electronics/gps-navigation/3944_538883/",
         "https://www.walmart.com/browse/electronics/sound-bars/3944_77622_8375901_1230415_1107398",
         "https://www.walmart.com/browse/electronics/digital-slr-cameras/3944_133277_1096663",
         "https://www.walmart.com/browse/electronics/ipad-tablets/3944_1078524"]

categories=["TVs","Desktops","Laptops","Prepaid_phones","Audio","GPS","soundbars","cameras","tablets"]


# scraper
for pg in range(len(url_sets)):
    # number of pages per category
    top_n= ["1","2","3","4","5","6","7","8","9","10"]
    # extract page number within sub-category
    url_category=url_sets[pg]
    print("Category:",categories[pg])
    final_results = []
    for i_1 in range(len(top_n)):
        print("Page number within category:",i_1)
        url_cat=url_category+"?page="+top_n[i_1]
        driver= webdriver.Chrome(executable_path='C:/Drivers/chromedriver.exe')
        driver.get(url_cat)
        body_cat = driver.find_element_by_tag_name("body").get_attribute("innerHTML")
        driver.quit()
        soupBody_cat = BeautifulSoup(body_cat, "html.parser")
 
 
        for tmp in soupBody_cat.find_all('div', {'class':'search-result-gridview-item-wrapper'}):
            final_results.append(tmp['data-id'])
            
    # save final set of results as a list        
    codelist=list(set(final_results))
    print("Total number of prods:",len(codelist))
    # base URL for product page
    url1= "https://walmart.com/ip"


    # Data Headers
    WLMTData = [["Product_code","Product_name","Product_description","Product_URL",
                 "Breadcrumb_parent","Breadcrumb_active","Product_price",
                 "Rating_Value","Rating_Count","Recommended_Prods"]]
 
    for i in range(len(codelist)):
        # loop over every unique product code collected above
        print(i)
        item_wlmt=codelist[i]
        url2=url1+"/"+item_wlmt
        #print(url2)


        try:
            driver= webdriver.Chrome(executable_path='C:/Drivers/chromedriver.exe') # Chrome driver is being used.
            print ("Requesting URL: " + url2)


            driver.get(url2)   # URL requested in browser.
            print ("Webpage found ...")
            time.sleep(3)
            # Find the document body and get its inner HTML for processing in BeautifulSoup parser.
            body = driver.find_element_by_tag_name("body").get_attribute("innerHTML")
            print("Closing Chrome ...") # No more usage needed.
            driver.quit()     # Browser Closed.


            print("Getting data from DOM ...")
            soupBody = BeautifulSoup(body, "html.parser") # Parse the inner HTML using BeautifulSoup


            h1ProductName = soupBody.find("h1", {"class": "prod-ProductTitle prod-productTitle-buyBox font-bold"})
            divProductDesc = soupBody.find("div", {"class": "about-desc about-product-description xs-margin-top"})
            liProductBreadcrumb_parent = soupBody.find("li", {"data-automation-id": "breadcrumb-item-0"})
            liProductBreadcrumb_active = soupBody.find("li", {"class": "breadcrumb active"})
            spanProductPrice = soupBody.find("span", {"class": "price-group"})
            spanProductRating = soupBody.find("span", {"itemprop": "ratingValue"})
            spanProductRating_count = soupBody.find("span", {"class": "stars-reviews-count-node"})
 
            ################# exceptions #########################
            # fall back to placeholder values when an attribute is missing,
            # otherwise keep only the text of the matched tag
            if divProductDesc is None:
                divProductDesc = "Not Available"
            else:
                divProductDesc = divProductDesc.text

            if liProductBreadcrumb_parent is None:
                liProductBreadcrumb_parent = "Not Available"
            else:
                liProductBreadcrumb_parent = liProductBreadcrumb_parent.text

            if liProductBreadcrumb_active is None:
                liProductBreadcrumb_active = "Not Available"
            else:
                liProductBreadcrumb_active = liProductBreadcrumb_active.text

            if spanProductPrice is None:
                spanProductPrice = "NA"
            else:
                spanProductPrice = spanProductPrice.text

            if spanProductRating is None or spanProductRating_count is None:
                spanProductRating = 0.0
                spanProductRating_count = "0 ratings"
            else:
                spanProductRating = spanProductRating.text
                spanProductRating_count = spanProductRating_count.text

            ### Recommended Products
            reco_prods = []
            for tmp in soupBody.find_all('a', {'class': 'tile-link-overlay u-focusTile'}):
                reco_prods.append(tmp['data-product-id'])

            if len(reco_prods) == 0:
                reco_prods = ["Not available"]

            WLMTData.append([codelist[i], h1ProductName.text, divProductDesc, url2,
                             liProductBreadcrumb_parent, liProductBreadcrumb_active,
                             spanProductPrice, spanProductRating,
                             spanProductRating_count, reco_prods])


        except Exception as e:
            print (str(e))

# save final result as dataframe
    df=pd.DataFrame(WLMTData)
    df.columns = df.iloc[0]
    df=df.drop(df.index[0])

# Export dataframe to SQL
import sqlalchemy
database_username = 'ENTER USERNAME'
database_password = 'ENTER PASSWORD'
database_ip       = 'ENTER DATABASE IP'
database_name     = 'ENTER DATABASE NAME'
# build a SQLAlchemy engine and push the data frame to a MySQL table
database_connection = sqlalchemy.create_engine('mysql+mysqlconnector://{0}:{1}@{2}/{3}'.format(database_username, database_password, database_ip, database_name))
df.to_sql(con=database_connection, name='product_info', if_exists='replace')



You can always add more complexity to this code to customize your scraper. For instance, the above scraper takes care of missing data in attributes like description, price, or reviews. This data could be missing for numerous reasons: the product may be sold out or out of stock, the data may have been entered incorrectly, or the product may simply be too new to have any ratings or information yet.


Web page structures change over time, so you will have to keep modifying your scraper for it to remain functional when the site is updated. This scraper provides you with a base template for a Python scraper on Walmart.com.


Looking to scrape web data for your business? Contact Datahut, your web scraping experts.



