By Ashwin Joseph

How to Scrape Data from Flipkart using Python


Flipkart is one of India's leading e-commerce websites and largest online marketplaces, offering online shopping, courier services, bill payments, and a wide range of products to its users.


The sheer volume of data available on Flipkart's servers makes it a treasure trove for marketing professionals. The data available on Flipkart can help brands learn about their competitors. Detailed analysis can help brands understand the consumer better, strengthen their brand image, win more market share, and maintain the loyalty of existing customers.


In this blog, we will see how to scrape data from the smartwatch category of Flipkart. We will scrape the details of smartwatches from the top smartwatch brands in India and save them into a CSV file.


The Brands

We'll be extracting smartwatch data of the following brands:

  1. Apple

  2. Noise

  3. boAt

  4. Honor

  5. Samsung

  6. Fitbit

  7. Amazfit

  8. Garmin

  9. Huawei

  10. Fossil

The Attributes

We will extract the following attributes of each product:

  1. Product URL - It is the address of a particular product on the Internet.

  2. Product title - It is the name of the product on the Flipkart website.

  3. Brand - It is the brand to which the product belongs, for example, Apple.

  4. Sale price - It is the selling price of a product after the discounts are applied.

  5. MRP - It is the maximum retail price of a product.

  6. Discount percentage - It is the percentage deducted from the MRP of a product.

  7. Memory - It refers to the internal memory of the product.

  8. No. of ratings - It is the total number of ratings the product has received.

  9. No. of reviews - It is the total number of reviews the product has received.

  10. Star rating - It is the overall rating of the product. The higher the rating, the better the product.

  11. Description - It is a short description or the specifications of the product provided on the website.

Required Libraries

The first step in any scraping project is to import the required libraries. Here, we import the BeautifulSoup library, the requests library, the etree module from the lxml library, and the csv, random, and time libraries.

from bs4 import BeautifulSoup
import requests
from lxml import etree as et
import csv
import random
import time
  • BeautifulSoup is a Python library used for parsing and pulling data out of HTML and XML files.

  • The requests library in Python is used to send HTTP requests to servers.

  • The lxml library of Python is used for processing HTML and XML files. etree (ElementTree) is a module in lxml used to parse XML and HTML documents.

  • The csv library is used to read and write tabular data in CSV format.

  • The random library is used to generate random numbers or randomize a process in Python.

  • The time library is used to handle time-related operations, such as adding delays between requests.

Now, let’s move on to the scraping process.


Scraping Process

After importing the required libraries, the next step is to define a list of user agent headers. A user agent string is a text that a client sends through a request which helps the server to identify the browser, type of device, and operating system which is being used.


Most websites block scraping because they do not wish to share their data, or wish to share it only with valid users rather than bot scrapers. To avoid getting blocked by a website, we need to define a user agent while making requests. With a single user agent header, there is a chance of the web server identifying and blocking it. To avoid this, we keep rotating our user agent, so that requests appear to come from different browsers.

header_list = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36",
               "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0",
               "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393"]

base_url = "https://www.flipkart.com"

smartWatch_brands = ['APPLE', 'Noise', 'boAt', 'Honor', 'SAMSUNG', 'FITBIT', 'Amazfit', 'GARMIN', 'Huawei', 'FOSSIL']

product_list = []

In the above block of code, we define a list named header_list containing 3 user agent headers. We also initialize two more lists: smartWatch_brands, which contains the brand names, and an empty list product_list to store the link to each product. We also initialize a string, base_url, which we will use later in the program.


Next, let's look at the operations that take place when scraping a particular URL.

for brand in smartWatch_brands:
   page_url = "https://www.flipkart.com/wearable-smart-devices/smart-watches/pr?sid=ajy%2Cbuh&otracker=categorytree&p%5B%5D=facets.brand%255B%255D%3D" + brand
   dom = get_dom(page_url)
   pages = dom.xpath('//div[@class="_2MImiq"]/span/text()')
   page_product_list = dom.xpath('//a[@class="_1fQZEK"]/@href')
   product_list += page_product_list
   if not pages:
       continue
   else:
       no_of_pages = int(pages[0].split()[3])
       for i in range(2, no_of_pages + 1):
           next_page_url = page_url + "&page=" + str(i)
           dom = get_dom(next_page_url)
           page_product_list = dom.xpath('//a[@class="_1fQZEK"]/@href')
           product_list += page_product_list
           time.sleep(random.randint(2,8))

For scraping a website, we first need its URL. In this for loop, we take each brand from smartWatch_brands, concatenate it with the category URL, and store the result in a variable named page_url. This variable now contains the URL of the first page of that brand in the smartwatch category of Flipkart. We then call the get_dom() function (defined in the next block) and pass page_url as the parameter.
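Inside the loop, the total page count is read from the pagination text on the listing page. Assuming the text looks like "Page 1 of 12" (a made-up example; the real count varies by brand), split()[3] picks out the total number of pages:

```python
# Hypothetical pagination string as it appears on a Flipkart listing page.
pages_text = "Page 1 of 12"

# split() yields ['Page', '1', 'of', '12'], so index 3 is the total page count.
no_of_pages = int(pages_text.split()[3])
print(no_of_pages)  # 12
```

This is why the loop over the remaining pages runs from 2 to no_of_pages + 1, appending "&page=i" to the first page's URL.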

def get_dom(the_url):
   # Rotate the user agent to reduce the chance of being blocked
   user_agent = random.choice(header_list)
   header = {"User-Agent": user_agent}
   response = requests.get(the_url, headers=header, stream=True)
   # Parse the raw HTML, then convert it into an lxml tree for XPath queries
   soup = BeautifulSoup(response.text, 'lxml')
   current_dom = et.HTML(str(soup))
   return current_dom

In the get_dom function, we randomly select a user agent header from header_list and pass it as the value of User-Agent. We then send an HTTP request to the URL of the webpage we want to access, with this user agent as the header, using Python's requests library. The call returns a Response object containing the server's response; here, the response is the HTML content of the page, stored in a variable named response. This raw HTML is difficult to parse directly, so we first parse it into a tree with Python's BeautifulSoup library and then convert it into a tree-like structure that supports XPath queries using the etree module from lxml. This structure is returned and stored in the variable dom inside the loop.


As a next step, we will use XPath to locate the elements we want to extract from the website. To find an element's XPath, select the element on the page, right-click, and choose Inspect. This reveals the tag and the class name within which the element is enclosed; we then use this tag and class name in an XPath expression to extract the element. An example of extracting an element using XPath is shown below.


We can see that the link of the product is enclosed within the anchor ('a') tag, whose class is named '_1fQZEK'. The link to this product is extracted using the line of code below.

page_product_list = dom.xpath('//a[@class="_1fQZEK"]/@href')

This line of code stores the links to all the products on the page in the list page_product_list, which we then append to product_list. After the for loop finishes, product_list contains the links to all the products in the smartwatch category for every brand in the smartWatch_brands list. Finally, we add a random delay between requests using Python's time library; this is another way to avoid getting blocked while scraping.
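To see this XPath in action outside the scraper, here is a toy example on a made-up HTML snippet that mimics Flipkart's product-card markup (the snippet and href are invented; only the class name matches the one used in the article):

```python
from lxml import etree as et

# Made-up HTML mimicking a Flipkart product card; only the class name is real.
html = '<div><a class="_1fQZEK" href="/apple-watch?pid=1">Apple Watch</a></div>'
dom = et.HTML(html)

# Same XPath as in the scraper: the href of every anchor with this class.
links = dom.xpath('//a[@class="_1fQZEK"]/@href')
print(links)  # ['/apple-watch?pid=1']
```

If Flipkart changes its markup, only the class name in the XPath needs updating; the extraction logic stays the same.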


Extraction Process

Now that we have the link to all the products and know how to extract a particular element from a website, we can extract the attributes required by us for the smartwatches. For that purpose, we have defined the following functions:

def titleandbrand(dom1):
   try:
       title = dom1.xpath('//span[@class="B_NuCI"]/text()')[0]
   except Exception as e:
       title = 'No title available'
   if title == 'No title available':
       product_brand = 'No brand available'
   else:
       product_brand = title.split()[0]
   return title, product_brand


def salespriceandmrp(dom1):
   try:
       sales_price = dom1.xpath('//div[@class="_30jeq3 _16Jk6d"]/text()')[0].replace(u'\u20B9','')
   except Exception as e:
       sales_price = 'No price available'
   try:
       mrp = dom1.xpath('//div[@class="_3I9_wc _2p6lqe"]/text()')[1]
   except Exception as e:
       mrp = sales_price
   return sales_price, mrp


def discount(dom1):
   try:
       disc = dom1.xpath('//div[@class="_3Ay6Sb _31Dcoz"]/span/text()')[0].split()[0].replace('%','')
   except Exception as e:
       disc = 0
   return disc


def noofratings(dom1):
   try:
       no_of_ratings = dom1.xpath('//div[@class="col-12-12"]/span/text()')[0].split()[0]
   except Exception as e:
       no_of_ratings = 0
   return no_of_ratings


def noofreviews(dom1):
   try:
       no_of_reviews = dom1.xpath('//div[@class="col-12-12"]/span/text()')[1].split()[0]
   except Exception as e:
       no_of_reviews = 0
   return no_of_reviews


def overallrating(dom1):
   try:
       overall_rating = dom1.xpath('//div[@class="_2d4LTz"]/text()')[0]
   except Exception as e:
       overall_rating = 0
   return overall_rating


def description(dom1):
   def specs_fallback():
       # Build a description from the specifications table when no
       # description text is available on the page
       specs_title = dom1.xpath('//tr[@class="_1s_Smc row"]/td/text()')
       specs_detail = dom1.xpath('//tr[@class="_1s_Smc row"]/td/ul/li/text()')
       specs_dict = {}
       for i in range(len(specs_title)):
           specs_dict[specs_title[i]] = specs_detail[i]
       return str(specs_dict)

   try:
       desc = dom1.xpath('//div[@class="_1mXcCf RmoJUa"]/text()')
       desc = desc[0] if desc else specs_fallback()
   except Exception as e:
       desc = specs_fallback()
   return desc


def memory(dom1):
   # Default first, so mem is defined even when the features list is empty
   mem = 'Memory data not available'
   try:
       features = dom1.xpath('//li[@class="_21lJbe"]/text()')
       for ele in features:
           if 'GB' in ele:
               mem = ele.replace('GB', '')
               break
   except Exception as e:
       mem = 'Memory data not available'
   return mem
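The string handling inside these helpers can be illustrated with made-up values: titleandbrand() takes the first word of the title as the brand, and salespriceandmrp() strips the rupee sign (\u20B9) from the price string.

```python
# Made-up product title; the brand is simply the first whitespace-separated word.
title = "APPLE Watch SE GPS 40 mm Smartwatch"
product_brand = title.split()[0]
print(product_brand)  # APPLE

# Made-up price string; the rupee sign (\u20B9) is removed before saving.
raw_price = '\u20B92,999'
sales_price = raw_price.replace('\u20B9', '')
print(sales_price)  # 2,999
```

Note that the thousands separator is kept as-is here, exactly as in the scraper; converting prices to numbers would be a cleaning step for the analysis stage.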

Writing to a CSV file

We need to store the extracted data somewhere so that we can later use it for purposes such as analysis. Now we will see how to write the extracted data to a CSV file.


Here, we open a CSV file named "smartwatch_data.csv" in write mode as f. Then we create a writer object named theWriter and initialize a list named heading, which contains the heading of each column in our data. We write it to the CSV file using the writerow() method.

with open('smartwatch_data.csv','w',newline='', encoding='utf-8') as f:
   theWriter = csv.writer(f)  # csv.writer matches the "import csv" used above
   heading = ['Product_url','Product_name','Brand','Sale_price','MRP','Discount_percentage','Memory','No_of_ratings','No_of_reviews','Star_rating','Description']
   theWriter.writerow(heading)
   for product in product_list:
       product_url = base_url + product
       product_dom = get_dom(product_url)
       product_url = product_url.split('&marketplace')[0]
       title, brand = titleandbrand(product_dom)
       sales_price, mrp = salespriceandmrp(product_dom)
       disc = discount(product_dom)
       no_of_ratings = noofratings(product_dom)
       no_of_reviews = noofreviews(product_dom)
       overall_rating = overallrating(product_dom)
       desc = description(product_dom)
       mem = memory(product_dom)
       record = [product_url, title, brand, sales_price, mrp, disc, mem, no_of_ratings, no_of_reviews, overall_rating, desc]
       theWriter.writerow(record)

After writing the headings into the CSV file, we iterate through each product in the product_list list. The links in product_list do not include Flipkart's domain name, so we prepend the base_url we declared at the beginning of the program, fetch the page with get_dom(), and pass the resulting parse tree to each function. Each function extracts and returns a different attribute of the product. These values are stored in a list named record, and the details of that product are written as the next row of the CSV file using writerow(record).
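The writerow() calls can be tried out on an in-memory buffer before touching a real file; the header row matches the script, while the record values below are invented:

```python
import csv
import io

# In-memory stand-in for smartwatch_data.csv; the record values are made up.
buf = io.StringIO()
theWriter = csv.writer(buf)
theWriter.writerow(['Product_url', 'Product_name', 'Sale_price'])
theWriter.writerow(['https://www.flipkart.com/apple-watch', 'Apple Watch SE', '29,999'])
print(buf.getvalue())
```

csv.writer takes care of quoting, so values containing commas (like the price above) are written safely.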


Conclusion

In this blog, we learned how to scrape smartwatch data from Flipkart using Python. We saw different libraries used for scraping. By applying some minor changes, we can extract data from any product category from Flipkart. In the next blog, we will analyze the data we have extracted, visualize it and understand the importance of data analysis in the market.


You can download the extracted data from this link.


At Datahut, we help market research companies get all kinds of data in every industry for conducting competitive research. If you need web data from another category for analysis, Contact Datahut


Related Reading:

  1. How to Build an Amazon Price Tracker using Python

  2. How to Scrape Product Information from Walmart using Python

  3. The Indian Washing Machine Market: Product Category Analysis and Insights

  4. The Indian Smartphone Market Analysis & Insights



