Scraping AliExpress: Extracting Digital Camera Data
AliExpress is an online retail service based in China and owned by the Alibaba Group. It is made up of small businesses in China and other locations like Singapore. From gadgets and clothing to home appliances and electronics, it offers a large variety of products to international online buyers. This has made AliExpress a treasure trove of information in this digital age.
Scraping product data from AliExpress will help us to get valuable insights into the present market conditions. In this blog, we will be learning how to scrape product data from the AliExpress website. For this purpose, we will be scraping product data from the digital camera category and storing it in a CSV file.
Why Scrape AliExpress?
Scraping AliExpress offers several benefits to businesses and individuals alike, from market research to competitive analysis. Some of the main motivations are:
Market Trends Analysis: By scraping AliExpress data, you gain access to a vast repository of product listings, prices, and descriptions. Analyzing this information enables you to track evolving market trends, identify emerging product categories, and stay ahead of consumer preferences.
Competitor Insights: Uncover valuable intelligence on your competitors. Scraping product data from AliExpress allows you to monitor their offerings, pricing strategies, and customer engagement, helping you fine-tune your own business approach for a competitive edge.
Product Research and Development: Scraping AliExpress provides a goldmine of ideas for product research and development. Explore customer reviews, ratings, and feedback to understand pain points and preferences, guiding your innovation efforts.
Pricing Strategy: Pricing your products competitively is crucial. By scraping pricing data from AliExpress, you can benchmark your prices against similar offerings in the market, ensuring your pricing strategy is both attractive to customers and profitable for your business.
Enhanced Customer Understanding: The reviews and comments left by AliExpress customers can offer valuable insights into consumer sentiments and preferences. Leveraging this data can help you tailor your products and marketing strategies to better resonate with your target audience.
Supply Chain Optimization: For businesses involved in dropshipping or sourcing products, scraping AliExpress can streamline your supply chain management. Accurate and up-to-date product information aids in making informed decisions about inventory and sourcing.
Data-Driven Decision-Making: In the age of data, informed decisions reign supreme. Scraping AliExpress empowers you to make data-driven choices, minimizing risks and maximizing opportunities for growth.
Let's dive into the scraping process.
Before we move on to the scraping process, we need to identify the attributes to be extracted for each product. We will extract the following attributes of each product:
Product_url: It is the address of a particular product on the Internet.
Product_name: It is the name of the product on the AliExpress website.
Sale_price: It is the selling price of a product after the discounts are applied.
Mrp: It is the maximum retail price (MRP) of a product, that is, its listed price before any discount.
Discount: It is the percentage deducted from the MRP of a product.
Rating: It is the overall rating a product has received from the customers.
No_of_reviews: It is the total number of reviews the product has received.
Seller_name: It is the name of the seller or store selling the product.
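As a rough illustration of how these attributes fit together, the sketch below models one product as a Python dataclass and shows how the discount relates to the two prices. The class name ProductRecord, the helper discount_percent, and the sample values are hypothetical and are not part of the scraper that follows, which stores every field as the raw text shown on the page.

```python
from dataclasses import dataclass, astuple


@dataclass
class ProductRecord:
    """One row of scraped data; every field holds raw text from the page."""
    product_url: str
    product_name: str
    sale_price: str
    mrp: str
    discount: str
    rating: str
    no_of_reviews: str
    seller_name: str


def discount_percent(mrp: float, sale_price: float) -> float:
    """Return the percentage deducted from the MRP to reach the sale price."""
    if mrp <= 0:
        raise ValueError("mrp must be positive")
    return (mrp - sale_price) / mrp * 100


# Made-up values, for illustration only
example = ProductRecord(
    product_url="https://example.com/item/1",
    product_name="Example camera",
    sale_price="60.00",
    mrp="80.00",
    discount="25%",
    rating="4.5",
    no_of_reviews="12 Reviews",
    seller_name="Example Store",
)
print(astuple(example))               # the eight fields in CSV column order
print(discount_percent(80.0, 60.0))   # 25.0
```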
Importing the Required Libraries
Once the attributes are identified, we can start the coding process for scraping the AliExpress website. The first step in the coding process is to import the required libraries. Here, we will be scraping AliExpress using Selenium which is a tool used to automate web browsers. We will be using various libraries of Selenium along with some others to accomplish our task successfully. The libraries to be imported are:
Selenium web driver is a tool used for web automation. It allows a user to automate web browser actions such as clicking a button, filling in fields, and navigating to different websites.
ChromeDriverManager is a library that simplifies the process of downloading and installing the Chrome driver, which Selenium requires to control the Chrome web browser.
The By class from selenium.webdriver.common.by is used to locate elements on a web page using different strategies such as ID, class name, and XPath.
The writer function from the csv library returns a writer object that is used to write tabular data in CSV format.
The sleep function from the time library is used to provide a pause or delay in the execution of a program for a specified number of seconds.
# importing required libraries
from csv import writer
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
After importing the required libraries, we need to initialize a few things before scraping the digital camera data from the AliExpress website. First, we initialize a web driver: ChromeDriverManager downloads a driver binary that matches the installed Chrome browser, and we create an instance of the Chrome web driver that uses it. Once initialized, a Chrome browser window is opened so that Selenium can interact with it. The window is then maximized using the maximize_window() function.
Next, we initialize an empty list named product_link_list. We will be first scraping the link of each product from all the resulting pages when we search for a digital camera. This list will be used to store all these links. We have defined a variable named page_url, which will hold the link of the web page we are currently scraping. We will initialize it with the link of the first resulting page when we search for a digital camera. With this, our initialization process is complete.
# Initializing web driver (Selenium 4 takes the driver path through a Service object)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
# Initializing an empty list to store product urls
product_link_list = []
# Url of first resulting page
page_url = "https://www.aliexpress.com/category/200216589/digital-cameras.html?CatId=200216589&g=y&isCategoryBrowse=true&isrefine=y&page=1"
Extraction of Product URLs
As mentioned above, we will be first scraping the link of each product from all the resulting pages when we search for a digital camera. To implement this, we will be using a while loop so that the scraping goes on till the last resulting page. The code for the same is given as follows:
while True:
    driver.get(page_url)
    sleep(5)
    # Scroll to the bottom so that lazily loaded products appear
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    sleep(5)
    page_product_links = driver.find_elements(By.XPATH, '//div[@class="list--gallery--34TropR"]/a')
    for product in page_product_links:
        product_link = product.get_attribute('href')
        product_link_list.append(product_link)
    try:
        next_button = driver.find_element(By.XPATH, './/li[@class="pagination--paginationLink--2ucXUo6 next-next"]')
        next_button.click()
        page_url = driver.current_url
    except Exception:
        # No 'next' button on the last page: stop paginating
        break
Here, inside the while loop, we first call the get() method with page_url passed as a parameter. It is a WebDriver method that opens the URL passed to it. The call execute_script("window.scrollTo(0,document.body.scrollHeight)") scrolls the web page to the bottom. This is needed because the AliExpress website loads content dynamically: not all of the contents of the page are loaded initially, and the rest are loaded in response to actions such as scrolling. Therefore, we first scroll the page so that all of its products are loaded.
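A single jump to the bottom may not trigger every lazily loaded batch of products. One possible refinement, sketched below, keeps scrolling until the page height stops growing. The helper name scroll_until_stable and its parameters are hypothetical; the sketch assumes only that the driver offers Selenium's execute_script method and that the page grows taller as more products load.

```python
from time import sleep


def scroll_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops changing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        sleep(pause)  # give the newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded by the last scroll
        last_height = new_height
    return max_rounds
```

In the loop above, a call such as scroll_until_stable(driver) could stand in for the single execute_script call and the sleep that follows it.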
Now that all the products are loaded, we need to scrape the product links. For this, we use the find_elements() method and locate the product link elements on the web page using their XPath and the By class. This returns the product link elements as a list. To get the actual product link from each element, we call the get_attribute() method on it, extract the 'href' attribute, and store it in product_link_list.
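The same product can appear on more than one result page, and get_attribute('href') returns None when an anchor has no href. A minimal sketch of a clean-up step for the collected links is shown below. The function name unique_links and the sample URLs are hypothetical, and stripping the query string assumes that it carries only tracking parameters, which should be checked against real links before relying on it.

```python
from urllib.parse import urlsplit, urlunsplit


def unique_links(links):
    """Drop empty links and duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for link in links:
        if not link:
            continue  # skip None and empty strings
        parts = urlsplit(link)
        # Assumption: the query string and fragment hold tracking data only
        cleaned = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# Made-up links, for illustration only
sample = [
    "https://example.com/item/1.html?spm=a",
    None,
    "https://example.com/item/1.html?spm=b",
    "https://example.com/item/2.html",
]
print(unique_links(sample))
```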
Next, we need to move on to the next page. There is a 'next' button at the end of each page, and clicking it takes us to the next page. So, we locate the 'next' button on the current page using its XPath and store it in a variable named next_button. When click() is called on this variable, the next page is loaded, and the current_url property of the driver gives the URL of the new page, which we assign to page_url. On the last page there is no 'next' button, so an exception is raised while locating it. We handle this exception by breaking out of the while loop, at which point product_link_list contains the links of all the products.
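Because the starting URL carries a page=1 query parameter, another way to reach later pages might be to rewrite that parameter instead of clicking the button. The sketch below shows only the URL manipulation, using the standard library. The helper with_page is hypothetical, and whether AliExpress serves the expected results for an arbitrary page value is an assumption that has to be verified in the browser.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return url with its 'page' query parameter set to the given number."""
    parts = urlsplit(url)
    # Keep every existing parameter except 'page', then append the new value
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    query.append(("page", str(page)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))


first_page = ("https://www.aliexpress.com/category/200216589/digital-cameras.html"
              "?CatId=200216589&g=y&isCategoryBrowse=true&isrefine=y&page=1")
print(with_page(first_page, 2))
```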
We will now define functions to extract each attribute.
# function to extract product name
def get_product_name():
    try:
        product_name = driver.find_element(By.XPATH, '//div[@class="title--wrap--Ms9Zv4A"]/h1').text
    except Exception as e:
        product_name = "Not available"
    return product_name

# function to extract sale price
def get_sale_price():
    try:
        sale_price = ''
        sale_price_elements = driver.find_elements(By.XPATH, '//div[@class="es--wrap--erdmPRe notranslate"]/span')
        for sale_price_ele in sale_price_elements:
            sale_price_ele = sale_price_ele.text
            sale_price = sale_price + sale_price_ele
    except Exception as e:
        sale_price = "Not available"
    return sale_price

# function to extract mrp
def get_mrp():
    try:
        mrp = driver.find_element(By.XPATH, '//span[@class="price--originalText--Zsc6sMv"]').text
    except Exception as e:
        mrp = 'Not available'
    return mrp

# function to extract discount
def get_discount():
    try:
        discount = driver.find_element(By.XPATH, '//span[@class="price--discount--xET8qnP"]').text
    except Exception as e:
        discount = 'Not available'
    return discount

# function to extract rating
def get_rating():
    try:
        rating = driver.find_element(By.XPATH, '//span[@class="overview-rating-average"]').text
    except Exception as e:
        rating = 'Not available'
    return rating

# function to extract number of reviews
def get_reviews():
    try:
        no_of_reviews = driver.find_element(By.XPATH, '//a[@class="product-reviewer-reviews black-link"]').text
    except Exception as e:
        no_of_reviews = 'Not available'
    return no_of_reviews

# function to extract seller name
def get_seller():
    try:
        seller_name = driver.find_element(By.XPATH, '//a[@class="store-header--storeName--vINzvPw"]').text
    except Exception as e:
        seller_name = 'Not available'
    return seller_name
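Most of the extractor functions above share one pattern: look up an XPath, return the element's text, and fall back to a placeholder when the lookup fails. A possible refactoring of that pattern is sketched below. The helper text_or_default is hypothetical and not part of the original script. It passes the locator strategy as the string "xpath", which is the value of Selenium's By.XPATH constant, so the sketch itself does not need to import Selenium.

```python
def text_or_default(driver, xpath, default="Not available"):
    """Return the text of the first element matching xpath, or default on failure."""
    try:
        return driver.find_element("xpath", xpath).text
    except Exception:
        # Selenium raises NoSuchElementException when nothing matches
        return default
```

With such a helper, get_mrp() would reduce to a single line: return text_or_default(driver, '//span[@class="price--originalText--Zsc6sMv"]').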
Writing to a CSV File
The extracted data needs to be stored so that it can be further used for other purposes like analysis. Now we will see how to store the extracted data to a csv file.
First, we open a file named "digital_camera_data.csv" in write mode and create a writer object named theWriter. The headings of the different columns of the CSV file are first collected in a list and then written to the file using the writerow() function.
Now we extract the information about each product. For this, we iterate through each product link in product_link_list, call the get() method to open the product page, and call the functions defined earlier to extract the required attributes. If no MRP is found on the page, we take the product to be undiscounted and use its sale price as the MRP. The attribute values returned are first stored as a list and then written to the CSV file using the writerow() function. After the process is completed, the quit() method is called, which closes the web browser that was opened by the Selenium web driver.
Note that the sleep() function is called between the different function calls. These pauses slow down the scraper, which reduces the risk of getting blocked by the website.
# Opening a CSV file
with open('digital_camera_data.csv','w',newline='', encoding='utf-8') as f:
    theWriter = writer(f)
    heading = ['product_url', 'product_name', 'sale_price', 'mrp', 'discount', 'rating', 'no_of_reviews', 'seller_name']
    theWriter.writerow(heading)
    for product in product_link_list:
        driver.get(product)
        sleep(5)
        product_name = get_product_name()
        sleep(3)
        sale_price = get_sale_price()
        sleep(3)
        mrp = get_mrp()
        sleep(3)
        discount = get_discount()
        sleep(3)
        if mrp == 'Not available':
            mrp = sale_price
        rating = get_rating()
        sleep(3)
        no_of_reviews = get_reviews()
        sleep(3)
        seller_name = get_seller()
        sleep(3)
        record = [product, product_name, sale_price, mrp, discount, rating, no_of_reviews, seller_name]
        theWriter.writerow(record)

# Closing the web browser
driver.quit()
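The fixed pauses in this loop add up to 26 seconds per product (one 5-second pause and seven 3-second pauses). One common variation, sketched below, is to pause for a random duration instead, so that requests do not arrive at perfectly regular intervals. The helper polite_sleep and its default bounds are hypothetical, and whether random delays make blocking any less likely on AliExpress is an assumption rather than something the original script demonstrates.

```python
import random
from time import sleep


def polite_sleep(min_seconds=3.0, max_seconds=7.0, sleep_fn=sleep):
    """Pause for a random duration between the two bounds and return it."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep_fn(delay)  # sleep_fn can be swapped out, for example in tests
    return delay
```

A call such as polite_sleep() after driver.get(product) could take the place of the fixed sleep(5).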
In this blog, we learned the process of scraping digital camera data from AliExpress using some powerful Python libraries and techniques. This data is of great importance as it provides valuable information about the market trends and the e-commerce landscape. It is of great value to businesses as it helps them to track pricing and analyze their competitors and customer sentiments.
Ready to harness the power of data-driven decisions for your own business ventures? Dive into the world of seamless web scraping with DataHut's web scraping services. Contact us today!