Scraping Amazon Product Listings – What You Should Know

Amazon is one of the world’s largest e-commerce sites, with millions of products listed. This data can be used for a variety of purposes.

If you have tried to collect it, you probably know that Amazon is somewhat difficult to scrape, but it is definitely not impossible. To reach the product data you need, a scraper has to dig deep into the site. How hard the extraction is depends on the anti-scraping mechanisms Amazon has in place.

Although there are many application-level methods for blocking bots, Amazon seems to rely on IP-based captchas most of the time. This means that if you download too many pages from the same IP address at very high speed, Amazon will respond with a captcha. Captchas are almost impossible to beat, so only a carefully designed scraper can get you data from Amazon. Never bombard Amazon with thousands of requests per second.
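
One way to keep the request rate low is to enforce a minimum gap between consecutive requests. The sketch below is only an illustration of that idea, not a description of any particular production setup: the Throttle class, its parameter names, and the default delay values are all assumptions chosen for the example.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between requests.

    Hypothetical helper: the name and the default delays are assumptions.
    """

    def __init__(self, min_delay=2.0, jitter=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # minimum seconds between two requests
        self.jitter = jitter        # upper bound of the random extra delay
        self._clock = clock         # injectable so the class can be tested
        self._sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        if self._last is not None:
            gap = self.min_delay + random.uniform(0, self.jitter)
            remaining = gap - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

A scraper would call wait() on a single shared Throttle instance right before each page download. In Scrapy, the built-in DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED settings serve a similar purpose without custom code.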

The best way to circumvent IP-based captchas is to use an IP rotator that changes IP addresses periodically. We have used the Python Scrapy framework to write web scrapers that extract data from Amazon with great success. Apache Nutch is also a good choice if you are looking for a non-Python solution.
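
The rotation idea can be sketched in a few lines of Python. This is a minimal illustration that assumes you already have a pool of proxy URLs; the ProxyRotator class, the rotate_every parameter, and the choice to rotate after a fixed number of requests (rather than on a timer) are assumptions made for the example.

```python
import itertools


class ProxyRotator:
    """Cycle through a pool of proxies, switching every N requests.

    Hypothetical helper: the class and parameter names are illustrative.
    """

    def __init__(self, proxies, rotate_every=10):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)  # endless round-robin iterator
        self.rotate_every = rotate_every        # requests served per proxy
        self._used = 0                          # requests on current proxy
        self._current = next(self._cycle)

    def get(self):
        """Return the proxy to use for the next request."""
        if self._used >= self.rotate_every:
            self._current = next(self._cycle)   # move on to the next proxy
            self._used = 0
        self._used += 1
        return self._current
```

With Scrapy, the value returned by get() could be assigned to request.meta["proxy"], which the built-in HttpProxyMiddleware uses when sending the request.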

These are the most common items extracted from Amazon:

  • Product Name  

  • Price

  • Product Features

  • Product Type

  • Manufacturer & Brand

  • Deals & Offers

  • Product Description

  • Company Description

  • Customer Reviews

  • SKU

  • Rank of product

  • Rank in a particular category
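
The fields listed above could be collected into a single record per product. The dataclass below is one possible layout, offered as a sketch; the class name, the field names, and the choice of types (for example, keeping the price as a string) are assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProductRecord:
    """One scraped product; every field except the name is optional."""

    name: str
    price: Optional[str] = None              # kept as text, e.g. with currency
    features: List[str] = field(default_factory=list)
    product_type: Optional[str] = None
    manufacturer: Optional[str] = None
    brand: Optional[str] = None
    deals: List[str] = field(default_factory=list)
    description: Optional[str] = None
    company_description: Optional[str] = None
    reviews: List[str] = field(default_factory=list)
    sku: Optional[str] = None
    rank: Optional[int] = None               # overall rank of the product
    category_rank: Optional[int] = None      # rank within one category
```

Leaving most fields optional reflects the fact that not every listing exposes every item, so a scraper can fill in whatever it finds.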

Scraped data can be easily exported to CSV, XML, or JSON, or loaded into a database such as MySQL or MongoDB. Listings that span multiple pages and categories can be extracted just as easily.
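
As a small illustration of the export step, the sketch below serializes a list of scraped records to CSV and JSON using only the Python standard library. The function names and the assumption that each record is a plain dictionary are illustrative; XML and database output are left out for brevity.

```python
import csv
import io
import json


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    # Keys not listed in fieldnames are ignored; missing keys become "".
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize a list of dicts to pretty-printed JSON text."""
    return json.dumps(records, indent=2, ensure_ascii=False)
```

The returned strings can then be written to files or passed on to another system. Scrapy users can often skip this step, since its feed exports write CSV, JSON, and XML directly.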

