The e-commerce industry is becoming more and more data-driven. Scraping product data from Amazon and other large e-commerce websites is a crucial piece of competitive intelligence. Amazon alone holds a massive volume of data (more than 120 million products at the time of writing), and extracting it every day is a huge task.
At Datahut, we work with many customers to help them get access to data.
However, some companies need to set up an in-house data extraction team, for a variety of reasons. This blog post is for those who need to understand how to set up and scale such a team.
Assumptions
These assumptions give you a rough idea of the scale, effort, and challenges we will be dealing with:
You are looking to extract product information from 20 large e-commerce websites, including Amazon.
You need data from 20 to 25 subcategories within the electronics category on each website. The total number of categories and subcategories is around 450.
The refresh frequency differs between subcategories. Out of 20 subcategories (from one website), 10 need a daily refresh, 5 need data once every two days, 3 need data once every three days, and 2 need data once a week.
Four of the websites have anti-scraping technologies in place.
The volume of data varies from 3 million to 7 million records per day, depending on the day of the week.
Understanding e-commerce data
We need to understand the data we’re extracting. For demonstration purposes, let’s take Amazon. These are the fields we need to extract:
Product URL
Breadcrumb
Product Name
Product description
Pricing
Discount
Stock details (in stock or not)
Image URL
Average star rating
The Frequency
The refresh frequency differs between subcategories. Out of 20 subcategories (from one website), 10 need a daily refresh, 5 need data once every two days, 3 need data once every three days, and 2 need data once a week. The frequency could change later, depending on how the business team’s priorities change.
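As a minimal sketch of how such a refresh schedule could be expressed in code, the snippet below maps subcategories to a refresh interval in days and returns the ones due on a given date. The subcategory names, the `SCHEDULE` mapping, and the `due_subcategories` function are hypothetical and only illustrate the idea.

```python
from datetime import date

# Hypothetical subcategories mapped to their refresh interval in days
# (1 = daily, 2 = every two days, 3 = every three days, 7 = weekly).
SCHEDULE = {
    "laptops": 1,
    "headphones": 2,
    "cameras": 3,
    "printers": 7,
}


def due_subcategories(today, start, schedule=SCHEDULE):
    """Return the subcategories whose refresh falls on `today`.

    `start` is the date on which every subcategory was first scraped.
    """
    days_elapsed = (today - start).days
    return sorted(
        name
        for name, interval in schedule.items()
        if days_elapsed % interval == 0
    )


start = date(2024, 1, 1)
print(due_subcategories(date(2024, 1, 3), start))
```

Keeping the intervals in plain data rather than in scraper code means a change in the business team's priorities only requires editing the mapping.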
Understanding specific requirements
When we work on large data extraction projects with our enterprise customers, they always have special requirements. These usually exist to satisfy internal compliance guidelines or to improve the efficiency of an internal process.
Here are common special requests:
Have a copy of the extracted HTML (the unparsed data) dumped into a storage system like Dropbox or Amazon S3.
Build an integration with a tool to monitor the progress of data extraction. This could be a simple Slack integration that notifies you when a data delivery is complete, or a complex pipeline into BI tools.
Getting screenshots from the product page.
If you have such requirements, either now or in the future, you need to plan ahead. A common case is storing data so it can be analysed later.
Reviews
In some cases, you need to extract reviews as well. A common use case is improving brand equity and brand reputation by analysing reviews. Review extraction is a special case, and most teams miss it during the project planning stage and overshoot the budget as a result.
You can read about review extraction here – Extracting reviews from Amazon.
Here is what is special about reviews:
There might be 10,000 reviews for a product like the iPhone 10. If you want to extract all 10,000 reviews, you may need to send as many as 10,000 requests. Factor this in when you estimate resources.
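The estimate above can be turned into a small planning helper. This is only a sketch: `estimate_review_requests` and its `reviews_per_request` parameter are hypothetical, and the default of one review per request mirrors the worst case described above. If a site returns several reviews per page, the number of requests drops accordingly.

```python
import math


def estimate_review_requests(review_count, reviews_per_request=1):
    """Estimate how many requests are needed to collect all reviews.

    `reviews_per_request` is an assumption about the target site: 1 models
    the worst case of one request per review.
    """
    if reviews_per_request < 1:
        raise ValueError("reviews_per_request must be at least 1")
    return math.ceil(review_count / reviews_per_request)


worst_case = estimate_review_requests(10_000)
paginated = estimate_review_requests(10_000, reviews_per_request=10)
print(worst_case, paginated)
```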
The Data extraction process
A web scraper is designed around the structure of a website. In layman’s terms, you send a request to the site, the website returns an HTML page, and you parse the information out of the HTML.
In a typical low-volume data extraction use case, you write a web scraper in Python, perhaps with a framework like Scrapy, run it from your terminal, and save the output as a CSV file. Simple.
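To make the parse-and-export step concrete, here is a minimal sketch that uses only the Python standard library. It is not how Scrapy works internally. The sample HTML, the `FIELD_SELECTORS` mapping, and the field names are made up for illustration and do not describe any real website. A real scraper would fetch the page over HTTP instead of using an inline string.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical product page; a real scraper would download this.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Example Phone 64GB</h1>
  <span class="price">499.00</span>
  <span class="rating">4.5</span>
</body></html>
"""

# Map (attribute, value) pairs to output field names (illustrative only).
FIELD_SELECTORS = {
    ("id", "title"): "product_name",
    ("class", "price"): "price",
    ("class", "rating"): "rating",
}


class ProductParser(HTMLParser):
    """Collects the text of the elements listed in FIELD_SELECTORS."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current_field = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        for attr in attrs:
            if attr in FIELD_SELECTORS:
                self._current_field = FIELD_SELECTORS[attr]

    def handle_data(self, data):
        if self._current_field and data.strip():
            self.record[self._current_field] = data.strip()
            self._current_field = None


def parse_product(html):
    parser = ProductParser()
    parser.feed(html)
    return parser.record


def to_csv(records, fieldnames):
    """Serialize records to CSV text (written to a file in practice)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = parse_product(SAMPLE_HTML)
csv_text = to_csv([record], ["product_name", "price", "rating"])
print(csv_text)
```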
At huge volumes, say 5 million products a day, everything changes.
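A quick back-of-the-envelope calculation shows why. The sketch below assumes one request per product and an even spread over the crawl window. Both are simplifications, and `required_requests_per_second` is a hypothetical helper.

```python
def required_requests_per_second(products_per_day, window_hours=24.0,
                                 requests_per_product=1.0):
    """Sustained request rate needed to cover the daily volume."""
    total_requests = products_per_day * requests_per_product
    return total_requests / (window_hours * 3600)


full_day = required_requests_per_second(5_000_000)
eight_hours = required_requests_per_second(5_000_000, window_hours=8)
print(f"{full_day:.2f} req/s over 24 h, {eight_hours:.2f} req/s over 8 h")
```

Under these assumptions, 5 million products a day means sustaining roughly 58 requests per second around the clock, and about three times that if the crawl has to finish within eight hours.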
Data extraction challenges
1. Writing & Maintaining Scrapers
You can use Python to write scrapers that extract data from e-commerce websites. In our case, we need to extract data from 20 subcategories of a website. Depending on the structural variations between them, you will need multiple parsers within your scraper to get the data.
Amazon and other large e-commerce websites change the pattern of categories and subcategories frequently. So the person responsible for maintaining web scrapers needs to make constant adjustments to the scraper code.
One or two members of your team should be responsible for writing scrapers and parsers when the business team adds more categories and websites. Scrapers usually need adjustments every few weeks. Even a minor change in the page structure affects the fields you scrape: depending on the logic of the scraper, it might either give you incomplete data or crash the scraper. Eventually, you will end up building a scraper management system.
Web scrapers depend on the way each website is built, and every website represents its data differently. Handling all this mess requires a common language: a unified format. This format will evolve over time, so it is worth getting the core of it right the first time.
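One way to picture a unified format is a single record type that every site-specific scraper maps its output into. The sketch below is an assumption about what such a format might look like, built from the fields listed earlier in this post. `ProductRecord`, `SITE_FIELD_MAPS`, and the site names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductRecord:
    """Unified record shared by all scrapers (illustrative field set)."""
    product_url: str
    product_name: str = ""
    breadcrumb: str = ""
    description: str = ""
    price: Optional[float] = None
    discount: Optional[float] = None
    in_stock: Optional[bool] = None
    image_url: str = ""
    average_rating: Optional[float] = None


# Hypothetical mapping from each site's raw field names to unified names.
SITE_FIELD_MAPS = {
    "site_a": {"url": "product_url", "title": "product_name", "cost": "price"},
    "site_b": {"link": "product_url", "name": "product_name", "price_usd": "price"},
}


def to_unified(site, raw):
    """Convert a site-specific dict into a ProductRecord.

    Raises TypeError if the raw record lacks the product URL.
    """
    mapping = SITE_FIELD_MAPS[site]
    unified = {
        target: raw[source]
        for source, target in mapping.items()
        if source in raw
    }
    return ProductRecord(**unified)


a = to_unified("site_a", {"url": "https://example.com/p/1", "title": "Phone", "cost": 499.0})
b = to_unified("site_b", {"link": "https://example.com/p/1", "name": "Phone", "price_usd": 499.0})
print(a == b)
```

Because the per-site differences live in the mapping, the downstream QA and storage code only ever has to deal with one shape of record.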
Detecting changes early is the key to ensuring you don’t miss a scheduled data delivery. You need to build a tool that detects pattern changes and alerts the scraper maintenance team. This tool should ideally run every 15 minutes.
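A very simple form of pattern change detection is to fingerprint the structure of a page (tags, ids, and classes, ignoring the text) and compare it against a stored baseline. The sketch below illustrates that idea with the standard library only. It is not Datahut's early warning system, and `structure_fingerprint` and `has_changed` are hypothetical names.

```python
import hashlib
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects the tag/id/class skeleton of a page, ignoring its text."""

    def __init__(self):
        super().__init__()
        self.skeleton = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        self.skeleton.append(
            (tag, attr_map.get("id") or "", attr_map.get("class") or "")
        )


def structure_fingerprint(html):
    parser = StructureParser()
    parser.feed(html)
    joined = "|".join(",".join(part) for part in parser.skeleton)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def has_changed(baseline_html, current_html):
    """True if the page structure differs from the stored baseline."""
    return structure_fingerprint(baseline_html) != structure_fingerprint(current_html)


baseline = '<div id="product"><span class="price">499.00</span></div>'
same_layout = '<div id="product"><span class="price">459.00</span></div>'
new_layout = '<div id="product"><span class="offer-price">459.00</span></div>'
print(has_changed(baseline, same_layout), has_changed(baseline, new_layout))
```

Real pages often contain dynamic markup, so hashing the whole skeleton is likely to raise false alarms. Restricting the fingerprint to the elements your parsers actually depend on is probably a more practical starting point.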
At Datahut, we built an early warning system for website changes using Python. Do you need us to write a blog on how to build a simple website pattern change detection system? Do let us know in the comments.
2. Big data & scraper management systems
Handling a lot of scrapers from the terminal isn’t a good idea. You need to find more productive ways to handle them. At Datahut, we built a GUI that acts as an interface to the underlying platform for deploying and managing scrapers, so we don’t have to depend on the terminal every time.
Managing huge volumes of data is a big challenge. You either need to build a data warehousing infrastructure in-house or use a cloud-based tool like Snowflake.
3. Auto scraper generator
Once you build a lot of scrapers, the next thing is to improve your scraping framework itself. You can find common structural patterns and use them to build scrapers faster. You should think about building an auto scraper framework once you have a considerable number of scrapers.
4. Anti-scraping & Change in anti-scraping
As noted in the assumptions, some websites use anti-scraping technologies to prevent data extraction or make it difficult. They either build their own IP-based blocking solution or install a third-party service. Getting around anti-scraping at a huge scale is not simple. You need to purchase a lot of IPs and rotate them efficiently.
For a project that requires 3-6 million records every day, you’ll need two things:
You’ll need a person to manage the proxies and the IP rotator. If these are not managed well, your IP purchase bill will skyrocket.
You’ll need to partner with 3-4 IP vendors.
Sometimes e-commerce websites block a whole range of IPs, and your data delivery will be interrupted. To avoid this, use IPs from multiple vendors. At Datahut, we have partnered with more than 20 providers to ensure we have enough IPs in our pool. Depending on your scale, you should decide how many IP partners you need.
Rotating IPs alone won’t do the job. There are many ways to block bots, and e-commerce websites keep changing their methods. You need a person with a research mindset to find solutions and keep the scrapers running.
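As a rough illustration of rotating IPs across several vendors, the sketch below interleaves the vendors' proxies so that consecutive requests use different providers, and skips proxies that have been marked as blocked. The `ProxyRotator` class, the vendor names, and the addresses are all hypothetical. A production rotator would also need cooldowns, health checks, and cost tracking.

```python
from itertools import chain, zip_longest


class ProxyRotator:
    """Round-robin rotation over proxies from several vendors."""

    def __init__(self, proxies_by_vendor):
        # Interleave vendors: a1, b1, a2, b2, ... so that a blocked IP range
        # from one vendor does not stall consecutive requests.
        interleaved = chain.from_iterable(zip_longest(*proxies_by_vendor.values()))
        self._proxies = [proxy for proxy in interleaved if proxy is not None]
        self._blocked = set()
        self._index = 0

    def mark_blocked(self, proxy):
        self._blocked.add(proxy)

    def next_proxy(self):
        # Try each proxy at most once per call before giving up.
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._index % len(self._proxies)]
            self._index += 1
            if proxy not in self._blocked:
                return proxy
        raise RuntimeError("all proxies are blocked")


rotator = ProxyRotator({
    "vendor_a": ["10.0.0.1:8000", "10.0.0.2:8000"],
    "vendor_b": ["10.1.0.1:8000"],
})
first_round = [rotator.next_proxy() for _ in range(3)]
rotator.mark_blocked("10.1.0.1:8000")
second_round = [rotator.next_proxy() for _ in range(2)]
print(first_round, second_round)
```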
5. Queue Management
When you are scraping data on a small scale, you can afford to make requests in a loop. You can make ten requests a minute and still get all the data you need in a matter of hours. You don’t have this luxury at the scale of millions of products a day.
The crawling and parsing parts of your scrapers need to be separated and run as multiple tasks. If anything happens to one part of the scraper, that part can be re-executed in isolation. You need an efficient queue management system like Redis or Amazon SQS to do this properly. The biggest use case is retrying failed requests.
You also need to process the crawled URLs in parallel to speed up the data extraction process. If you are using Python, try the multiprocessing module, which offers an interface similar to the threading module, to speed up execution.
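As a small sketch of parallel processing, the example below uses `ThreadPoolExecutor` from the standard library's `concurrent.futures` module. Threads are usually a reasonable fit for network-bound fetching, while `ProcessPoolExecutor` or `multiprocessing` tends to suit CPU-bound parsing. The `process_url` worker is a stand-in that does no real network I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def process_url(url):
    """Stand-in worker; a real one would fetch and parse the page."""
    return url, len(url)


def process_in_parallel(urls, worker, max_workers=8):
    """Run `worker` over all URLs concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, urls))


urls = [f"https://example.com/product/{i}" for i in range(20)]
results = process_in_parallel(urls, process_url)
print(results[:2])
```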
Data Quality challenges
The business team, which consumes the data, is the one most concerned about its quality. Bad data makes their job difficult. The data extraction team often overlooks data quality until a major problem occurs. You need very tight data quality protocols in place from the beginning of the project if you’re using this data in a live product or for a customer.
Pro tip: If you are working on a POC for a consulting project where product data is the key, data quality can be the difference between your POC being accepted or rejected. Most of your competitors will ignore data quality and do a poor job on the proposal and POC. At Datahut, we work with our customers to prepare a proposal in which data quality guidelines and a framework are specifically mentioned. The business team that decides on the POC loves it.
Get in touch with us if you’re working on a POC that requires web data as one key component.
Records that do not meet the quality guidelines affect the overall integrity of the data. Making sure that the data meets the quality guidelines while crawling is difficult because the checks need to be performed in real time. Faulty data can cause serious problems if you are using it to make business decisions.
Do you want us to write about how to build a data quality guideline and framework? Do let us know in the comments.
Here are the errors that are common in scraped product data from e-commerce websites:
1. Duplicates
When collecting and consolidating data, duplicates can pop up, depending on the scraper logic and on how nicely Amazon plays. This is a headache for data analysts. You need to find and remove them.
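A minimal sketch of duplicate removal is shown below. It treats two records as duplicates when their product URLs are the same after dropping the query string, the fragment, and any trailing slash. That canonicalization rule is an assumption and will not hold for sites that identify products through query parameters. The function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Normalize a URL so tracking parameters do not hide duplicates."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path.rstrip("/"), "", "")
    )


def deduplicate(records, key="product_url"):
    """Keep the first record seen for each canonical product URL."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = canonical_url(record[key])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


records = [
    {"product_url": "https://example.com/p/1?ref=home", "price": 499.0},
    {"product_url": "https://Example.com/p/1/", "price": 499.0},
    {"product_url": "https://example.com/p/2", "price": 129.0},
]
unique = deduplicate(records)
print(len(unique))
```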
2. Data Validation Errors
Suppose the field you are scraping should be an integer, but the scraped value turns out to be text. Errors of this kind are called data validation errors. You need to build rule-based test frameworks to detect and flag them. At Datahut, we define the data types and other properties of every data item. Our data validation tools flag any inconsistencies to the project’s QA team. All flagged items are manually checked and reprocessed.
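A rule-based check of this kind can be sketched as a table of per-field rules applied to every record. The example below is an illustration, not Datahut's validation tooling. The `RULES` table, the field names, and the 0 to 5 rating range are assumptions.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Hypothetical rules: each field maps to a predicate its value must satisfy.
RULES = {
    "product_url": lambda v: isinstance(v, str) and v.startswith(("http://", "https://")),
    "product_name": lambda v: isinstance(v, str) and v.strip() != "",
    "price": lambda v: _is_number(v) and v >= 0,
    "average_rating": lambda v: _is_number(v) and 0 <= v <= 5,
    "in_stock": lambda v: isinstance(v, bool),
}


def validate(record):
    """Return a list of human-readable problems; empty means the record passed."""
    errors = []
    for field, rule in RULES.items():
        if field not in record:
            errors.append(f"{field}: missing")
        elif not rule(record[field]):
            errors.append(f"{field}: invalid value {record[field]!r}")
    return errors


good = {
    "product_url": "https://example.com/p/1",
    "product_name": "Example Phone",
    "price": 499.0,
    "average_rating": 4.5,
    "in_stock": True,
}
bad = dict(good, price="499.00")  # price scraped as text instead of a number
print(validate(good), validate(bad))
```

Records with a non-empty error list can be routed to a review queue instead of being delivered.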
3. Coverage errors
If you are scraping millions of products, there is a chance that you will miss many items. This can be due to request failures or improper design of the scraper logic. This is called item coverage inconsistency.
Sometimes the data you scraped might not contain all the necessary fields. This is what we call field coverage inconsistency. Your test framework should be able to identify both types of errors.
Coverage inconsistency is a major problem for self-service tools and data as a service powered by self-service tools.
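The two coverage checks described above can be sketched as follows. Item coverage compares the URLs you expected to scrape with the ones you actually got, and field coverage measures how often each required field is filled. The function names and the sample data are hypothetical.

```python
def item_coverage(expected_urls, scraped_records):
    """Return (share of expected items scraped, sorted list of missing URLs)."""
    expected = set(expected_urls)
    scraped = {record["product_url"] for record in scraped_records}
    missing = sorted(expected - scraped)
    if not expected:
        return 1.0, missing
    return (len(expected) - len(missing)) / len(expected), missing


def field_coverage(scraped_records, required_fields):
    """Return, per field, the share of records with a non-empty value."""
    total = len(scraped_records)
    coverage = {}
    for field in required_fields:
        filled = sum(
            1 for record in scraped_records
            if record.get(field) not in (None, "")
        )
        coverage[field] = filled / total if total else 0.0
    return coverage


expected = [f"https://example.com/p/{i}" for i in range(1, 5)]
scraped = [
    {"product_url": "https://example.com/p/1", "price": 499.0},
    {"product_url": "https://example.com/p/2", "price": None},
    {"product_url": "https://example.com/p/3", "price": 129.0},
]
ratio, missing = item_coverage(expected, scraped)
fields = field_coverage(scraped, ["product_url", "price"])
print(ratio, missing, fields)
```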
4. Product Errors
There are cases where multiple variants of the same product need to be scraped. In those cases, there might be data inconsistency across the variants. Unavailable data and data depicted in different ways both cause confusion.
E.g., measurements depicted in imperial units for one variant and in metric (SI) units for another, or currency variations.
E.g., in the case of a mobile phone, there can be variations in the RAM size, colour, price, etc.
Your QA framework needs to tackle this challenge as well.
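One way to handle unit variations across variants is to normalize every measurement to a single unit before comparing records. The sketch below does this for weights only. The `to_grams` helper and the set of supported units are assumptions for illustration, and currency conversion would additionally need an exchange-rate source.

```python
# Conversion factors to grams (the pound and ounce values are exact by definition).
GRAMS_PER_UNIT = {
    "g": 1.0,
    "kg": 1000.0,
    "oz": 28.349523125,
    "lb": 453.59237,
}


def to_grams(value, unit):
    """Convert a weight to grams; raise ValueError for unknown units."""
    try:
        factor = GRAMS_PER_UNIT[unit.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown weight unit: {unit!r}") from None
    return value * factor


print(to_grams(2, "kg"), to_grams(1, "LB"))
```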
5. Site Changes
Amazon and other large e-commerce websites change their patterns frequently. The change can be site-wide or limited to a few categories. Scrapers usually need adjustments every few weeks, as even a minor change in the structure affects the fields you scrape and might either give you incomplete data or crash the scraper.
There is a chance that the website changes its pattern while you are crawling it. If the scraper does not crash, the data scraped after the pattern change might be corrupted.
If you are building an in-house team, you need a pattern change detector to detect the change and stop the scraper. Once you have made the adjustments, you can resume scraping Amazon, saving a lot of money and computing resources.
Data management challenges
Managing large volumes of data comes with a lot of challenges. Even once you have the data, storing and using it brings a whole new level of technical and functional challenges. The amount of data you collect will only continue to increase, and without a proper foundation in place for using large amounts of data, organizations won’t be able to get the best value out of it.
1. Storing data
You need to store data in a database for processing. Your QA tools and other systems will fetch data from the database. Your database needs to be scalable and fault-tolerant. You also need a backup system so you can access the data if the primary storage fails in some way. There have even been reported cases of ransomware being used to hold data hostage. You need a backup of every record to handle both of these cases.
2. Understanding the need for a cloud-hosted platform
If data is a must for your company, a data extraction platform is also necessary. You can’t run scrapers from the terminal every time. Here are a few reasons why you should consider investing in building a platform early on.
3. Need data frequently
If you need data frequently and want to automate the scheduling, you need a platform with an integrated scheduler to run the scrapers. Having a visual user interface is even better, because even non-technical people can start a scraper with the click of a button.
4. Reliability is a must
Running scrapers on your local machine is not a good idea. You need a cloud-hosted platform for a reliable supply of data. Use the existing services of Amazon Web Services or Google Cloud Platform to build one.
5. Anti-scraping technologies
You need the ability to integrate tools to evade anti-scraping technologies, and the best way to do this is to connect their API to your cloud-based platform.
6. Data Sharing
Data sharing with your internal stakeholders can be automated if you integrate your data storage with Amazon S3, Azure Storage, or similar services. Most analytics and data preparation tools on the market have native Amazon S3 or Google Cloud Platform integrations.
7. DevOps
DevOps is where any application starts, and it used to be a hectic process. Not anymore. AWS, Google Cloud Platform, and similar services provide a set of flexible tools designed to help you build applications more rapidly and reliably. These services simplify DevOps: managing the data platform, deploying application and scraper code, and monitoring your application and infrastructure performance. It is best to choose one cloud platform and use its services according to your preferences.
8. Change Management
Depending on the way your business team uses the scraped data, there will always be changes. These could be changes in the structure of the data, in the refresh frequency, or something else. Managing these changes is very process-driven. Based on our experience, the best way to manage change is to get two basic things right.
Use a single point of contact - There might be ten people in your team. However, there should only be one person you should contact for a change request. This person will delegate the tasks and get them done.
Use a ticketing tool - We found the best way to deal with change management is to use a ticketing tool internally. If there needs to be a change – open a new ticket, work with stakeholders and close it.
9. Team Management
Managing a process-driven team for a large scale web scraping project is very hard. However, here is a basic idea of how the team should be split to handle a web scraping project.
10. Team structure
You need the following types of people to manage each part of the data extraction process.
Data scraping specialists: These people write and maintain the web scrapers. For a large-scale web scraping project with 20 websites, you need 2-3 people dedicated to this.
Platform engineer: You need a dedicated platform engineer to build the data extraction platform. They will also integrate the platform with other services.
Anti-scraping solution specialist: You need someone with a research mindset to find solutions to anti-scraping problems. They will also find new tools and services and evaluate whether they work against anti-scraping.
QA engineer: The QA engineer is responsible for building the QA framework and ensuring data quality.
Team lead: The team lead should be a person with both technical and functional knowledge, and a good communicator who understands what it takes to deliver clean data.
11. Conflict Resolution
Building a team is hard; managing it is even harder. We’re big advocates of Jeff Bezos’s “disagree and commit” philosophy. This philosophy breaks down into a few simple ideas that come in handy when you are building an in-house team.
People have different approaches to solving a problem. Assume a case where one person on your team wants to go with solution A, and another wants solution B. Both solutions look logical, and each has its own merits and demerits. Both team members will fight for their choices. This is always a headache for leaders. The last thing you need is ego and politics in your team.
Here are a few things that make the situation tricky:
You can’t go with both A and B at the same time; you have to choose one.
Choosing one solution over the other will make someone unhappy.
How do you get the unhappy person on board and delivering their best work?
To find a solution to this problem, you should start preparing for it even before you reach this situation. When building a team, you should clearly describe the following steps to your team members.
The first step is to make people understand why it is essential to put the interests of the company ahead of the individual, while the individual still has a role in the execution. I usually give the example of how special forces function and how this priority of interests is the most crucial part of their mission success.
The second step is to help team members understand how the decision will be made in a tie-break situation. In my case, I ask them to bring data to prove their points and compare them side by side. I then make a decision based on my judgment.
The third step is to make them understand the importance of trusting the judgment of the leader. Even if they don’t agree with it, they must commit to the decision and deliver their best work to make the idea successful. Leaders make the most mistakes – because that is how they learn things.
The fourth step is to make people understand the importance of having parameters and rules to adhere to and why it is essential to get creative but within those parameters.
Compartmentalization for improved efficiency
It is essential to compartmentalize the data team and the business team. If a team member is involved in both (other than the executive or project manager), the project is destined to fail. Let the data team do what it does best, and let the business team do the same.
Need a free consultation? Get in touch with us today.