Cost control for web scraping projects
Updated: Feb 5, 2021
With COVID-19 hitting businesses hard, companies are looking for ways to cut costs wherever possible. Some businesses spend heavily on acquiring web data for their operations, and controlling the cost of web scraping projects can help them considerably.
This blog is aimed at readers who spend at least $5000 per month on web data extraction, since that is the scale at which the savings become significant. However, the ideas can be used by anyone; after all, a dollar saved is a dollar earned.
As a professional web data extraction company, we’ve been perfecting cost optimization techniques for a very long time. Through this blog, we aim to highlight the techniques that can be useful for cutting the cost of your data extraction setup.
A word of caution. It would be best if you did not implement every tip in the blog at once. Take it slowly, in steps.
Optimization of workflow
From writing the first line of code to delivering ready-to-use data to the production server, many stages are involved. Optimizing this workflow can improve the efficiency of the process, improve quality, and reduce both errors and the costs incurred for data extraction. For a large-scale data extraction project, there is always scope for improving the workflow.
One of the main reasons professional web scraping companies do this better than an in-house team is their optimized workflow and automation. The best way to detect inefficiencies is to implement logging at every stage of your workflow and use anomaly-detection-based alerts so your programmers can identify a defect and fix it.
Here are some typical stages where you can optimize the cost using automation:
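As a rough illustration of stage-level logging with an anomaly alert, here is a minimal Python sketch. It assumes the metric you track is the number of records each stage produces per run, and it flags a run whose count deviates strongly from recent history. The function names, the logger name, and the three-standard-deviation threshold are all hypothetical choices for this example, not a prescribed setup.

```python
import logging
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scrape_pipeline")  # hypothetical logger name


def is_anomalous(history, value, threshold=3.0):
    """Return True if value is more than `threshold` standard deviations
    away from the mean of the previous runs in `history`."""
    if len(history) < 2:
        return False  # not enough past runs to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


def log_stage(stage, record_count, history):
    """Log the record count for a stage and warn if it looks abnormal.

    Returns True when an alert was raised. Normal runs are appended to
    `history` so the baseline keeps tracking recent behaviour."""
    log.info("stage=%s records=%d", stage, record_count)
    if is_anomalous(history, record_count):
        log.warning("ALERT stage=%s records=%d deviates from recent runs",
                    stage, record_count)
        return True
    history.append(record_count)
    return False
```

In practice the warning would be routed to whatever alerting channel your team already uses; the z-score test is only one simple way to define "anomalous".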
QA: You can use existing QA tools or build rule-based programs to check for common errors and inconsistencies in the data. This can eliminate much of the time spent manually checking the data (remember, time spent is money), and the rules can be improved over time.
Pattern change detection: Implementing a pattern change detector is a huge cost optimizing factor if you’re scraping data from a lot of websites. We will discuss this in detail below.
Automation of cloud resources: If you’re using a public cloud to deploy your scrapers, chances are it offers a lot of cloud automation tools. Most people won’t use them, but you should. Proper usage of cloud automation tools can save your DevOps team a lot of time. The savings are huge.
There are other areas as well, but these are the main ones. Start with these.
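To make the rule-based QA idea from the list above concrete, here is a small Python sketch. It assumes each scraped record is a dictionary with url, title, and price fields; those field names and the specific rules are hypothetical and would be replaced by whatever your own data requires.

```python
def check_record(record, required_fields=("url", "title", "price")):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Rule 1: required fields must be present and non-empty.
    for field in required_fields:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing {field}")
    # Rule 2: price, when present, must be a non-negative number.
    price = record.get("price")
    if price not in (None, ""):
        try:
            if float(price) < 0:
                problems.append("negative price")
        except (TypeError, ValueError):
            problems.append("price is not a number")
    return problems


def check_batch(records):
    """Run the per-record rules plus a duplicate-URL rule over a batch.

    Returns a dict mapping record index to its list of problems;
    clean records are left out of the report."""
    report = {}
    seen_urls = set()
    for index, record in enumerate(records):
        problems = check_record(record)
        url = record.get("url")
        if url and url in seen_urls:
            problems.append("duplicate url")
        seen_urls.add(url)
        if problems:
            report[index] = problems
    return report
```

New rules can be added as further checks inside check_record, which is how a checker like this tends to improve over time.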
Choice of Technology
Four primary cloud resources contribute to your bill: CPU, memory, storage, and network. Depending on how the systems you’re using are built, the cloud bill can increase a lot. Assuming you have an in-house data extraction team, your choice of technology has a big impact on the cloud bill. There are three common ways people get data.
1) Auto scraping/self-service tools in a SaaS model.
Many organizations use self-service tools to get data from the internet. These tools charge you for the number of API calls you make or the number of records you extract. During a crisis like COVID, you should revisit your priorities. Ask yourself these questions.
Do I need data from all the sites that I used to scrape?
Can I reduce the frequency of data extraction?
Can I remove a few non-performing categories from the scraping activity?
You should audit your usage and prioritize which data sets are essential and how frequently you need them, to minimize the bill.
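A quick back-of-the-envelope calculation shows how much the three questions above can move a per-record bill. The sketch below is in Python, and every number in it (site counts, run frequency, records per run, and the price per 1000 records) is invented purely for illustration; plug in your own vendor's pricing.

```python
def monthly_cost(sites, runs_per_month, records_per_run, price_per_1000_records):
    """Estimate a monthly bill under a simple per-record pricing model."""
    records = sites * runs_per_month * records_per_run
    return records * price_per_1000_records / 1000


# Hypothetical baseline: 50 sites scraped daily.
baseline = monthly_cost(sites=50, runs_per_month=30,
                        records_per_run=10_000, price_per_1000_records=0.5)

# Hypothetical reduced plan: drop 10 sites and scrape weekly instead.
reduced = monthly_cost(sites=40, runs_per_month=4,
                       records_per_run=10_000, price_per_1000_records=0.5)

print(f"baseline: ${baseline:,.2f}  reduced: ${reduced:,.2f}")
```

Under these made-up numbers, frequency is the biggest lever, which is why it is worth asking how fresh each data set really needs to be.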
2) Auto scraping/self-service tools deployed on your public cloud.
There are proprietary and open source self-service tools you can deploy on your public cloud. If it is a proprietary tool – you’ll be paying an annual licensing fee. Most people don’t really care about the cloud bill in the initial stages. As you scale your data extraction activity – you’ll see a surge in the cloud bills. This could be due to poor architecture of the auto scraping tool or incorrect configurations.
If you’re using an auto scraping tool deployed on your cloud, do make sure you check if the configurations are correct.
3) You choose an open-source technology and code yourself.
There is a chance that you’re using open-source technology to build web scrapers. If you follow poor programming practices, such as writing scraper code that is not memory efficient, you’ll see a spike in the usage of cloud resources. You have to be careful when choosing the libraries and functions you use.
What we suggest is to audit the coding practices of your developers. Identify common flaws and put a strict protocol in place so that they write code in a way that doesn’t eat up all the cloud resources.
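One common example of a memory-inefficient practice is collecting every extracted record in a list before writing anything out. The Python sketch below shows the alternative: a generator yields one record at a time and each record is written immediately, so memory use stays roughly flat as the input grows. The "name,price" line format and the function names are assumptions made only for this example.

```python
import csv
import io


def parse_rows(lines):
    """Yield one parsed record at a time instead of building a full list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, price = line.split(",", 1)
        yield {"name": name, "price": float(price)}


def write_records(records, out_file):
    """Stream records to CSV as they arrive and return how many were written."""
    writer = csv.DictWriter(out_file, fieldnames=["name", "price"])
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count


# Usage: an in-memory buffer stands in for a real output file here.
buffer = io.StringIO()
written = write_records(parse_rows(["widget,9.99\n", "gadget,4.50\n"]), buffer)
```

The same pattern applies when the input is a stream of pages rather than lines: process and persist each item before fetching the next.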
Cost-effective anti-scraping evasion techniques
Websites block bots using anti-scraping technologies. The spectrum of anti-scraping technologies ranges from simple AJAX-based blocking to CAPTCHAs.
Getting around anti-scraping requires third-party services like proxies, IP rotators, etc., depending on the type of blocking.
Evading anti-scraping technologies can be costly if you are not choosing the right method and vendor.
Public Cloud vs. Dedicated server
Some companies are still using dedicated servers for web scraping, which is a complete waste of money. Their primary argument is that dedicated servers ensure better performance, but considering other tradeoffs – it is not worth it. If you’re using a dedicated server – you should switch to a public cloud for the following reasons.
Huge money savings
Considering the overhead and the amount you pay for physical machines – the money savings are huge for a public cloud.
When you want to scale to millions of records per day, the dedicated servers just don’t scale. However – the public cloud will scale, and it can give almost the same performance as a dedicated server with correct configurations.
Automation & integration
Most public cloud providers give you the tools to automate the DevOps and improve the workflow. This helps you save a ton of time and increase the development speed. They also have native integrations with most tools you use which you don’t get with a dedicated server.
Pattern Change Detector
Scraper code that worked yesterday returns faulty data when deployed today: this is a common problem data mining engineers face. Websites change their patterns, and depending on the change, the scraper might crash or give faulty data. Having a pattern change detector can help you solve this problem. Run the pattern change detector before running the scraper to make sure that the existing scraper code is compatible with the current website structure.
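One simple way to build such a detector is to record which tag and CSS class combinations your scraper depends on, then check that a freshly fetched page still contains them. The Python sketch below does this with the standard library's HTML parser. The marker list, class names, and sample HTML are hypothetical, and a production detector would likely compare more than class names.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every (tag, css_class) pair that appears in a page."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for css_class in value.split():
                    self.seen.add((tag, css_class))


def missing_markers(html, expected):
    """Return the expected (tag, css_class) pairs not found in the page.

    An empty list means the structure the scraper relies on is still there."""
    collector = ClassCollector()
    collector.feed(html)
    return sorted(set(expected) - collector.seen)


# Hypothetical markers that a product scraper depends on.
EXPECTED = [("div", "product"), ("span", "price")]
```

If missing_markers returns anything for a sample page, skip the scrape and alert a developer instead of paying to collect faulty data.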
Resume, don’t restart
When a web scraper crashes or stops halfway, the natural response is to restart it. However, when you have millions of records to extract, this is not the best approach. A better way is to implement a logging system that records what you’ve extracted and resume from where you stopped. This can save you a ton of money.
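A minimal version of this idea is a checkpoint file that gets one line per successfully processed URL. On startup the scraper reads the file and skips anything already listed. The Python sketch below assumes URLs are a suitable unit of work and that `scrape` is your own function; both names and the file layout are illustrative.

```python
import os


def load_done(checkpoint_path):
    """Return the set of URLs already recorded in the checkpoint file."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def run(urls, checkpoint_path, scrape):
    """Scrape only the URLs not yet in the checkpoint file.

    Each URL is appended to the checkpoint right after it succeeds, so a
    crash loses at most the item that was in progress. Returns the list
    of URLs processed during this call."""
    done = load_done(checkpoint_path)
    processed = []
    with open(checkpoint_path, "a", encoding="utf-8") as checkpoint:
        for url in urls:
            if url in done:
                continue  # extracted in an earlier run
            scrape(url)
            checkpoint.write(url + "\n")
            checkpoint.flush()  # push the entry to the OS before moving on
            processed.append(url)
    return processed
```

Calling run again with the same arguments after a crash picks up at the first URL that was never recorded.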
Outsourcing to a Data as a Service provider
Outsourcing the data extraction to another vendor is the last method you should try. If you can’t work things out internally, outsourcing to a vendor like Datahut or another data-as-a-service provider would be the best way forward.