Web scraping is often the first program aspiring programmers write to get familiar with using libraries. I certainly did that - I wrote a simple web scraper using Beautiful Soup and Python.
When you're working on a scraping project for a business use case, you need to follow best practices. These best practices cover both programmatic and non-programmatic aspects, and following them also helps you stay on an ethical track.
We've listed out the following web scraping best practices you must follow:
1. Check out if an API is available
What is an API?
An API, or Application Programming Interface, helps you get the data you need via a simple computer program while hiding the complexity from data consumers.
If an API is available, you pass a search query into the API, and it will return the data as a response. You can take this data and use it. Let us consider three possible cases.
a) API is available, and the data attributes are sufficient: Use the API Service to extract the data.
b) API is available, but the data attributes are insufficient for the use case: You need to use web scraping to get the data.
c) API is not available: Scraping is the only way to gather the information you need.
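If an API is available (case a), fetching the data is usually just an HTTP call that returns structured data. Here is a minimal sketch using the Python requests library; the endpoint URL, query parameters, API key, and response shape are hypothetical placeholders, so adapt them to the API you are actually working with.

# Minimal sketch of pulling data from a (hypothetical) JSON API with the requests library.
# The endpoint URL, query parameters, and API key below are placeholders, not a real service.
import requests

API_URL = "https://api.example.com/v1/products"    # hypothetical endpoint
params = {"q": "laptops", "page": 1}               # hypothetical search query
headers = {"Authorization": "Bearer YOUR_API_KEY"} # replace with your own key

response = requests.get(API_URL, params=params, headers=headers, timeout=30)
response.raise_for_status()           # fail fast on HTTP errors
data = response.json()                # structured data, no HTML parsing needed

for item in data.get("results", []):  # hypothetical response shape
    print(item.get("name"), item.get("price"))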
2. Be gentle
Every time you make a request, the target website has to use its server resources to send you a response. Therefore, keep the volume and frequency of your queries low so that you don't disrupt the website's servers. Hitting the servers too often degrades the experience for the website's real users.
There are a few ways to handle this:
If possible, scrape during off-peak hours, when the server load is lower than at peak hours.
Limit the number of parallel / concurrent requests to the target website.
Spread the requests across multiple IPs.
Add delays between successive requests.
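As a rough sketch of what "being gentle" can look like in code, the snippet below caps concurrency with a small thread pool and adds a randomized delay before each request. The URL list, worker count, and delay range are illustrative assumptions, not recommendations for any particular site.

# Sketch: limit concurrency and add randomized delays between requests.
# The URL list, worker count, and delay range are illustrative assumptions.
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

urls = ["https://example.com/page/%d" % i for i in range(1, 11)]

def fetch(url):
    time.sleep(random.uniform(1, 3))      # polite delay before each request
    return requests.get(url, timeout=30)

# Cap parallel requests at 2 instead of hammering the server with dozens at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    responses = list(pool.map(fetch, urls))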
3. Respect robots.txt
The robots.txt file is a text file website administrators create to instruct web crawlers on how to crawl pages on their website. It contains the rules for acceptable behavior, such as which web pages can and can't be scraped, which user agents are not allowed, and how fast and how frequently you can crawl.
If you're attempting web scraping, it is a good idea to look at the robots.txt file first. It sits in the website's root directory. I'd also recommend reading the website's terms of service.
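Python's standard library ships a robots.txt parser, so checking whether a URL may be fetched takes only a few lines. A minimal sketch, assuming the site exposes its rules at /robots.txt and using a hypothetical user agent name:

# Sketch: check robots.txt before scraping, using the standard library parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # robots.txt lives in the root directory
rp.read()

user_agent = "MyScraperBot"                   # hypothetical user agent for your scraper
if rp.can_fetch(user_agent, "https://example.com/products/"):
    print("Allowed to fetch this page")
else:
    print("Disallowed by robots.txt - skip this page")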
4. Don't follow the same crawling pattern.
Even though both human users and bots consume data from a web page, there are some inherent differences.
Real humans are slow and unpredictable, while bots are fast and predictable. Anti-scraping technologies on websites use this fact to detect and block web scrapers. So it is a good idea to incorporate some random actions that confuse the anti-scraping technology.
We once explained this to a customer, and he said, "So you are making a scraper look like a drunken monkey to get around anti-scraping mechanisms."
I couldn't have put it better. :)
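One simple way to break a fixed pattern, sketched below under the assumption that you already have a list of target URLs, is to shuffle the crawl order and vary the pause between requests.

# Sketch: break up a predictable crawling pattern by shuffling the URL order
# and varying the delay between requests. URLs and delay range are illustrative.
import random
import time

import requests

urls = [f"https://example.com/item/{i}" for i in range(1, 21)]
random.shuffle(urls)                 # don't visit pages in a fixed, sequential order

for url in urls:
    requests.get(url, timeout=30)
    time.sleep(random.uniform(2, 6)) # irregular pauses look less bot-like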
5. Route your requests through Proxies
When your request hits the target website's server, they know about it and log it. The website keeps a record of every activity you perform on it, and websites typically have an acceptable threshold for the rate of requests they will take from a single IP address. Once your request rate crosses this threshold, the website will block the IP.
The best way to get around this problem is to route your requests through a proxy network and rotate the IPs frequently. You can get free but not-so-reliable IPs for experimental hobby projects, but for a serious business use case you need a smart and reliable proxy network.
There are several methods that can be used to change your outgoing IP.
a) VPN
A VPN replaces your original IP address with a new one and conceals your real IP, which also helps you access location-restricted content. VPNs were not really created for large-scale business scraping but for ensuring anonymity for an individual user. However, for a small-scale use case, a VPN is sufficient.
b) TOR
TOR, or The Onion Router, routes your outgoing traffic through a free worldwide volunteer network with several thousand relays, which conceals your original location. However, TOR is very slow, which affects the speed of the scraping process, and putting extra load on the TOR network is arguably not ethical either. I would not recommend TOR for large-scale web scraping.
c) Proxy services
Proxy services are IP-masking systems developed with business users in mind. They usually have a large pool of IP addresses to route your requests through, which makes them better in terms of scale and reliability.
Depending on your use case and budget, you can choose from shared proxies, datacenter proxies, or residential proxies. Residential IPs are the most effective at sending anonymous requests, but they are expensive and usually reserved as a last resort.
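With the requests library, routing traffic through a proxy is a matter of passing a proxies mapping, and rotating IPs just means picking a different entry per request. The proxy addresses below are placeholders for whatever provider you use.

# Sketch: rotate requests through a pool of proxies with the requests library.
# The proxy addresses are placeholders; plug in your provider's endpoints and credentials.
import random

import requests

proxy_pool = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch_via_proxy(url):
    proxy = random.choice(proxy_pool)                      # pick a different IP per request
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, proxies=proxies, timeout=30)

response = fetch_via_proxy("https://example.com/products/")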
6. Rotate User Agents
User-Agents
When your browser connects to a website, it identifies itself through the user agent, essentially telling the server, "Hi, I'm Mozilla Firefox on macOS" or "Hi, I'm Chrome on an iPhone."
Here is the common format of a user agent string
User-Agent: Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>
Here is an example of a real user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Assume you are making a simple request using the Python requests library. If you don't set a user agent on your request, the target website will detect that you are not a real user and block you from accessing the content. Using and rotating common user agents between successive requests is one of the best practices of web scraping. User-agent lists are publicly available online, and rotating through them lets you scrape the data without getting blocked while making your traffic look like it comes from a real browser.
User-agent rotation is often overlooked by people who do web scraping.
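Here is a minimal sketch of user-agent rotation with the requests library. The small pool below is illustrative (the first string is the example above); a real project would maintain a larger, regularly updated list.

# Sketch: rotate user agents between requests. The small pool below is illustrative;
# in practice you would maintain a larger, up-to-date list of common user agents.
import random

import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def fetch(url):
    headers = {"User-Agent": random.choice(user_agents)}  # different UA per request
    return requests.get(url, headers=headers, timeout=30)

response = fetch("https://example.com/products/")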
7. Request Headers
When you hit the target website with your request, you don't just say, "give me the information now." You need to provide the context of your request so that the server can send you a tailored response. Request headers provide that context to an HTTP request.
There are five request headers every programmer who uses an automation library for web scraping needs to know:
HTTP header User-Agent - specifies which user agent is being used.
HTTP header Accept-Language - specifies which languages the client understands.
HTTP header Accept-Encoding - tells the target website's server which compression algorithms the client can handle in the response.
HTTP header Accept - specifies which data formats the client will accept in the response.
HTTP header Referer - specifies the URL of the referring page, i.e., the previous web page's address, before the request is sent to the server. In many cases, using Google as the referring domain is a clever way to make your scraper's traffic look more natural.
The best practice is to inspect requests and responses with a tool like Postman and add the request headers needed to optimize your scraper. Such tuning can help beat intelligent anti-crawling mechanisms by giving the target website the impression that the traffic is coming from a real browser.
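Putting the headers above together, a request carrying browser-like context might look like the sketch below. The exact values are assumptions you would tune after inspecting what a real browser actually sends to your target site.

# Sketch: send browser-like request headers. The header values are assumptions
# to adapt after inspecting real browser traffic to your target site.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.google.com/",
}

response = requests.get("https://example.com/products/", headers=headers, timeout=30)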
8. Cache to avoid unnecessary requests
If there is a way to know which pages your web scraper has already visited, the time to complete a scrape can be reduced. That is where caching comes into play. It is a good idea to cache HTTP requests and responses: for a one-off scrape you can simply write them to a file, and if you need to perform the scrape repeatedly, write them to a database. Caching pages helps you avoid making unnecessary requests.
Another source of unnecessary requests is loose scraper logic when dealing with pagination. Spend time finding efficient combinations of parameters that give you maximum coverage instead of brute-forcing every possible combination. Always optimize the scraper logic to avoid making unnecessary requests.
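A very simple way to cache pages is to keep the downloaded HTML on disk keyed by URL and only hit the network on a cache miss. The sketch below uses a local folder and hashed filenames; it is a minimal illustration, not a full-featured HTTP cache.

# Sketch: a minimal file-based cache so repeated runs don't re-download the same pages.
# The cache directory and hashing scheme are illustrative choices.
import hashlib
from pathlib import Path

import requests

CACHE_DIR = Path("page_cache")
CACHE_DIR.mkdir(exist_ok=True)

def fetch_cached(url):
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.html"
    if cache_file.exists():                   # cache hit: no request sent
        return cache_file.read_text(encoding="utf-8")
    html = requests.get(url, timeout=30).text # cache miss: fetch and store
    cache_file.write_text(html, encoding="utf-8")
    return html

html = fetch_cached("https://example.com/products/")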
9. Beware of Honeypot Traps
Honeypot traps, or honeypot links, are links placed on a website by its designers to detect web scrapers. These are links that a human using a browser can't see, but a web scraper can. So if a honeypot link is accessed, the server can confirm the visitor is not a real human and start blocking its IPs or lead the scraper on a wild goose chase that drains its resources.
While learning web scraping with the Python requests library, I once ran into a honeypot trap on a website. It took me a lot of time to figure out what was wrong.
Honeypot links are usually hidden from users with CSS, for example by setting display to none, visibility to hidden, or giving the link the same color as the background. You can check for these styles to tell whether a link is a honeypot.
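Here is a rough sketch of filtering out links that are hidden with inline CSS before following them, using Beautiful Soup. Real honeypots can be hidden in many other ways (external stylesheets, off-screen positioning), so treat this as a first-pass check only.

# Sketch: skip links hidden with inline CSS, a common honeypot pattern.
# This only catches inline styles; hiding via external CSS needs a rendered-DOM check.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products/", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

visible_links = []
for a in soup.find_all("a", href=True):
    style = (a.get("style") or "").replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        continue                    # likely a honeypot link - do not follow it
    visible_links.append(a["href"])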
10. Use Captcha Solving Services
CAPTCHAs are a common method companies use to block web scraping: websites ask visitors to solve various puzzles to confirm they're legitimate users. Advanced scraping operations require CAPTCHA-solving services to get around them.
11. Scrape data at off-peak hours
During peak hours, the server load on the target website is also at its peak, so scraping during those hours might result in a bad user experience for the website's actual users. A great way to handle this is to schedule your scrapes for off-peak hours, using a tool like cron.
12. Use a headless browser.
It is easy for web servers to identify whether a request is coming from a real browser, and that can get your IPs blocked.
A headless browser, as the name suggests, is a browser without a GUI. There are cases where you need browser automation to scrape data, for example when content is rendered with JavaScript, and headless browsers come with built-in browser tools that help solve a lot of JavaScript-related problems. There are many browser automation libraries, such as Selenium, Puppeteer, Playwright, PhantomJS, and CasperJS.
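For pages that need JavaScript rendering, a headless browser driven by Selenium is a common choice. A minimal sketch, assuming Chrome and a compatible ChromeDriver are installed locally and using a placeholder URL:

# Sketch: fetch a JavaScript-rendered page with headless Chrome via Selenium.
# Assumes Chrome and a compatible ChromeDriver are installed locally.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")    # run the browser without a GUI

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/products/")
    html = driver.page_source         # HTML after JavaScript has run
finally:
    driver.quit()                     # always release the browser process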
13. The legal issues you should be looking at
The purpose of compliance is to protect your business from unwanted lawsuits, claims, fines, penalties, negative PR, and investigations. Compliance also ensures that organizations do not overuse scraping and do not misuse the data they acquire. Before attempting web scraping, every programmer should look at the possible compliance issues; everything from sending anonymous requests to performing advanced scraping operations can be affected.
a) Is the data behind a login?
If the data is behind a login, scraping it without permission from the target website is illegal. It can result in your account being suspended or canceled, or expose you to a lawsuit. If you do have permission, you'll need to use some advanced methods and tools to get the data.
b) Does it violate copyright?
On some websites, the content is copyrighted; music and videos are good examples. If you scrape such data and use it, the owner of the content can file a copyright infringement suit. Copyright law violations are serious, and you could be liable for a heavy fine.
c) Does it violate Trespass to chattel law?
If you're not gentle with your scraping and fire off a ton of parallel requests, there is a chance you turn the scraping activity into something resembling a DDoS attack. If data scraping overloads the server, you could be held responsible for the damage and prosecuted under the "trespass to chattels" law (Dryer and Stockton 2013).
d) Does it violate GDPR?
GDPR puts the brakes on scraping activities when the data involved is PII, or personally identifiable information. You therefore have to audit your scraper logic to filter out personally identifiable information.
Final thoughts.
When you are a business user trying to extract data, following best practices can save you time, money, and resources, and keep you away from nasty lawsuits. So be a good guy and follow the best practices.
From copyright issues to trespass to chattels, you need to keep monitoring your web scraping activity at all times. An automation library can help scrape the data, but it will not save you from legal issues. From reading the terms of service to selecting the scraping method, you have to be very vigilant.
If you don't want to worry about these issues and get the data - leave it to us. We will get the data for you. Contact Datahut to learn how.