Web scraping is both loved and hated. It is a boon for some: consumers love price comparison services that save them money on purchases, and market researchers get to gauge sentiment on social media and build better products.
However, “bad bots” carry out harmful activities such as online fraud, data theft, theft of intellectual property, and unauthorized vulnerability scans. These bots take control away from a website’s owner.
So the big question is: Is web scraping legal or illegal? Web scraping and crawling aren’t illegal by themselves, provided you comply with the applicable laws.
Startups and big organizations love using web scrapers because it’s the best (and cheapest) way to get competitive data without partnering with the organizations that hold it. Most companies engage in data scraping to track competitor trends, conduct market research, and run exploratory analytics on the data. The intention is to discover missed revenue opportunities and gain financially.
Web scraping is an automated way of gathering data from websites. How does a retailer price its products competitively in an age where e-commerce giants like Amazon dominate the online marketplace? Small retailers need to extract product data regularly. They can do it manually, but it is time-consuming, and by the time they are done gathering the data, it is already obsolete. Web scraping solves this problem efficiently.
Web scraping compliance is always a headache for companies. When a company wants to engage in scraping, it wants to make sure that the activity is within the bounds of the law. There have been many court battles over web scraping, and it is essential to assess and ensure the legality of your scraping activity.
In this blog, we’ve decided to consolidate the top questions we get from our customers and prospects:
Can you assess the legality of my web scraping use case?
Is web scraping legal in the U.S.?
Is it legal to scrape Google?
Is it legal to scrape Facebook?
Is web scraping legal in India?
Do you have references about the court cases on web scraping?
Is it legal to scrape data from a password-protected website?
Does web scraping infringe on copyright?
Can you scrape LinkedIn?
There is no single answer to the question “is web scraping legal.”
The correct answer is that legal compliance depends on many factors, and those factors can change depending on each country’s laws. For example, GDPR put the brakes on many data crawling activities, as collecting personal data without a lawful basis became illegal.
The importance of web scraping compliance
The purpose of compliance is to protect your business from lawsuits, claims, fines, penalties, negative PR, and investigations. Compliance also ensures that organizations do not overuse scraping or misuse the data they acquire.
If you’re not careful with personal data protection, the fines can be huge. One of the most significant fines was issued to Google by the French data protection authority, about $120 million, for dropping cookies on Google.fr without consent. Google automatically dropped tracking cookies when a user visited the domain, which resulted in a breach of the country’s Data Protection Act.
Even if you’re extracting public data, you could still land in trouble if you breach other established data extraction compliance principles. Compliance is not something to take lightly.
Fines imposed on data controllers
Copyright infringement is a serious violation of the law that you have to consider when engaging in web scraping projects. You could be scraping copyrighted works unknowingly, and if the website owner traces it back to you, you could be hit with a cease and desist letter. A civil or criminal lawsuit can follow.
A crawler can’t distinguish between copyrighted and free content. Before starting a web scraping project, you have to inspect the source website and check for copyrights manually. Copyright infringement has dire legal ramifications, yet organizations usually don’t spend much time checking the compliance of their scraping activities.
Terms of Service
Terms of service are the legal agreement between a website owner and a person who wants to browse that website (to access information or services). The person must agree to abide by the TOS to use the website.
People who are not in favor of web scraping often argue that a website owner can block web scraping or programmatic access by explicitly prohibiting it in the “terms of service.” However, there are counter-arguments that some courts agree with.
Terms of service that a visitor never explicitly accepts are often considered unenforceable. However, we encourage you to check what the law is in your country of business.
Computer Fraud and Abuse Act
The CFAA is a federal criminal law that prohibits accessing a computer without authorization. People who are not in favor of web scraping have used the CFAA as an argument to prevent it.
Last year, the U.S. Court of Appeals for the 9th Circuit ruled that scraping public websites does not violate the CFAA (Computer Fraud and Abuse Act). The court made it clear that a bot’s entry is not legally different from a browser’s entry. In both cases, the “user” requests public data.
The ruling came in hiQ Labs (a data analytics startup) v. LinkedIn (a Microsoft company), where the decision went in favor of hiQ Labs. LinkedIn had ordered hiQ Labs to stop scraping its data, and the startup fired back with a lawsuit. A U.S. district judge granted hiQ Labs a preliminary injunction that preserved its access to LinkedIn data. LinkedIn was instructed to remove the technical barriers that blocked hiQ Labs’ web scrapers.
Trespass to Chattels
Excessive crawl rates can harm the servers of the website being scraped. Federal courts have not set a legal limit on crawl rate. However, if data scraping overloads the server, the person responsible for the damage can be sued under the “trespass to chattels” doctrine (Dryer and Stockton 2013).
However, the damage needs to be material and easy to prove in court for the website owner to be eligible for financial compensation.
Companies crawling at huge rates usually use proxies or VPNs to distribute the crawling activity. It is tough for website owners to trace the scraping activity back to a company that uses anonymization techniques. Even if they trace it, proving it in court is a tough job.
Misappropriation of trade secrets
A recent decision from the U.S. Court of Appeals for the 11th Circuit ruled that scraping a public website can be deemed a misappropriation of trade secrets under certain conditions. However, the court found that web scraping in itself is not an improper means of getting data from a website.
Our observation is that the scraper in this case ran millions of queries, ignored crawl rate limits, and had a weak anonymization setup. The matter is still ongoing, and we have to see where it ends.
Non-public information / scraping behind a login
Sometimes people want to scrape non-public information from a website. At Datahut, we get a ton of requests to scrape Facebook and LinkedIn. Scraping non-public data is illegal unless you have permission from the website owner to scrape it. Scraping activity is easy to detect when the user is logged in, and it can bring you a lot of trouble, from account suspension to legal action.
Ask these questions to evaluate the legality of your web scraping project.
After analyzing court verdicts and observations from different cases relating to web scraping, we came up with a set of questions that need to be addressed to determine whether your web scraping project is legal. To learn more about the cases, scroll above.
Is the data you want to scrape behind a login, without permission from the website owner to access it?
Is web scraping or web crawling explicitly prohibited by the website owner?
Is the website’s data copyright protected?
Can the use of this data be interpreted as illegal?
Is the crawling rate (the requests per second) too high compared to the total number of records on the website? (If there are 100,000 records on the website and you are sending 1,000 requests per second, it is excessive.)
Can the scraping activity cause material damage to the website, leading to a claim under trespass to chattels?
Does the data obtained through web crawling in any way compromise the privacy of the individual?
Does the data collected via web scraping contain confidential information about the website?
Does the data contain pornography, especially child pornography? (having child pornography in the data set is a serious offense that can attract lawsuits)
A “yes” to any of these questions is a red flag, and you need to get proper legal advice about your web scraping project from a practicing lawyer.
Best practices for web scraping compliance
1. Use APIs for data extraction instead of scraping if the website allows that
APIs are interfaces that let users gather data programmatically, without clicking on links and repeatedly copying data. If you stay within the API’s terms of use, you can extract data directly without the compliance risks of scraping. However, scraping comes in handy when the website does not provide an API for data extraction or when its API cannot provide the data you require.
2. Limit the speed of web scraping
Ensure that you are not sending too many requests to the website in a short period and overburdening the servers that power it. Unusually high traffic or download rates, especially from a single client or IP address within a short period, or a pattern of repetitive tasks performed on the website, are easy to detect and considered unethical, and you could get sued under trespass to chattels.
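One simple way to limit speed is to enforce a minimum delay between consecutive requests. The snippet below is a minimal sketch of that idea, not a complete crawler: the class name RateLimiter and the parameter max_requests_per_second are our own illustrative choices, and the right rate for any given website is something you have to decide for that site.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive requests (illustrative sketch)."""

    def __init__(self, max_requests_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_requests_per_second <= 0:
            raise ValueError("max_requests_per_second must be positive")
        # Minimum number of seconds that must separate two requests.
        self.min_interval = 1.0 / max_requests_per_second
        # clock and sleep are injectable so the limiter can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        delay = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record when this request is released so the next call can measure from it.
        self._last_request = self._clock()
        return delay
```

To use it, create one limiter per target website, for example RateLimiter(max_requests_per_second=0.5) for one request every two seconds, and call wait() immediately before each request.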
3. Use anonymization techniques
Anonymization is the first line of defense you need to take if you’re doing web scraping for commercial purposes. From using residential proxies to route web scraping requests to changing the scraping pattern, there are a lot of things you can do. A professional web scraping company can help guide you through this process.
At Datahut, we built our internal platform for anonymous scraping so that it is hard for the website owner to trace it back to our customer.
4. Extract only what you need – not what you can from a source
Companies should only extract and store as much data as is required to accomplish their tasks. Companies often give in to the tendency of using web scraping to hoard large quantities of data from a website and capture as much as possible for future usage. In our observation, in most cases, the data sits in a data warehouse doing nothing.
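In practice, this can be as simple as keeping an explicit list of the fields a project needs and discarding everything else before the data is stored. The snippet below is a minimal sketch under that assumption; the field names and the function keep_required_fields are hypothetical examples, not part of any particular tool.

```python
# Hypothetical list of the only fields this project needs.
REQUIRED_FIELDS = ("name", "price", "currency")


def keep_required_fields(record, required_fields=REQUIRED_FIELDS):
    """Return a copy of a scraped record containing only the required fields.

    Fields missing from the record are skipped rather than filled in.
    """
    return {field: record[field] for field in required_fields if field in record}
```

Applied to each record right after extraction, a filter like this keeps anything outside the list, such as seller contact details or reviews, out of the data warehouse.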
5. Check for copyright infringement before starting the project
The content of some websites might be copyrighted. You need to check the content manually for copyrighted material before scraping. Usually, companies that do web scraping leave this to their technical team and don’t look in depth at copyright infringement and other violations. (It’s not the technical team’s job to ensure this.)
6. Extract public data only
Extracting personal data requires you to comply with the data protection laws of the jurisdiction where you’re scraping it. Therefore, it is highly advisable to scrape only public data and to recheck what you collect for personal information.
As a rule of thumb, go for only public data extraction. In case you need private data extracted, ensure that you receive proper permission from the source site. A typical example is retailers wanting to extract sales data from their partner websites, where the data usually sits behind a login, rendering it private. In such cases, when they request data extraction, we ask them to obtain permission from their partner websites and have a range of IPs whitelisted. If such permission is not obtained, the partner site’s default system settings may block or suspend the retailer’s account.
Famous legal battles related to web scraping compliance
1. eBay Vs. Bidder’s Edge
Bidder’s Edge was an “aggregator” of auction listings. It automatically collected data from various auction sites, including eBay. Bidder’s Edge users could easily search auction listings in one place without having to go through all the major auction websites.
eBay tried to block IPs from Bidder’s Edge to prevent scraping; however, Bidder’s Edge continued crawling eBay’s data by using proxy servers to evade eBay’s IP address blocks.
eBay then sued Bidder’s Edge in 2000 for scraping the eBay marketplace data. eBay argued that the trespass to chattels doctrine applied and that the activity of Bidder’s Edge was illegal.
eBay Vs. Bidder’s Edge was one of the first significant cases involving eCommerce data scraping.
2. Nguyen v. Barnes & Noble Inc.
In August 2011, Barnes & Noble had a discount sale of Hewlett-Packard Touchpads. Kevin Khoa Nguyen bought the Touchpads on the Barnes & Noble website and received an email confirmation of the purchase. The next day, Nguyen received an email from Barnes & Noble stating his order was canceled.
In April 2012, Nguyen filed a class-action lawsuit in California Superior Court against Barnes & Noble for “deceptive business practices” and “false advertising.”
If you are considering starting a web scraping project for your business and wish to assess its legality and compliance, don’t hesitate to reach out to us.