Thasni M A
- Jun 3, 2023

Automate Web Scraping Using ChatGPT: How to Scrape Amazon Using ChatGPT

Updated: Nov 11, 2023

Web scraping has become an essential tool for businesses and individuals who regularly need to gather data from multiple sources. Unfortunately, web scraping can be intimidating for beginners. However, LLM-based tools have made it much easier to learn. You can think of an LLM as an unpaid intern or a private tutor.

ChatGPT, one of the most popular LLM-based tools, is a valuable resource for beginners in web scraping, providing guidance and support as they navigate the process. With the help of ChatGPT, beginners can quickly and effectively scrape data from websites and gain insights that can inform their decision-making.

Beginners can ask ChatGPT questions about web scraping and receive helpful responses to guide them through the process. Experienced people can use it to get their job done faster. At Datahut, we use ChatGPT and GitHub Copilot to get our jobs done faster and more efficiently.

For example, beginners can ask ChatGPT how to scrape data from a specific website, what tools and technologies to use, and how to clean and analyze the data after web scraping.

ChatGPT can provide detailed and easy-to-understand explanations, making it easier for beginners to learn and apply web scraping techniques. This can help beginners build their knowledge and confidence in web scraping, leading to more accurate and efficient data acquisition.

In this post, we explore how to ask ChatGPT precise questions so that you can learn to write web scrapers quickly. As an example, we show how to scrape the Amazon site using ChatGPT.

Steps Involved in Web Scraping

Before writing any scraping code, let's look at the steps involved.

Identify the target website: The first step in the web scraping process is to identify the data source, which is the website in our case.
Choose a web scraping tool: Multiple web scraping libraries are available for developers. You must select a web scraping tool or library that suits your needs. Some popular web scraping tools include BeautifulSoup, Scrapy, Selenium, and Playwright. Here is a list of 33 web scraping tools.
Inspect the website: You need to understand how the data is shown on the website and check whether it is loaded dynamically. You also need to understand the structure of the website you want to scrape. Use your web browser's developer tools to inspect the HTML and CSS code.
Build a web scraper: Write a script to extract the data after selecting the library to scrape the data. Here are the steps for building the web scraper.
1. Set up your scraper development environment: Install the chosen scraping tool or library on your local machine. Set up your development environment and get ready.
2. Fetch the HTML content for the target website: Write a function to send a request to the target website and fetch the HTML content of the desired web page. Ensure you have systems for handling request timeouts and other possible scenarios.
3. Parse the HTML content using a parser library: Parse the HTML content using the parsing library of your web scraping framework to extract the specific data attributes you're trying to access.
4. Dealing with pagination: If the data you need is spread across multiple pages or requires interaction (e.g., clicking buttons or filling out forms), you must handle that navigation. This may involve analyzing the website's URL structure, submitting form data or following links to subsequent pages.
5. Handle anti-scraping measures and other issues: Some websites deploy anti-scraping technologies to prevent web scrapers. They use techniques such as timing delays, slow page loading, lazy loading of content, etc. To avoid detection or overcome these measures, you may need to implement additional strategies such as proxies, rotating user agents or introducing delays between requests.
Test the scraper: Run the web scraper on a small subset of the data to ensure it extracts the right information you need. If there are any issues, correct them.
Run the web scraper on a production server: Once testing passes, deploy the web scraper to a server or other production environment.
Store the data: Write it into a database or export it into a suitable format like CSV or JSON.
Clean and process the data: Depending on your use case, you may need to clean and preprocess the data before using it for analysis or other purposes.
Monitor the website: If you plan to scrape the website regularly, set up a monitoring system to check for changes in the website's structure or content.
Respect website policies: Follow the website's terms of service and data policies. Do not overload the website with requests; avoid scraping sensitive or personal information.
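
Steps 2, 3, and 5 of the scraper-building list can be sketched in a few lines. This is only an illustration, assuming the requests and BeautifulSoup libraries; the function names, the User-Agent value, and the retry and delay defaults are hypothetical choices, not recommendations from any particular site.

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical header value; some sites reject the default requests user agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def fetch_html(url, session=None, retries=3, timeout=10, delay=2.0):
    """Return the page HTML, retrying on timeouts and HTTP errors."""
    session = session or requests.Session()
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, headers=HEADERS, timeout=timeout)
            response.raise_for_status()  # treat 4xx/5xx responses as failures
            return response.text
        except requests.RequestException:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay)  # pause so retries do not hammer the site


def parse_title(html):
    """Parse the HTML and return the text of the <title> tag, if any."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None
```

Calling parse_title(fetch_html(url)) on a page you are allowed to scrape returns its title. Keeping fetching and parsing in separate functions makes it possible to test the parsing on saved HTML without sending any requests.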

ChatGPT can assist you with each of the steps mentioned above. When requesting assistance, provide precise information so that you receive correct and relevant answers. Start by specifying the website from which you wish to scrape data. You can either provide the URL or describe the website's structure and content to help ChatGPT understand the task better. Additionally, clearly state the specific data you want to extract, including elements, sections, or patterns of interest. If you have a preferred web scraping tool or library, such as BeautifulSoup or Scrapy, specify that as well.

Alternatively, you can leave the choice open-ended, and ChatGPT will suggest a suitable library based on your task requirements. If you have any additional requirements or constraints, such as pagination handling, dynamic content handling, or proxy usage, include them in your query. These details will help ChatGPT generate more accurate and relevant code.

It is essential to understand the different types of websites based on their characteristics and behavior before starting the web scraping process. These include:

Static Websites: These websites have fixed content that does not change frequently. The HTML structure remains the same each time you visit the site.
Dynamic Websites: These websites generate content dynamically using JavaScript, AJAX, or other client-side technologies. The content may change based on user interactions or data retrieved from external sources.
Websites with JavaScript Rendering: These websites heavily rely on JavaScript to render content dynamically. The data may be loaded asynchronously, and the HTML structure may undergo modifications after the initial page load.
Websites with Captchas or IP Blocking: These websites implement Captchas or block IP addresses to prevent automated scraping. Additional measures are required to overcome these obstacles during the scraping process. Approaching a professional web scraping company would be the way to go, as ChatGPT won’t be of much use here.
Websites with Login/Authentication: These websites require user login or authentication to access specific data. Proper authentication techniques must be employed to access and scrape the desired content.
Websites with Pagination: These websites display data across multiple pages, typically using pagination links or infinite scrolling. Special handling is necessary to navigate through and scrape content from multiple pages.
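
One quick way to tell a static page from a dynamic one is to check whether the data you want is already present in the raw HTML the server returns. The sketch below assumes BeautifulSoup; the function name, the sample markup, and the selector are made up for illustration.

```python
from bs4 import BeautifulSoup


def data_in_static_html(html, css_selector):
    """Return True if the unrendered HTML already contains the target element."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(css_selector) is not None


# Hypothetical responses: one page ships its data in the HTML, the other
# ships an empty container that JavaScript fills in later.
static_page = '<ul id="products"><li class="product">Toy car</li></ul>'
dynamic_page = '<ul id="products"></ul><script src="app.js"></script>'

print(data_in_static_html(static_page, "li.product"))   # True
print(data_in_static_html(dynamic_page, "li.product"))  # False
```

If the check fails on the raw response but the element is visible in the browser, the content is probably rendered by JavaScript and a browser automation tool is likely to be needed.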

It is essential to consider these characteristics and behaviors when selecting web scraping techniques and tools, because each situation may call for a different approach. BeautifulSoup is a popular Python library for parsing and navigating HTML/XML documents. Its simplicity and parsing capabilities make it well suited to scraping static websites.

On the other hand, for dynamic websites that generate content using JavaScript, AJAX, or other client-side technologies, Selenium is a valuable tool. Selenium is a widely-used web automation framework that allows you to control web browsers programmatically. It enables you to interact with dynamic elements, simulate user actions like clicks and form submissions, and retrieve the rendered HTML content after the JavaScript has been executed. This makes Selenium an excellent choice for scraping dynamic websites where traditional parsing libraries like BeautifulSoup may not be sufficient.

When dealing with more complex scenarios, such as websites with JavaScript rendering, you might consider using libraries like Playwright. Playwright is a powerful automation library that provides a unified API to control multiple web browsers, including Chromium, Firefox, and WebKit.

In this tutorial, we have selected Amazon as our e-commerce website to demonstrate web scraping using ChatGPT. To scrape Amazon effectively, it may be necessary to use advanced web scraping tools that are capable of handling dynamic content. Some suitable options include using Beautiful Soup with requests-HTML, Selenium, Scrapy, and Playwright.

As our example, we will target the Amazon search results page for "toys for kids". The target web page contains product details such as titles, images, ratings, and prices.

Also Read: How to Build an Amazon Price Tracker using Python

Scraping the Amazon Website with ChatGPT

The first step in web scraping is to extract product URLs from an Amazon webpage. To accomplish this, it is necessary to identify the URL element on the page that corresponds to the desired product. First, we need to check the structure of the webpage. To inspect components, right-click on any component of interest and select the "Inspect" option from the context menu. This will allow us to analyze the HTML code and find the data needed for web scraping.

To generate the code, click on the HTML of the element that holds the product URL in the developer tools, copy it, and paste it into your ChatGPT prompt. Here we will be utilizing Beautiful Soup for web scraping.

The code generated by ChatGPT will extract the URLs of products listed under the category of "toys for kids."

The program begins by importing the necessary libraries, requests and BeautifulSoup. The base URL is set to the Amazon India search page for toys for kids. The program then sends a request to the base URL using the Python requests library.

The response to the request is stored in the 'response' variable. Then, a Beautiful Soup object is created from the response content, using Python's built-in 'html.parser' as the parser. We could use other parsers such as lxml as well, but for this example, let's stick with 'html.parser'.

The program first defines a CSS selector that can locate the product URL. It then searches for all anchor elements (links) that match it and stores them in 'product_links'. Note that BeautifulSoup's 'find_all' method filters by tag name and attributes such as class, while its 'select' method accepts a CSS selector string directly. The matched elements contain the URLs of the products on the page.

We initialize an empty list named 'product_urls' to store the extracted URLs. A for loop then iterates through each element in 'product_links'. For each element, the 'href' attribute is extracted using BeautifulSoup's 'get' method. If a valid 'href' is found, it is appended to the base URL to form the complete URL of the product. This full URL is then added to the 'product_urls' list. Finally, the program prints the list of extracted product URLs as a quick check.
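
A minimal sketch of the program described above, not necessarily the code ChatGPT produced. The site root and the CSS selector are assumptions; Amazon's markup changes often, so inspect the live page and adjust them. The function takes HTML that has already been downloaded, so it can be tried on a saved page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.amazon.in"  # assumed site root
# Assumed selector for product links on a search results page.
PRODUCT_LINK_SELECTOR = "a.a-link-normal.s-no-outline"


def extract_product_urls(html, base_url=BASE_URL):
    """Return absolute product URLs found in one search results page."""
    soup = BeautifulSoup(html, "html.parser")
    product_links = soup.select(PRODUCT_LINK_SELECTOR)

    product_urls = []
    for link in product_links:
        href = link.get("href")  # None when the anchor has no href
        if href:
            # urljoin handles both relative and absolute hrefs.
            product_urls.append(urljoin(base_url, href))
    return product_urls
```

Using urljoin rather than plain string concatenation avoids malformed URLs when an 'href' is already absolute.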

In this use case, a CSS selector is used to locate the element from the Amazon product page. There are alternate ways to do that, like using an XPath expression. Some developers prefer XPath over CSS selectors. If you prefer to use XPath, mention "using XPath" in your initial prompt to ChatGPT.

Here is a quick tutorial on using XPath: Xpaths for web scraping
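
For comparison, here is a sketch of the same link extraction written with XPath. It assumes the lxml library, since BeautifulSoup itself does not evaluate XPath; the XPath expression and the site root are illustrative guesses rather than values taken from Amazon's current markup.

```python
from urllib.parse import urljoin

from lxml import html as lxml_html

BASE_URL = "https://www.amazon.in"  # assumed site root
# Assumed XPath: the href of every anchor whose class mentions "s-no-outline".
PRODUCT_LINK_XPATH = '//a[contains(@class, "s-no-outline")]/@href'


def extract_product_urls_xpath(page_html, base_url=BASE_URL):
    """Return absolute product URLs using an XPath query instead of CSS."""
    tree = lxml_html.fromstring(page_html)
    return [urljoin(base_url, str(href)) for href in tree.xpath(PRODUCT_LINK_XPATH)]
```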

The category we're scraping contains many products, each with a unique product URL, and the results are spread across several pages. Our objective is to scrape data from these individual pages (known as product description pages). We will solve the pagination problem by inspecting the Next button and copying its HTML into the ChatGPT prompt.

The code generated for this prompt is an extension of the first code snippet. We're just extending it to scrape all the product URLs from multiple pages of the search results on Amazon. The first snippet extracted product URLs from a single results page only. The second snippet introduces a while loop that iterates through multiple pages to handle the pagination.

The loop continues until there is no "Next" button on the page, indicating that all available pages have been scraped. The code checks if there is a "Next" button on the page using BeautifulSoup's find method. If a "Next" button is found, the URL for the next page is extracted and assigned to next_page_url. The base URL is then updated to next_page_url, allowing the loop to continue onto the next page. If no "Next" button is found, indicating that it is the last page of the search results, the loop breaks and the script prints the complete list of all the product URLs scraped.

After successfully navigating through an Amazon category, the next step is to extract the product information for each product. To do this, we need to examine the structure of the product page. By inspecting the webpage, we can identify the specific data we need for web scraping. By locating the appropriate elements, we can extract all the desired information and proceed with our web scraping process.
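
The loop described above might look like the following sketch. The selectors, the site root, and the max_pages safety cap are assumptions added for illustration, and fetch is a hypothetical callable that takes a URL and returns its HTML, so the loop can be exercised without touching the network.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.amazon.in"  # assumed site root
PRODUCT_LINK_SELECTOR = "a.a-link-normal.s-no-outline"  # assumed selector
NEXT_BUTTON_CLASS = "s-pagination-next"  # assumed class of the "Next" link


def scrape_all_product_urls(start_url, fetch, max_pages=50):
    """Follow "Next" links from start_url and collect every product URL."""
    product_urls = []
    page_url = start_url
    pages_seen = 0

    # max_pages guards against an endless loop if the site keeps linking on.
    while page_url and pages_seen < max_pages:
        soup = BeautifulSoup(fetch(page_url), "html.parser")
        for link in soup.select(PRODUCT_LINK_SELECTOR):
            href = link.get("href")
            if href:
                product_urls.append(urljoin(BASE_URL, href))

        # On the last page there is no "Next" anchor, so the loop ends.
        next_button = soup.find("a", class_=NEXT_BUTTON_CLASS)
        next_href = next_button.get("href") if next_button else None
        page_url = urljoin(BASE_URL, next_href) if next_href else None
        pages_seen += 1

    return product_urls
```

In real use, fetch could be a small wrapper around requests.get that sets headers and a timeout and returns response.text, ideally with a delay between calls.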

Let's explore how to scrape product names.

As before, we inspect the product name element and copy its HTML into the prompt.

Here, the code extends the scraper to extract the product names as well as the product URLs. It also uses the Pandas library to create a DataFrame from the collected data and save it to a CSV file. For each product URL, the code sends a request to the product page and finds the element containing the product name. The name is extracted and appended to the 'product_data' list along with the product URL. Once the scraping process is complete, we use Pandas to create a DataFrame from the 'product_data' list, which organizes the product URLs and names into columns. Finally, the DataFrame is saved to a CSV file named 'product_data.csv'.

Likewise, we can extract the price of each product. As before, we inspect the price element and copy its HTML into the prompt.
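
A sketch of the name extraction and the Pandas step described above. It assumes the product title lives in a span with the id "productTitle", which is a common but unverified guess, and it takes a hypothetical mapping of product URLs to already-downloaded HTML so that it runs without network access.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_product_name(html):
    """Return the product name from a product page, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("span", id="productTitle")  # assumed element id
    return title.get_text(strip=True) if title else None


def build_product_table(product_pages):
    """Build a DataFrame from a {product_url: page_html} mapping."""
    product_data = []
    for url, html in product_pages.items():
        product_data.append({"url": url, "name": extract_product_name(html)})
    return pd.DataFrame(product_data, columns=["url", "name"])
```

Calling build_product_table(pages).to_csv("product_data.csv", index=False) then writes the table to the CSV file named in the text.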
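
Price extraction follows the same pattern, with one extra step to turn the displayed text into a number. The selector below is an assumption about Amazon's markup and the function name is hypothetical.

```python
import re

from bs4 import BeautifulSoup

PRICE_SELECTOR = "span.a-price span.a-offscreen"  # assumed selector


def extract_price(html):
    """Return the price as a float, or None if no price is found."""
    soup = BeautifulSoup(html, "html.parser")
    price_tag = soup.select_one(PRICE_SELECTOR)
    if price_tag is None:
        return None
    # Pull the first number out of text such as "₹1,299.00".
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_tag.get_text())
    return float(match.group().replace(",", "")) if match else None
```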

Similarly, we can extract all other product information such as rating, number of reviews, image, etc.

Rating:

Number of Reviews:

Image:
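
The three fields listed above can be pulled from a product page in one pass. This is a sketch under assumed selectors (the rating text, the review count element, and the main image id are guesses about Amazon's markup), and the helper names are hypothetical.

```python
import re

from bs4 import BeautifulSoup


def _first_number(text):
    """Return the first number in text as a float, ignoring thousands commas."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def extract_rating_reviews_image(html):
    """Return rating, review count, and image URL; missing fields are None."""
    soup = BeautifulSoup(html, "html.parser")
    rating_tag = soup.select_one("span.a-icon-alt")              # e.g. "4.3 out of 5 stars"
    reviews_tag = soup.select_one("span#acrCustomerReviewText")  # e.g. "1,234 ratings"
    image_tag = soup.select_one("img#landingImage")

    rating = _first_number(rating_tag.get_text()) if rating_tag else None
    reviews = _first_number(reviews_tag.get_text()) if reviews_tag else None
    return {
        "rating": rating,
        "reviews": int(reviews) if reviews is not None else None,
        "image": image_tag.get("src") if image_tag else None,
    }
```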

Let's standardize the code for better readability, maintainability, and efficiency without altering its external behavior.
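
One way such a cleanup could look is to drive the extraction from a single table of selectors, so that a markup change means editing one dictionary rather than several functions. The field names and selectors are assumptions for illustration, and this is a sketch rather than the refactored code the post refers to.

```python
from bs4 import BeautifulSoup

# Assumed selectors for text fields on a product page; adjust after inspecting it.
FIELD_SELECTORS = {
    "name": "span#productTitle",
    "price": "span.a-price span.a-offscreen",
    "rating": "span.a-icon-alt",
    "reviews": "span#acrCustomerReviewText",
}
IMAGE_SELECTOR = "img#landingImage"  # assumed selector for the main image


def parse_product(html, url=None):
    """Return one record per product page; fields that are not found are None."""
    soup = BeautifulSoup(html, "html.parser")
    record = {"url": url}
    for field, selector in FIELD_SELECTORS.items():
        tag = soup.select_one(selector)
        record[field] = tag.get_text(strip=True) if tag else None
    image = soup.select_one(IMAGE_SELECTOR)
    record["image"] = image.get("src") if image else None
    return record
```

A list of these records can be passed straight to pandas.DataFrame for export.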

Also Read: A Guide to Scrape Indeed using Selenium and BeautifulSoup

Limitations of using ChatGPT for web scraping

When requesting ChatGPT to generate code for web scraping, there are several limitations to be aware of:

1. Limited Contextual Understanding

ChatGPT has limited contextual understanding beyond a few preceding messages. It may not be aware of the specific website or preferred web scraping libraries, which can result in code that doesn't precisely align with your requirements.

2. Accuracy and Error Handling

The generated code may not always be accurate or error-free. ChatGPT's responses are derived from patterns and examples in its training data, and there is a possibility of syntax errors or code that doesn't function as intended. Additionally, the code may lack comprehensive handling of edge cases or effective error handling.

3. Limited Knowledge of Recent Advances

At the time of writing, ChatGPT's training data was current only up to September 2021, so it may not be aware of the most recent advancements in web scraping libraries, techniques, or changes in website structures or APIs. This can result in code generation that is less accurate or incomplete for newer technologies. It is common to see deprecation warnings and errors when running code generated by ChatGPT.

4. Adherence to Best Practices

The generated code may not adhere to best practices or employ the most efficient implementation strategies. It's important to review and optimize the code for performance, readability, and maintainability. Additionally, it may lack robust error handling or fail to account for all potential edge cases, so appropriate error-handling mechanisms should be incorporated.

5. Tool and Library Recommendations

While ChatGPT may suggest specific web scraping tools or libraries based on the information provided, it may not consider all available options or your project's specific requirements. It's essential to conduct your own research and choose the appropriate tools or libraries based on your needs.

6. Complex or Dynamic Websites

Web scraping can become challenging when dealing with complex or dynamically generated web pages. ChatGPT might generate code that works for simple websites but fails to handle dynamic content, JavaScript rendering, or CAPTCHAs.

7. Limited Back-and-Forth Dialogue

ChatGPT operates on a message-response basis, which limits its ability to engage in a back-and-forth dialogue to fully understand and refine your specific requirements. This can result in code generation that is less accurate or incomplete.

8. Legal and Ethical Considerations

Web scraping may have legal restrictions or be against the terms of service of certain websites. It's essential to ensure compliance with applicable laws, regulations, and website policies. Obtain proper permissions if required and respect the website's terms of use.

ChatGPT generates just a basic web scraper, and it may not be ideal for production-level usage. Considering these limitations, it's essential to use the generated code as a starting point, carefully review it, and make necessary modifications according to best practices, specific requirements, and the latest web technologies. To enhance the code, it's advisable to leverage your own expertise and conduct additional research. Additionally, it is crucial to be mindful of legal and ethical considerations when engaging in web scraping activities.

If you are a beginner looking to learn web scraping or need a basic one-time scraping project, asking ChatGPT can be a suitable option. However, if you require regular data extraction or prefer not to spend significant time on web scraping code, it is recommended to seek assistance from a professional company like Datahut that specializes in web scraping.

Wrapping up

Web scraping has become essential for data gathering, but it can be intimidating for beginners. LLM-based tools like ChatGPT have made web scraping more accessible.

ChatGPT provides guidance and support, helping beginners scrape data effectively. It offers detailed explanations and helps build knowledge and confidence in web scraping. By following the steps involved in web scraping and utilizing the appropriate tools like BeautifulSoup, Selenium, or Playwright, beginners can extract data from websites and make informed decisions. While ChatGPT has its limitations, it serves as a valuable resource for beginners and experienced users alike.

Looking for reliable web scraping services for your data needs? Contact Datahut today!