The Amazon Standard Identification Number, or ASIN, is a unique identifier that Amazon uses to catalog its millions of products. It is a ten-character alphanumeric code that acts as a product identifier on Amazon. Every product listed on Amazon has its own ASIN, making it easy to pinpoint the exact item. Our Amazon ASIN lookup tool leverages this system to help you quickly gather detailed product information with just a few clicks, eliminating the need to navigate through the extensive Amazon Seller Central.
Search results on Amazon are shaped by many factors, such as relevant keywords, long-tail keywords, and Amazon SEO practices, which makes it hard for users to get back to a specific product they have already viewed. In a product database as vast as Amazon's, where Amazon SEO, competitor keywords, keyword rankings, and indexing hold significant sway within the Seller Central ecosystem, relying solely on keyword-based searches can be inefficient and time consuming. In such scenarios, the ASIN lookup tool becomes a valuable solution, giving users direct access to product information via ASINs.
By harnessing the power of the ASIN, our lookup tool simplifies your shopping experience, saves you time, and helps you make better decisions by offering a simple means to retrieve and revisit product details with precision. Whether you're a shopper, an Amazon seller, or even a researcher, our ASIN tool is your go-to solution for navigating Amazon's immense product catalog efficiently.
How Does the ASIN Tool Work?
Our Python-powered tool utilizes a custom scraper to fetch product data corresponding to provided ASIN values. This functionality empowers users to effortlessly retrieve product details without navigating through the diverse and vast product listings generated by keyword-based searches.
The data obtained is displayed in a Streamlit frontend. An option to download and store the data is also provided, allowing users to make informed decisions with precision and confidence.
Now let us look at how the tool works.
As mentioned earlier, our tool consists of two parts: the scraper and the frontend user interface. Let's look at both of them and see how they are connected.
Tracking down information with the Scraper
Before diving into our scraper, let's first understand what web scraping is. Web scraping involves automatically gathering data from websites. Think of it as a tiny robot that navigates through a site, gathers information, and organizes it, much like a spider weaving its intricate web.
Importing Libraries
The scraper we have made uses the following libraries:
asyncio
unicodedata
pandas
price_parser
playwright
Asyncio: Extracting data with Asynchronous Programming
Asyncio, short for Asynchronous I/O, is a Python standard library module for writing concurrent code. Its primary advantage is that a program can start other tasks while one task is waiting on I/O, instead of waiting for each task to complete before starting the next. This introduces the concept of asynchronous programming, which will be explained in more detail later.
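As a rough illustration of that idea, the sketch below runs three simulated page fetches concurrently with asyncio.gather. The fetch_page function and the ASIN strings are placeholders invented for this example, and asyncio.sleep stands in for the network wait.

```python
import asyncio
import time


async def fetch_page(asin: str, delay: float) -> str:
    """Pretend to fetch a product page; asyncio.sleep simulates network latency."""
    await asyncio.sleep(delay)
    return f"page for {asin}"


async def fetch_all(asins: list[str]) -> list[str]:
    # gather() schedules every coroutine at once and returns the results
    # in the same order as the inputs.
    return await asyncio.gather(*(fetch_page(asin, 0.1) for asin in asins))


start = time.perf_counter()
pages = asyncio.run(fetch_all(["ASIN000001", "ASIN000002", "ASIN000003"]))
elapsed = time.perf_counter() - start

print(pages)
print(f"elapsed: {elapsed:.2f}s")  # roughly one delay, not three in a row
```

Because the three waits overlap, the total time stays close to a single delay.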
Unicodedata
To represent the multitude of characters found in different languages, some kind of character encoding system has to be used. Unicode is exactly such a system.
Unicode represents characters from many writing systems, including Latin, Greek, Chinese, and many others. It is the standard character encoding system used by most modern computers.
In the context of web scraping, the unicodedata module is mainly used to detect and remove invisible characters from text. It also helps with normalization issues, such as when the same visible text is represented by different sequences of code points.
Pandas: For working with tabular data
The scraped data has to be structured and stored appropriately. This is where Pandas comes in. It provides a number of data structures to aid this task.
The DataFrame is one such data structure. It is tabular, similar to a spreadsheet, and is used to store and organize data in rows and columns. The data stored in a DataFrame can then be cleaned, analyzed, and converted into a CSV file.
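To make this concrete, here is a minimal sketch that builds a DataFrame from two made-up product records and converts it to CSV text. The records and column names are illustrative assumptions, not the tool's actual schema.

```python
import pandas as pd

# Hypothetical scraped records; the column names are illustrative only.
records = [
    {"product_name": "Example Kettle", "selling_price": 1299.0, "average_rating": 4.5},
    {"product_name": "Example Toaster", "selling_price": 899.0, "average_rating": 4.1},
]

df = pd.DataFrame(records)         # one row per product, one column per field
csv_text = df.to_csv(index=False)  # with no path given, to_csv returns a string

print(df.shape)
print(csv_text)
```

Passing a file path to to_csv instead would write the same content to disk.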
Extracting prices
Prices scraped from a web page are seldom standalone and usually come with some surrounding text.
The Price class in the price_parser library enables us to parse and extract price information from text. It is designed to identify and process prices in various formats, making it the right tool for extracting prices from text that also contains other supplementary data.
We pass the input string that contains the price information to Price.fromstring(). The string can include currency symbols, numerical values, decimal points, thousands separators, currency codes, and other supplementary text. From the parsed result we take the numerical value.
Playwright
Playwright is a powerful end-to-end testing and web automation framework developed by Microsoft. It supports multiple browsers such as Chromium, Firefox, and WebKit, making it a versatile choice for developers. Playwright enables developers to write tests that interact with web applications in a way that simulates real user behavior. This ensures that applications function correctly across different browsers and devices. It also makes Playwright a great web scraping tool. Playwright works with headless browsers to provide a more powerful and efficient way to automate web testing.
A headless browser is a web browser without a graphical user interface (GUI). These browsers can be controlled programmatically, making them ideal for automated testing, web scraping, and performance monitoring.
Playwright offers both asynchronous and synchronous APIs. Our tool uses the asynchronous one by importing ‘async_playwright’ from the Playwright library.
Asynchronous programming enables applications to perform tasks concurrently rather than sequentially. This is particularly useful in web automation and testing, where multiple tasks such as network requests, file I/O, or user interactions need to be handled simultaneously. By leveraging asynchronous programming, Playwright can execute multiple actions in parallel, such as navigating to a webpage, clicking buttons, and validating responses, thus significantly speeding up test execution and improving overall performance.
Main Working
Let’s look into the main working of the scraper.
The scraper starts by creating a new context manager block that uses the playwright library.
Context managers help manage resources and ensure that certain setup and teardown operations are performed in a clean and predictable way. They are commonly used for tasks like file handling, database connections, or, in this case, browser automation. In Python, context managers are typically entered with the ‘with’ statement, or with ‘async with’ in asynchronous code.
Playwright's asynchronous environment is set up using 'async_playwright()' which helps with configuring and launching web browser instances for automation and using asynchronous programming methods.
In the context manager, the result of calling async_playwright() is assigned to the variable pw. This allows us to reference the Playwright environment and its functionality within the context block using the variable pw.
One of the advantages of using a context manager is that it automatically performs cleanup actions when we exit the block. In the case of Playwright, it ensures that the browser instances are properly closed when we exit the block. This helps prevent resource leaks and ensures that browser instances are shut down gracefully.
After setting up the context manager, a new Chromium browser instance for web scraping is launched. Note that Chromium for Playwright has to be installed separately using the command “playwright install chromium”.
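The cleanup guarantee can be seen in a small standard-library sketch. Here fake_browser is a stand-in resource invented for the example, not part of Playwright; it only records when setup and teardown happen.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def fake_browser():
    """Stand-in resource that records its own setup and teardown."""
    events.append("launch")      # setup runs on entering the block
    try:
        yield "browser"
    finally:
        events.append("close")   # teardown runs on exit, even after an error


async def main():
    try:
        async with fake_browser() as browser:
            events.append(f"using {browser}")
            raise RuntimeError("scrape failed")
    except RuntimeError:
        events.append("error handled")


asyncio.run(main())
print(events)
```

The "close" event is recorded before the error reaches the except clause, which is the behaviour that prevents leaked browser processes.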
A context is then created for the browser. A context is like a separate browsing session. A new page is created within this context, which will be used to navigate and scrape data from the web page corresponding to the ASIN provided.
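The setup described above might look roughly like the following sketch. The function name fetch_page_title and the example URL are assumptions made for illustration; the Playwright calls themselves (async_playwright, chromium.launch, new_context, new_page, goto, title) are part of its asynchronous API.

```python
import asyncio


async def fetch_page_title(url: str) -> str:
    """Open url in headless Chromium and return the page title."""
    # Imported here so the module still loads where Playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        context = await browser.new_context()  # isolated browsing session
        page = await context.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title


# Example call (needs network access and `playwright install chromium`):
# print(asyncio.run(fetch_page_title("https://example.com")))
```

A real scraper would replace the title lookup with the extraction functions and pass in the product page URL.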
When an ASIN is provided to the scraper, it uses the BASE_URL (described in the next section) to look up the ASIN on the Amazon website. If the ASIN is invalid, Amazon loads an error page instead of a product page.
The scraper detects this error page and prints the message “Invalid ASIN”.
If a valid ASIN was provided, a working product page is loaded and the scraper then obtains the following data from it:
Product Name: The name of the product.
Discount: The percentage discount applied to the product.
Selling Price: The current selling price of the product.
Max Retail Price: The maximum retail price of the product.
Currency Used: Shows the currency in which the product price is being displayed.
Average Rating: The average rating of the product.
Rating Count: The number of ratings available for the product.
Product Specifications: Detailed technical specifications or attributes of the product.
After obtaining the data, the browser instance is closed and the data is stored in a CSV file.
Now, let’s look into the Global Constants and various functions used by the scraper to complete execution.
Global Constants
Our scraper uses only a single global constant, the BASE_URL. The scraper builds the product page URL by appending the provided ASIN value to the end of the BASE_URL.
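A minimal sketch of that URL construction follows. The BASE_URL value shown is an assumption (the scraper's real constant may differ), and the format check is an extra added for illustration, based on the ten-character alphanumeric shape of an ASIN.

```python
import re

# Assumed value; the scraper's real BASE_URL may differ.
BASE_URL = "https://www.amazon.com/dp/"

# An ASIN is ten alphanumeric characters.
ASIN_PATTERN = re.compile(r"[A-Z0-9]{10}")


def build_product_url(asin: str) -> str:
    """Return the product page URL for asin, or raise ValueError if malformed."""
    cleaned = asin.strip().upper()
    if not ASIN_PATTERN.fullmatch(cleaned):
        raise ValueError(f"Invalid ASIN: {asin!r}")
    return BASE_URL + cleaned


print(build_product_url(" b000000000 "))
```

A well-formed ASIN can still point to a product that does not exist, so the page-level check for an invalid ASIN remains necessary.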
Extracting Product Information
Several functions are defined for extracting different pieces of information from the product page, including the name, prices, ratings, and various details. Here's an explanation of each function:
get_product_name()
This function takes a web page object as input and waits until a <div> element with an id of ‘titleSection’ is found on the page.
After the element is found, its text content is extracted and any whitespace at the beginning or end is removed using the strip() function.
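One possible shape for this function is sketched below. The "Not available" fallback and the FakePage and FakeElement classes are assumptions added so the example can run without a browser; with Playwright, page.wait_for_selector and text_content are the real calls being imitated.

```python
import asyncio


async def get_product_name(page) -> str:
    """Wait for the title section and return its trimmed text."""
    element = await page.wait_for_selector("#titleSection")
    text = await element.text_content()
    # text_content() can return None, so fall back to a placeholder (assumed).
    return (text or "Not available").strip()


# Tiny stand-ins for Playwright's Page and ElementHandle, for a dry run only.
class FakeElement:
    async def text_content(self):
        return "\n   Example Kettle, 1.5 L   \n"


class FakePage:
    async def wait_for_selector(self, selector):
        assert selector == "#titleSection"
        return FakeElement()


name = asyncio.run(get_product_name(FakePage()))
print(repr(name))
```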
get_product_price_discount()
When we go through various products on the Amazon website, we sometimes see products that don’t have any discount. This means that the product has the same selling price and MRP, and only one of them is displayed.
Most products have their selling price stored in a <span> tag of class ‘a-price aok-align-center reinventPricePriceToPayMargin priceToPay’ and MRP in a <span> tag of class ‘a-price aok-align-center reinventPricePriceToPayMargin priceToPay’. In some cases, however, the selling price is stored in a <span> tag of class ‘a-price a-text-price a-size-medium apexPriceToPay’ and the MRP in a <span> tag of class ‘a-price a-text-price a-size-base’. If the first set of tags isn’t found on the page, we search for the second set. The second set is more commonly seen when the product has no discount and its selling price equals its MRP. In such cases the MRP tag generally produces a zero value, so we set the MRP equal to the selling price.
The price_parser library is used here to extract the price while discarding whitespace and other surrounding details.
After getting the selling price and MRP, we calculate the discount percentage from them.
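The MRP fallback and the discount calculation can be sketched as two small functions. The names resolve_prices and discount_percent are hypothetical, and rounding to two decimals is an assumption.

```python
def resolve_prices(selling_price: float, mrp: float) -> tuple[float, float]:
    """Fall back to the selling price when the MRP was parsed as zero or missing."""
    if not mrp:
        mrp = selling_price
    return selling_price, mrp


def discount_percent(selling_price: float, mrp: float) -> float:
    """Percentage reduction of the selling price relative to the MRP."""
    if mrp <= 0:
        return 0.0
    return round((mrp - selling_price) / mrp * 100, 2)


print(discount_percent(750.0, 1000.0))
print(discount_percent(*resolve_prices(499.0, 0.0)))
```

The first call covers a discounted product, the second a product whose MRP tag produced zero and therefore has no discount.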
get_avg_rating()
Customer reviews require a lot of processing, and their sheer quantity can make analysis difficult and time consuming. For that reason we consider only the customer ratings here.
The average rating of the product is present in a <span> tag having a ‘data-hook’ attribute of value ‘rating-out-of-text’. We obtain the text from the <span>, but it is in the form ‘4.5 out of 5’. So we split the text on whitespace and take only the first value, which is the rating (4.5 in this example).
get_rating_count()
The total number of ratings given to a product is extracted from a <span> tag having a ‘data-hook’ attribute of value ‘total-review-count’. The extracted value is in the form ‘10,000 ratings’. So we split the text on whitespace and take the first element, which is the count (10,000 in this example).
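The two splitting steps described for get_avg_rating() and get_rating_count() can be sketched as follows. The function names are hypothetical, and converting the results to float and int (including dropping the thousands separator) is an extra step assumed here for convenience.

```python
def parse_avg_rating(text: str) -> float:
    """Take the first whitespace-separated token of text like '4.5 out of 5'."""
    return float(text.split()[0])


def parse_rating_count(text: str) -> int:
    """Take the first token of text like '10,000 ratings' and drop the commas."""
    return int(text.split()[0].replace(",", ""))


print(parse_avg_rating("4.5 out of 5"))
print(parse_rating_count("10,000 ratings"))
```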
get_product_specs()
Product details encompass a range of vital information, including product category, material composition, components, style, manufacturer details, and more. While the specific details may vary depending on the nature of the product, they collectively form essential data that users require for informed decision-making.
When we visit a product detail page (PDP) on Amazon, we can see that all the product details are shown near the end of the page, just before the reviews and ratings. This function finds a <div> element with an id of ‘prodDetails’ on the page; this element contains the product details in their entirety. Within the <div> element, the data is stored in a <table> element. We go through each row of the table, extracting each specification name and its value and storing them as a pair.
Sometimes the specifications are not stored inside tables, and the element that holds them may vary. In such cases we search for the element containing the text “Product Details” itself and use it to find the elements that store the product specification data.
One important aspect of this function is that the extracted text is cleaned using unicodedata. The specification name and value pairs we extract sometimes contain an invisible Unicode character called RIGHT-TO-LEFT MARK, which affects how adjacent characters are ordered with respect to text direction. Amazon uses it when the product specifications are displayed as a column and not as a table. To deal with this character, we perform Unicode normalization on the text and filter out the unwanted characters.
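As a hedged sketch of such a cleanup step: normalization on its own leaves RIGHT-TO-LEFT MARK (U+200F) in place, so the example below also drops every character in the Unicode "Cf" (format) category. The function name clean_spec_text and the sample string are assumptions, and this may differ from the tool's own cleanup code.

```python
import unicodedata


def clean_spec_text(text: str) -> str:
    """Normalize text, drop invisible format characters, and tidy whitespace."""
    normalized = unicodedata.normalize("NFKC", text)
    # Category "Cf" covers format characters such as U+200F (RIGHT-TO-LEFT MARK)
    # and U+200E (LEFT-TO-RIGHT MARK).
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return " ".join(visible.split())


raw = "Item Weight \u200f : \u200e 250 g"
print(repr(clean_spec_text(raw)))
```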
Now let’s look into the second aspect of our tool - the frontend app.
Building a Dynamic Frontend with Streamlit
The frontend refers to the part of a website that users see and interact with directly. This includes things like the layout, colors, images, text, buttons, and animations. Frontend developers are responsible for creating and maintaining the user interface (UI) of a website using technologies like HTML, CSS, and JavaScript.
However, this also means that building a website through which users can interact with our tool requires considerable knowledge of technologies like HTML, CSS, and JavaScript.
This led to the creation of alternatives that don't require such in-depth knowledge of these fields. One such alternative is Streamlit, which we have used to build the frontend of our tool.
Importing Libraries
The frontend has been implemented with the following libraries.
Streamlit
os module
Pandas
Streamlit: Building data apps in minutes
Streamlit is an open-source Python library that makes it super easy to build and share beautiful, interactive web apps for data science and machine learning. It is an easy to use library that turns Python code into smooth applications within minutes.
Major benefits of using Streamlit are:
No need to learn frontend frameworks like React or Angular. Streamlit uses only Python, making it familiar and accessible for data scientists.
Streamlit apps are fast and interactive. An app starts with a single command and updates live as we modify our script.
Streamlit provides built-in components for charts, graphs, tables, and other visuals, letting us create rich and informative interfaces.
We can publish our Streamlit app with a single click and share it with others through a link. No complex server setup required.
Streamlit covers everything one needs for data exploration and model deployment, including data loading, pre-processing, visualization, and user interaction.
Main working
Now let’s look at how our frontend user interface is actually implemented. We will see not only how the user interface has been set up, but also how the frontend interacts with the scraper and with the scraped data.
Introductory content for our tool
Our tool has some general description in the beginning of the page before we move into the core functionalities.
This is just a simple description of what our tool is and what it does.
Additional Info
The app provides an expander with the question “What is an ASIN?”
The user can expand the question to get a quick recap on ASINs. The content can be expanded or collapsed as the user chooses.
Entering ASIN value
A simple search bar is present for entering the ASIN value of the product to be searched.
Since the data has to be scraped from Amazon, it means that our scraper has to open the browser, go to the page based on the entered ASIN value, wait for it to load and then extract all the necessary details. This means that it takes some time for the extracted data to be displayed after entering an ASIN value.
To tell the user that the scraping is in progress and that the app hasn't simply frozen, a spinner is shown until the scraping is done.
One of the issues with Streamlit is that if the app reruns for some reason, it goes back to its initial appearance. This means that the scraping has to be done again, and the user has to wait again to look at what they had already seen.
To solve this issue, we use Streamlit's session_state to control when the data is displayed. A display_data_bool attribute is added to session_state and set to true once a product's data has been scraped, so the information stays on screen even if the app reruns.
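One way this could be wired up is sketched below. Only display_data_bool comes from the description above; the other keys, the function names, and the fake_scrape stand-in are assumptions. The caching logic is kept in a plain function that works with any dict-like object, including st.session_state, so it can be exercised without running Streamlit.

```python
def get_product_data(state, asin, scrape):
    """Return data for asin, calling scrape only when nothing is cached for it."""
    cached = state.get("display_data_bool", False) and state.get("asin") == asin
    if not cached:
        state["product_data"] = scrape(asin)
        state["asin"] = asin
        state["display_data_bool"] = True
    return state["product_data"]


def render(scrape):
    """Streamlit wiring; call at the end of a script started with `streamlit run`."""
    import streamlit as st  # imported lazily so get_product_data is testable alone

    asin = st.text_input("Enter ASIN")
    if st.button("Search") and asin:
        with st.spinner("Scraping product data..."):
            get_product_data(st.session_state, asin, scrape)
    if st.session_state.get("display_data_bool", False):
        st.write(st.session_state["product_data"])


# Dry run with a plain dict standing in for st.session_state.
calls = []


def fake_scrape(asin):
    calls.append(asin)
    return {"asin": asin}


state = {}
get_product_data(state, "B000000000", fake_scrape)
get_product_data(state, "B000000000", fake_scrape)  # served from state, no new scrape
print(calls)
```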
Viewing the scraped data
After entering the ASIN value and pressing search, the scraping happens. The data thus scraped is then displayed below the search input field.
Every value that was scraped will be displayed with the name of each field in bold letters.
Products that have no price listed, like those described earlier, can be identified by checking for a ‘Not available’ value in the discount, selling price, and MRP fields. In such cases we display the message “No price present” in place of those fields.
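The display rule can be sketched as a function that turns the scraped fields into Markdown lines, which Streamlit's st.markdown would render with bold field names. The function name and the exact field labels are assumptions for illustration.

```python
def format_product_lines(data: dict) -> list[str]:
    """Build Markdown lines with bold field names; collapse missing prices."""
    price_fields = ("Discount", "Selling Price", "Max Retail Price")
    no_price = all(data.get(field) == "Not available" for field in price_fields)

    lines = []
    message_shown = False
    for field, value in data.items():
        if no_price and field in price_fields:
            # Show one message where the three price fields would have been.
            if not message_shown:
                lines.append("No price present")
                message_shown = True
            continue
        lines.append(f"**{field}:** {value}")
    return lines


sample = {
    "Product Name": "Example Kettle",
    "Discount": "Not available",
    "Selling Price": "Not available",
    "Max Retail Price": "Not available",
    "Average Rating": 4.5,
}
lines = format_product_lines(sample)
print("\n".join(lines))
```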
Saving the extracted data
The user can store the extracted data using the download button at the end of the displayed data, that is, at the bottom of the page.
On clicking download, the values of the product specs field are first tidied up. The data is then downloaded to the user’s system as a file named “product_data.csv”.
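A minimal sketch of the tidy-up and CSV step is shown below. It uses the standard csv module rather than Pandas to keep the example dependency-free, and the function name, field labels, and "name: value" layout for the specs are all assumptions.

```python
import csv
import io


def product_to_csv(product: dict) -> str:
    """Serialize one product to CSV text, flattening the specs dict first."""
    row = dict(product)
    specs = row.get("Product Specifications", {})
    # One "name: value" pair per line keeps the cell readable in a spreadsheet.
    row["Product Specifications"] = "\n".join(f"{k}: {v}" for k, v in specs.items())

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


csv_text = product_to_csv(
    {
        "Product Name": "Example Kettle",
        "Product Specifications": {"Brand": "Acme", "Colour": "Black"},
    }
)
print(csv_text)
```

In a Streamlit app, text like this could be handed to st.download_button with file_name="product_data.csv" and mime="text/csv".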
Conclusion
Our ASIN tool offers a comprehensive solution for extracting and displaying product data from Amazon using a custom Python scraper and a user-friendly Streamlit frontend.
By leveraging libraries such as Playwright, price_parser, and Pandas, our scraper efficiently retrieves product information like names, prices, ratings, and specifications.
The frontend, built with Streamlit, ensures that users can easily interact with the tool, input ASIN values, and view the retrieved data in an organized manner. Additionally, the option to download the extracted data as a CSV file enhances the tool's utility for users needing to store or analyze the information offline.
For seamless integration and expert assistance in web scraping solutions, consider contacting Datahut. We specialize in providing comprehensive data extraction services, ensuring efficient and reliable results for your business needs.