
How to Scrape Macy’s Sale Section Using Python and Playwright

  • Writer: Anusha P O

When you think of Macy’s, it is not just a department store—it is an American retail legacy established in 1858. What began as a humble store located in Manhattan has become one of the largest and arguably best-known retail department stores in America. Macy's currently operates in hundreds of locations across the mainland United States, including its world-renowned flagship store located at Herald Square, New York City— a retail space spanning over a million square feet.


Macy's is not simply known for its variety; it is also known for the full retail experience. From casual everyday wear to upscale items for special events, Macy’s has something for everyone. With more than a century and a half of retail experience behind it, the Macy's brand appeals to shoppers looking for both everyday essentials and current fashion trends.


Scrape Macy’s Sale Section: Data Collection


The data collection phase for Macy's clothing consisted of two steps: first, obtaining all of the product page links, and then extracting and structuring the data for deeper analysis. We narrowed our data collection to women's clothing, then focused further on the "Sales & Clearance" section.


Step 1: Collecting Product Links from Macy’s “Sales & Clearance” Section


To kick off our scraping project, our first goal was to collect all the product links from the women’s clothing sale and clearance section on Macy’s website. Now, like many online shopping sites, Macy’s doesn’t show all its products on a single page. Instead, the items are spread across many pages, and you need to click a “Next” button to see more. That means we couldn’t just grab everything in one go—we had to visit each page one by one.


To handle this, we used a tool called Playwright. Think of Playwright as a helper that can use a browser just like a real person would. It opens the website, waits for everything to load properly, and then scrolls down the page naturally so all the products become visible. Since Macy’s uses pages instead of endless scrolling, Playwright clicks the “Next” button to move from one page to the next.


As it moves through the pages, Playwright also checks for any pop-up messages—like cookie permissions or sales promotions—and closes them so they don’t block the view or interrupt the scraping. This helps make sure we don’t miss any product links.

Once the product links are gathered from a page, they’re saved into an SQLite database along with their category. This way, the data stays neat and easy to manage later. Using a database instead of just saving the links to a basic file makes it easier to search, filter, and analyze them in the future. We also made sure not to store the same link more than once, which helped us avoid duplicates.


By the end of this step, the scraper had collected unique product URLs. That gave us a strong foundation to move on to the next part—gathering more detailed information about each item, like names, prices, colors, and any discounts they might have.


Step 2: Extracting Information from Product Pages


Once we’ve saved all the product links from Macy’s women’s clothing sale and clearance section into our SQLite database, the next step is to visit each product page one at a time. But we’re not just collecting links anymore—we’re going deeper. Now, we want to gather all the important details about each item, like the product’s name, its current price, the original price (or what it cost before the sale), how many reviews it has, the brand, the discount offered, available sizes, the SKU number, whether it’s in stock, and even the product description.


Now, once we’ve collected this product data, we need to store it somewhere smart. For this task, we’re using MongoDB. It’s a type of database that’s very flexible and can handle different kinds of data easily—perfect for the variety of details we’re collecting. And because scraping large websites can sometimes crash or pause unexpectedly, we’re also saving a copy of the data in backup files called JSONL files. These act as a safety net. If the scraper stops halfway through for some reason, we won’t have to start all over again—we can just pick up where we left off using the saved data.


In short, we’re blending smart automation, careful data gathering, and reliable backup methods to collect detailed product information from Macy’s sale section smoothly and efficiently.


Data Cleaning


When you first collect data from Macy’s, it doesn’t always come in neat and clean. You’ll often find weird symbols in the prices, missing values (like empty cells), or information that’s all over the place. For example, some items might not have a discount or a proper description — in those cases, it's a good idea to replace those blanks with "N/A" so that it's clear there's no data, rather than leaving it empty.


Also, the original price sometimes includes a lot of extra symbols and even unwanted text. One of the first things we did was remove all those unnecessary symbols so the price looks clean and easy to read — just like how you'd snip off price tags before wearing new clothes.


Now, to actually do the cleaning, there are some tools that make life a whole lot easier. One of them is OpenRefine. Don’t worry if you’ve never heard of it — it’s kind of like a super-powered version of Excel. You can use it to remove duplicates, fix inconsistent data (like different spellings of the same brand), and tidy things up with just a few clicks. It’s great for visual learners and super easy to pick up.


But sometimes the mess goes a little deeper; for example, the data might contain stray HTML tags, or the date formats might be inconsistent. That's when Python comes in — and more specifically, a really valuable library called pandas. Pandas lets you clean and manipulate your data in a more organized, repeatable way. Think of it like a tiny robot assistant that sorts everything out for you behind the scenes.
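To make that concrete, here is a minimal pandas sketch of the clean-up steps described above. It assumes the JSONL backup file produced later in this post (with fields like url, mrp, and discount); the clean_price column name and the exact regex rules are only illustrations, not the project's actual cleaning script.

import pandas as pd

# Load the scraped records back from the JSONL backup file (one JSON object per line).
df = pd.read_json("macys_products_data.jsonl", lines=True)

# Replace missing discounts with "N/A" so gaps are explicit rather than empty.
df["discount"] = df["discount"].fillna("N/A")

# Strip currency symbols, commas, and stray text from the original price,
# e.g. "USD 1,299.00" becomes 1299.00, then convert it to a number.
df["clean_price"] = pd.to_numeric(
    df["mrp"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

# A similar regex pass can drop stray HTML tags from a description field,
# e.g. df["description"].str.replace(r"<[^>]+>", "", regex=True)

# Drop any duplicate products based on their URL.
df = df.drop_duplicates(subset="url")
print(df[["url", "clean_price", "discount"]].head())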


You know, cleaning your data may not sound very exciting, especially at first, but it is absolutely essential if you want to do anything useful with your project! And once your data is cleaned up, everything else becomes easier and more pleasurable!



Advanced Tool-sets and Libraries for Efficient Data Scraping


In order to scrape a large volume of data from websites in an efficient manner, it is vital you use the proper combination of tools and libraries. This project employs a few powerful Python libraries and modules that work in unison to automate the browsing, data extraction, data storage, and data handling processes in a coherent and reliable way.


The first library we are going to take a look at is asyncio. This library lets a program run tasks asynchronously: instead of waiting for one task to complete before starting the next, many operations can be in progress at the same time. This matters for web scraping because most of the time is spent waiting on page loads and other slow website interactions. Rather than blocking the entire program during each of those waits, asyncio lets you manage multiple page interactions at once.


The scraping automation task is handled by playwright.async_api. This is a modern browser automation tool that gives you programmatic control over browser instances such as Firefox or Chromium. With Playwright, you simply specify a web page to open, and the library takes care of loading it, simulating user actions such as scrolling down the page or clicking buttons, and retrieving the HTML content once those actions are complete. The async part refers to Playwright's ability to run inside an asyncio event loop, which lets it automate the scraping process more efficiently than a typical synchronous approach.


To make the browsing look more human-like and to decrease the chance of being blocked by a website, this project uses playwright_stealth, a library specifically designed to hide automation fingerprints. Many websites have anti-bot strategies in place to detect automation scripts, but playwright_stealth modifies properties and behaviors in the browser so that the session appears to be a regular user browsing the site.


Data storage is another key area. This project uses sqlite3 to save the scraped product URLs into a simple, local database. SQLite is simple to set up and use and requires no separate server, which makes it an easy way to manage and query URLs during the scraping workflow.


For larger or more complicated data storage, pymongo is also included in this project as a way to connect to MongoDB, a NoSQL database well suited to handling unstructured data. MongoDB works well for saving product information with many detailed fields that do not follow a consistent format, such as descriptions, pricing, and reviews, and it allows for flexible queries while remaining easy to extend.


A number of other standard Python libraries also support the process: logging provides an overall picture of what the scraper is doing by recording significant events and errors, which is helpful for debugging and monitoring. The random and time libraries add short, randomized delays between actions, which mimics organic browsing and helps avoid triggering a website’s anti-bot protection systems. The datetime and pathlib libraries help manage file paths and create timestamped files so that stored data stays organized.
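The configuration used later in this post sticks to fixed file names, so here is a small, hedged sketch of how datetime and pathlib could be combined for timestamped storage; the scraped_data directory and the file name pattern are assumptions, not part of the original scripts.

from datetime import datetime
from pathlib import Path

# Create an output directory (a hypothetical name) if it doesn't already exist.
output_dir = Path("scraped_data")
output_dir.mkdir(parents=True, exist_ok=True)

# Build a timestamped file name so each run writes to its own file,
# e.g. scraped_data/macys_products_2024-01-15_14-30-05.jsonl
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
jsonl_path = output_dir / f"macys_products_{timestamp}.jsonl"
print(jsonl_path)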


STEP 1: Extracting Product URLs from Macy's Women's Clothing Section


Importing Libraries

import asyncio

import logging

import sqlite3

from playwright.async_api import async_playwright

from playwright_stealth import stealth_async

Let’s begin by setting the foundation for our web scraping project. When you're trying to collect product details from large e-commerce websites, it's important to use the right tools that can handle the job smoothly and efficiently. That’s exactly what we’ve done in this script. We start by bringing in a tool called asyncio. Think of it like a multitasking manager—it helps the script open multiple pages and collect data at the same time, rather than doing one thing after another. This makes the process much faster and more efficient, especially when dealing with hundreds or thousands of product pages.


Next, we use something called logging. This is like keeping a diary of what the script is doing. It records important moments, like when a page is opened or when an error happens. When you're scraping a lot of data, having this kind of record is really helpful to understand what's going on and where things might go wrong. We also use sqlite3, which allows us to save the product information into a local database file. You can think of this as a storage box where every piece of scraped data is kept safely for later use. Another key tool we include is async_playwright from the Playwright library. This helps us control a browser through code, letting us visit pages and interact with them as if we were browsing manually—but much faster and more consistently.


To avoid getting blocked by websites, we use stealth_async. This adds a bit of camouflage to our script, making the browser look more like a real person is using it. It imitates human-like behavior, which helps keep the script under the radar. Together, all these tools create a strong starting point for our scraping process. They help us move quickly, stay organized, save data properly, and avoid drawing attention—just like a well-planned mission.


Tracking the Scraper’s Activity with a Clean Logging Setup

# Setup logging

logging.basicConfig(

   filename="macys_scraper_url.log",

   filemode="w",

   format="%(asctime)s - %(levelname)s - %(message)s",

   level=logging.INFO,

)

"""
This block sets up logging, which is useful for tracking the script's progress,

errors, and important events. Logs are saved in a file named 'macys_scraper_url.log'.

"""

When you're writing a script that needs to gather thousands of product links spread across multiple web pages, it’s not just about writing code that works. It’s equally important to keep track of what the script is doing while it runs. Think of it like watching over a delivery truck—you don’t just send it out, you also want updates on where it’s been and if anything went wrong. This is where logging comes in.


In the code, a simple but effective logging system is set up using Python’s built-in logging module. This system writes updates to a file named macys_scraper_url.log. So, each time something happens—like successfully collecting a link, skipping an item, or running into an error—it’s recorded in this file along with the exact time it occurred. The log file is cleared and refreshed every time the script starts, thanks to a setting called filemode="w". This ensures that the log always reflects the most recent run and doesn’t get cluttered with old data. Each log entry includes a timestamp, the type of message (like INFO for regular updates or ERROR if something breaks), and a short description of the event. By setting the log level to INFO, the script will record all important events without being too noisy. This becomes especially useful during long scraping sessions. If something goes wrong or seems off, you can just check the log file to see exactly what happened and when—almost like reading a travel diary your script kept while it was running.


Efficiently Storing Product URLs with a Lightweight SQLite Database

# SQLite setup

"""

Create a SQLite database and table for storing product URLs

"""

DB_PATH = "macys_products_url.db"

Before we start collecting data from a website, it’s important to have a proper place to store everything we gather. Think of it like setting up a clean folder before you begin a big research project—you want everything to go in the right place from the start. In our case, we’re using a simple and lightweight database called SQLite to save the product links (or URLs) we’ll be scraping from the website. We define a variable called DB_PATH, which tells our program what the database file is named and where to find it. For example, here it’s named "macys_products_url.db". You can imagine this file as a mini-warehouse where all the product URLs are safely stored before we do anything else with them.


Now, why use SQLite? The reason is simple—it’s easy to set up, doesn’t require a complicated installation, and can smoothly handle large amounts of data, even if you’re collecting thousands or tens of thousands of links. It keeps everything in one file, which helps avoid confusion and keeps the entire process tidy. With this setup, you’re not just collecting data randomly—you’re organizing it from the very beginning, making your job easier as you move on to the next steps.
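One note before moving on: the snippet above only defines the database path, while the save_url_to_db function shown in the next section inserts into a product_urls table. That table has to exist first, so a minimal initialization step, written here as an assumption that mirrors the CREATE TABLE statement used later in Step 2, might look like this:

import sqlite3

DB_PATH = "macys_products_url.db"

def init_db():
    """Create the product_urls table once, if it doesn't already exist."""
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    # INSERT OR IGNORE in save_url_to_db relies on this table and its primary key.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS product_urls (
            url TEXT PRIMARY KEY,
            processed INTEGER DEFAULT 0
        )
    """)
    conn.commit()
    conn.close()

init_db()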


Reliable URL Storage: Keeping Macy’s Data Clean and Organized

# database setup

def save_url_to_db(url):

   """

   Saves a single product URL to the SQLite database.


   """   

   try:

       conn = sqlite3.connect(DB_PATH)

       cursor = conn.cursor()

       cursor.execute("INSERT OR IGNORE INTO product_urls (url) VALUES (?)", (url,))

       conn.commit()

       conn.close()

       logging.info(f"Saved URL: {url}")

   except Exception as e:

       logging.error(f"Failed to save URL: {url} - Error: {e}")

Imagine you’re working with a massive list of items—like Macy’s entire collection of women’s clothing on sale, which includes over 4000 product pages. That’s a huge amount of data, and if you’re trying to collect all those web links (URLs), it’s really important to keep everything tidy and safe. You don’t want to lose links you’ve already collected, and you definitely don’t want to collect the same link more than once. That’s where this part of the code comes in—it helps store each product link neatly and reliably.


Here’s how it works: every time the scraper finds a product URL, this function takes that link and saves it into something called a SQLite database. Think of this database like a digital notebook—it stores your data in a table, and makes sure nothing gets written twice. To do this, Python uses a built-in tool called sqlite3, which connects to the database file. When it’s time to save a link, the code uses a special command—INSERT OR IGNORE. This little instruction tells the database, “Only save this if it’s not already there.” So if the same link comes up again later, it’ll be skipped without causing any problems.


Once the link is saved (or skipped), the code closes the database connection to keep things running smoothly and not use up unnecessary system resources. It also leaves a note in the log—kind of like a quick journal entry—saying whether the URL was saved successfully or if something went wrong. This whole process makes sure your data stays clean, avoids doubles, and most importantly, lets you pause and resume the scraping later without missing a beat. That’s a big win when you’re working with thousands of entries and multiple runs.


Automating Macy’s Product URL Extraction with Smart Scrolling and Pagination

# Main scraping logic

async def scrape_macys():

   """

   Scrapes product URLs from the Macy's women's clothing sale section.

   """

      

   async with async_playwright() as p:

       # Launch the Firefox browser in visible mode (headless=False)

       browser = await p.firefox.launch(headless=False)

       context = await browser.new_context()

       page = await context.new_page()

      

       # Use stealth mode to avoid getting blocked as a bot

       await stealth_async(page) 



        # Macy's clothing sale base URL

       base_url ="https://www.macys.com/shop/sale/womens-sale/womens-clothing-sale?id=338161"

       await page.goto(base_url, timeout=60000)



       # Try to close newsletter popup if it appears

       try:

           await page.locator('button:has-text("No, Thank You!")').click(timeout=5000)

           logging.info("Closed newsletter popup")

       except:

           logging.warning("Newsletter popup not found")



       # Try to accept cookies if the cookie banner appears

       try:

           await page.locator('button:has-text("Accept Cookies")').click(timeout=5000)

           logging.info("Accepted cookies")

       except:

           logging.warning("Cookie banner not found")



       current_page = 1

       max_pages = 999  # You can limit this or detect dynamically

     

       # Loop through pages until there are no more

       while True:

           logging.info(f"Processing page {current_page}")

          

         # Scroll multiple times to load all products (simulate lazy loading)

           for i in range(10):

               await page.mouse.wheel(0, 2000)

               await page.wait_for_timeout(1000)



           # Extract product URLs

           product_links = await page.locator('div.description-spacing a.brand-and-name').all()

           for link in product_links:

               href = await link.get_attribute("href")

               if href:

                   full_url = "https://www.macys.com" + href

                   save_url_to_db(full_url)

           """

           Finds all anchor (<a>) tags within a container that has the class 'description-spacing'.

           """

          

           # Try to go to next page

           try:

               next_button = page.locator('li a.next-button')

              

               if await next_button.is_visible() and await next_button.is_enabled():

                   await next_button.click()

                   await page.wait_for_load_state("domcontentloaded")

                   current_page += 1

                   await asyncio.sleep(3)

               else:

                   logging.info("No more pages. Exiting.")

                   break

           except Exception as e:

               logging.error(f"Error clicking next page: {e}")

               break

       # Close the browser after scraping is done

       await browser.close()

This piece of code contains the main logic for scraping product URLs from Macy's women's clothing sale pages, ultimately building a list of over 17k product links. The code first launches a web browser in headful (visible) mode using Playwright, a modern automation library; running headfully makes debugging easier if the code does not behave as intended. Because the website could identify the activity as a bot, a 'stealth mode' is applied that disguises some of the indicators that reveal the session is automated rather than driven by a human. Once the browser opens the Macy's sale page, the code handles common interruptions such as popups: it attempts to close newsletter signups and accept cookie banners, which are common on retail sites and might otherwise interfere with the browsing flow.


The core of the scraping process is loading products dynamically via scroll simulation. The Macy's website uses lazy loading, meaning that products only become available as the user scrolls down the page. To account for this, the script repeatedly scrolls down the page, giving the Macy's site enough time to load all of the available product listings. After scrolling, the program extracts product links by searching for specific HTML elements—the anchor tags found within the container (with certain classes) that describes each product listing. The program also converts each relative link to an absolute one by attaching Macy's base domain, ensuring the links remain usable on their own later.


To cover the entire catalog, the script navigates through multiple pages by clicking the “Next” button, continuing the process until no more pages remain or a preset maximum page limit is reached. Each time a new page loads, the same scroll-and-extract routine repeats. This loop ensures that product URLs across all sale pages are collected systematically. Finally, once all pages are processed, the browser closes gracefully to free up system resources.


In total, this systematic scraping strategy creates a comprehensive set of product URLs that can be used for downstream analysis or for scraping detailed product data like prices, descriptions, and reviews. The combination of stealth browsing, popup management, dynamic content loading, and careful pagination handling results in a robust and efficient solution for extracting large amounts of data from modern, interactive e-commerce sites like Macy's.


Let the Scraping Begin: Running the Async Workflow

# ENTRY POINT TO START SCRAPING

asyncio.run(scrape_macys())

"""

This line actually starts the entire scraping process

by calling the 'scrape_macys' function

using asyncio

"""

Every time we build an asynchronous web scraper, we need a way to kick things off—that's where this line comes in: asyncio.run(scrape_macys()). Think of it as pressing the “Start” button for the entire scraping process. When this line runs, it hands control over to the scrape_macys function, which holds the main steps for getting data from the website.


Now, there’s something special about this function: it’s asynchronous. That just means it doesn't run like a regular function. Instead of going step-by-step in a straight line, it can pause and resume—kind of like a multitasker that knows when to wait and when to jump to the next task. But for that to happen smoothly, it needs something called an “event loop,” and asyncio.run() sets that up for us automatically.

By doing things asynchronously, our scraper can work much faster and smarter. For example, while it’s waiting for one page to load, it can go ahead and start grabbing data from another. This avoids wasting time just sitting around for slow responses. So in simple terms, that one line of code is what gets everything moving—it tells our scraper, “Alright, go ahead and start working,” and from there, it takes over and begins handling thousands of product pages efficiently and in order.


Step 2: Scraping Full Product Information from Individual Links


Importing Libraries

import asyncio

import json

import random

import sqlite3

import logging

from time import sleep

from datetime import datetime

from pathlib import Path

from pymongo import MongoClient

from playwright.async_api import async_playwright

import re

The initial lines of code may look like a huge list of imports, but each serves a particular function that is crucial in assisting us to extract and store product information without any issue.


We kick things off with asyncio. This little tool is what lets our script handle multiple tasks at the same time. Since scraping information from hundreds or even thousands of product pages can take a while, we use something called asynchronous programming. It’s like giving our script the ability to multitask—doing several things at once—so it doesn’t waste time just waiting around for pages to load.


Next comes json. Imagine you want to store product details like names, prices, and descriptions in a way that's neat and easy to read. JSON helps with exactly that—it’s a simple way to organize and move around structured data. Then we add a bit of randomness and short pauses between tasks using random and sleep. This makes our script behave more like a human rather than a robot, which helps avoid getting blocked by the website we’re scraping.


After that, we bring in sqlite3, which gives us a lightweight database to store all the product links we collect. The best part? It doesn't need any special setup or server—it just works right out of the box. Alongside that, we use logging, which acts like a diary for our script. If anything goes wrong or doesn’t work as expected, we can check the logs to figure out what happened.


To keep track of when things happen, we use datetime. It helps us add timestamps to our saved files or log entries, so we always know when something was done. For dealing with files and folder paths smoothly, we bring in Path from the pathlib library—it makes working with file locations cleaner and simpler.


Now, while sqlite3 is great for smaller projects, we also include pymongo for more flexibility. It connects us to a bigger, more powerful database called MongoDB. This is especially useful when we’re dealing with data that doesn’t fit neatly into a table—like product info that changes from one item to another. And then, we have the real workhorse of our scraper: playwright.async_api. Playwright is the tool that allows our script to behave like a person browsing the Macy’s website. It can scroll, click, and wait for content to appear, just like we would. Since we’re using the async version of Playwright, it fits perfectly with asyncio, making the whole process fast and efficient.


Lastly, we import re, which stands for regular expressions. Think of it like a smart search tool—it helps us pull out specific pieces of information from messy HTML pages, like finding just the price or product name in a sea of code.


So, all these imports together form a strong foundation for our scraper. They help it move through the website, collect data smartly, store it properly, and make sure everything runs as smoothly as possible—even when we’re working with a massive number of products.


Filtering the Noise: How is_valid_price() Keeps Price Data Clean

# HELPER FUNCTION: CHECK VALID PRICE FORMAT

def is_valid_price(text):

   """
   This function checks if the provided text looks like a valid price in INR format.


   """

   return bool(re.search(r'INR\s[\d,]+\.\d{2}', text))

Let’s talk about a small yet very helpful part of our web scraping process—a little function called is_valid_price(). Even though it’s not flashy or complex, it plays a big role in keeping our data clean and accurate. Its job is pretty straightforward: it checks if a piece of text actually looks like a real price, specifically in the Indian Rupee format—like “INR 1,299.00”. Now, you might wonder why we even need this. When we scrape a webpage, we often collect all kinds of text by accident. Along with actual prices, we might also pick up labels like “SALE” or “NOW”, or numbers that aren't really prices at all. This is where is_valid_price() steps in. It helps us separate the real price information from everything else.


Behind the scenes, this function uses a technique called regular expression (or regex). Think of regex as a smart filter—it lets us describe the exact pattern we’re looking for. In this case, the pattern is designed to spot prices that match the INR style.


So, in simple terms, is_valid_price() is like a careful editor. It doesn’t just collect anything that looks like a number—it makes sure we’re only keeping the real deal. That way, when we analyze our data later, we’re working with clean, reliable information instead of a messy pile of random text.
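As a quick illustration of how this filter behaves, here is a small, self-contained example using a few made-up sample strings; the outputs in the comments follow directly from the regex pattern above.

import re

def is_valid_price(text):
    # Same pattern as above: "INR", a space, digits/commas, a dot, and two decimals.
    return bool(re.search(r'INR\s[\d,]+\.\d{2}', text))

print(is_valid_price("INR 1,299.00"))    # True  - matches the INR price pattern
print(is_valid_price("Now INR 499.50"))  # True  - the pattern can appear mid-text
print(is_valid_price("SALE"))            # False - no price present at all
print(is_valid_price("INR 499"))         # False - missing the two decimal places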


Scraper Setup Essentials: Databases, User-Agent, and Logging in One Place

# CONFIGURATION

SQLITE_PATH = "/home/anusha/Desktop/DATAHUT/Macys_clothing/macys_products_url.db"

USER_AGENTS_PATH = "/home/anusha/Desktop/DATAHUT/Macys_clothing/user_agents.txt"

JSONL_PATH = "macys_products_data.jsonl"

MONGO_URI = "mongodb://localhost:27017"

DB_NAME = "macys"

COLLECTION_NAME = "products"

LOG_FILE = "macys_scraper.log"


"""

Configuration variables for the scraper

"""

At the very beginning of our script, we define a few important file paths and settings. Think of these like setting up your workspace before starting a task—putting your tools where you can reach them easily.


We start with something called SQLITE_PATH. This is simply the path to a local database file that holds all the product URLs we plan to visit. You can imagine it like a to-do list. Instead of writing URLs on a sticky note, we save them in this database so our script knows exactly where to go when it’s time to collect product details.


Then there’s USER_AGENTS_PATH. This points to a text file that holds a list of different user agents. Now, in simple terms, a user agent is like a mask that tells a website what kind of device and browser is visiting—whether it’s a Chrome browser on a laptop, or Safari on an iPhone. By switching these user agents, our scraper can blend in and avoid raising suspicion or getting blocked. It’s like changing disguises while exploring a secure building—you just want to quietly gather information without being noticed.


Next, we have JSONL_PATH. This is the file where we’ll save all the data we scrape, and it’s stored in a format called JSON Lines. That just means we store one product’s data per line. This way, the file stays organized, even when we’re collecting information about thousands of products. It also makes it easier to go through later, especially if we want to load or process the data again.


Now let’s talk about the MongoDB settings. MONGO_URI is like the home address of our database; it tells the script where to send the data (in this setup, a MongoDB server running locally). Then we have DB_NAME and COLLECTION_NAME. These are like the specific apartment and room where we want to drop off the information. For our case, we’re placing all the Macy’s product data into a database called “macys,” inside a collection named “products.”

Finally, there's the LOG_FILE. This file acts like a behind-the-scenes journal. Every time the scraper does something—whether it visits a URL, saves some data, or runs into an error—it writes it all down in this file. That way, if something goes wrong or we want to check what happened during the run, we can just look at this log. It’s incredibly helpful when troubleshooting or tracking performance over time.


By gathering all these settings in one place at the start of our script, we make everything easier to manage. If we want to change something—like use a new database, switch out our user agents, or save data in a different place—we only need to update it here. That’s especially useful when you’re working with massive projects like this Macy’s sale scrape, where we’re handling over 4000 product links. A little organization at the beginning goes a long way in keeping everything smooth and under control.


Tracking the Scraper’s Activity with a Logging Setup

# SETUP LOGGING 

logging.basicConfig(filename=LOG_FILE,

                   level=logging.INFO,

                   format="%(asctime)s - %(levelname)s - %(message)s")

"""

Configure logging to track the script's execution

"""

Before we start collecting thousands of product links from Macy’s women’s clothing sale section, it’s important to set up a way to keep track of what our script is doing. Think of it like keeping a journal while on a big trip—you want to remember where you went, what worked well, and where things got tricky. That’s exactly what logging helps us do in our Python script.


In this project, we’re using Python’s built-in logging module to record important information while the scraper runs. It tells us when the script is working as expected and, more importantly, when something doesn’t go as planned. To set this up, we use a function called logging.basicConfig(), which lets us decide how our logs should be stored and what kind of messages we want to see.


We choose a specific file location (called LOG_FILE) where all the messages will be saved. This is helpful because, later on, you can open that file and see a detailed record of what happened during the scraping. We also tell the logger to include useful messages by setting level=logging.INFO, which ensures we capture events like which page got scraped or if the script skipped a product. Lastly, we set a format for each message that includes the time it happened, the type of message (like INFO or ERROR), and the message itself. With this setup in place, even if we’re scraping over 4000 pages, we’ll have a clear trail of everything the script did—making it much easier to understand what went right and to fix anything that didn’t.


Mimicking Real Users: A Simple Trick to Bypass Scraper Detection

# LOAD USER AGENTS 

with open(USER_AGENTS_PATH) as f:

   USER_AGENTS = [line.strip() for line in f if line.strip()]


"""

Load a list of user agents from a text file:


"""

When building a reliable web scraper for Macy’s women’s apparel sale section, one important challenge was making sure the scraper didn’t get blocked by the website. Websites often have systems in place to detect when a bot—rather than a real person—is trying to access their content. One common way they catch bots is by noticing if every request comes from the same browser or device again and again.


To avoid raising any red flags, we used a simple but effective trick. We gave the scraper the ability to pretend it was using different browsers. This is done using something called a user agent. Think of a user agent as a small piece of information your browser automatically sends when you visit a website. It tells the site what kind of browser you’re using (like Chrome or Firefox) and what kind of device you’re on (like a Windows laptop or an iPhone).


So, we created a plain text file that had a list of different user agent strings. Then, using Python, we read this file line by line. We cleaned up the lines—removing any empty spaces or blank lines—and stored the final list in a variable. Now, instead of always using the same user agent, our scraper randomly picks one from this list each time it sends a request. This makes it look like different people are visiting the site, which helps us stay under the radar and avoid getting blocked.


Preparing the Database Table to Store and Manage Macy’s Product Links

# DB SETUP

conn = sqlite3.connect(SQLITE_PATH)

cursor = conn.cursor()

cursor.execute("""

CREATE TABLE IF NOT EXISTS product_urls (

   url TEXT PRIMARY KEY,

   processed INTEGER DEFAULT 0

)

""")

conn.commit()

"""

Set up SQLite database connection and ensure the product_urls table exists


"""

When you're trying to collect thousands of product links from a website, things can get messy really fast if you don’t have a proper system to keep track of what you've already collected. That’s where using a small, local database like SQLite comes in handy—it’s lightweight, fast, and doesn’t need any complicated setup.


In the code we’re looking at, the first thing it does is open a connection to a database file (that’s what the SQLITE_PATH is for—it simply tells the program where the database is saved). Once it’s connected, it checks whether a table named product_urls already exists inside that database. If the table isn’t there yet, it creates one with two simple columns: one called url and another called processed.


The url column is used to store the actual product links. It’s also marked as the “primary key,” which is just a way of saying “no duplicates allowed”—if the same URL shows up twice, only the first one will be saved. The second column, processed, acts like a little flag. It keeps track of whether a link has been scraped or not. Every new link starts off as “not yet processed” (which is marked with a zero), and once the scraping is done, the status can be updated.


This setup is really helpful, especially if you’re working with a massive list. Having this database means you can stop the process midway and pick up where you left off later, without losing track or doing the same work again. Once the table is created and ready, the code saves the structure so that it’s always available during the scraping process.


Keeping It Organized: Retrieving Unprocessed Product URLs

# FETCH UNPROCESSED URLS 

def get_unprocessed_urls():

   """
   Retrieve URLs from the database that haven't been processed yet.

   """

   cursor.execute("SELECT url FROM product_urls WHERE processed = 0")

   return [row[0] for row in cursor.fetchall()]

When you're working on a task like product scraping—basically collecting information from different product pages—it’s important to keep track of where you’ve been and where you still need to go. Otherwise, you might waste time scraping the same page over and over, or miss some pages altogether. That’s where the get_unprocessed_urls function comes in. Think of it like a checklist that helps your scraper know which links are still waiting to be visited.


This function looks inside a small local database, stored using SQLite, which keeps a list of all the product URLs. Each URL in the database has a label attached to it—a flag that says whether it has been processed or not. If the flag is set to 0, it means the scraper hasn’t touched that URL yet. So, when this function runs, it quickly grabs only those untouched links from the database and hands them over as a neat list.

By using this approach, the scraper can pick up right where it left off—even if it was stopped in the middle for some reason. It avoids redoing the same work and keeps things running smoothly, especially when scraping thousands of pages. It’s a simple trick, but incredibly useful for staying organized and efficient during big scraping jobs.


Marking URLs as Done to Keep Scraping Efficient

# Mark a URL as processed

def mark_url_processed(url):

   """

   Mark a URL as processed in the database.



   """   

   cursor.execute("UPDATE product_urls SET processed = 1 WHERE url = ?", (url,))

   conn.commit()


Once a product page has been successfully scraped, the next thing we want to do is make sure we don’t scrape the same page again. That’s exactly what the mark_url_processed function helps us with. Think of it like checking off a task on a to-do list. After we’ve collected the product’s details, this function steps in and updates our local SQLite database to say, “Hey, we’ve already taken care of this one.” It does this by switching a flag in the database from 0 to 1, which simply means the URL has been processed.


This small action plays a big role in keeping everything running smoothly. Imagine trying to go through a long list of product pages, but you keep landing on the same ones over and over—it would waste time and energy. That’s what we’re avoiding here. By keeping track of what’s done, the scraper can move forward efficiently without repeating steps. And here’s another benefit: if the script suddenly stops or crashes for some reason, it can pick up right where it left off. Since we’ve marked the completed URLs, there’s no need to start all over again. It’s like leaving a bookmark in a long book—so the next time you open it, you know exactly where to continue.


Saving Scraped Data the Smart Way with JSONL

# SAVE DATA

def save_to_jsonl(data):

   """
   Save product data to a JSONL (JSON Lines) file.

   """

   with open(JSONL_PATH, "a", encoding="utf-8") as f:

       f.write(json.dumps(data) + "\n")

When you're collecting product information through web scraping—especially when there are thousands of items and new data coming in all the time—it’s really important to save that information in a smart and organized way. That’s where the save_to_jsonl function comes in. This function helps by writing each product’s details into a file called a JSONL file. If you’re not familiar with it, JSONL (which stands for JSON Lines) is just a file format where each line holds one complete product entry, written in a format that computers can easily understand.


Think of it like adding new pages to a notebook—each product gets its own line, and nothing gets erased when new information is added. That’s because the function opens the file in “append mode,” which means it simply adds to the end of the file instead of starting over. It also makes sure the product details, which come in as Python dictionaries, are properly turned into neat little JSON strings before saving them. This approach makes life a lot easier when you're dealing with large amounts of data—it keeps things clean, easy to work with, and reliable. Even if something goes wrong in the middle of scraping, you don’t lose the earlier data, and you can pick up right where you left off. That’s the kind of system you want when your project grows bigger and needs to handle more data without breaking.
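To show how that safety net is used, here is a short sketch of reading the backup file back into memory after an interrupted run; it assumes the macys_products_data.jsonl file configured earlier.

import json

# Re-load every product record saved so far, one JSON object per line.
records = []
with open("macys_products_data.jsonl", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip any blank lines
            records.append(json.loads(line))

print(f"Recovered {len(records)} product records from the backup file.")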


Making Data Storage Simple and Scalable with MongoDB

# Save product data to MongoDB

def save_to_mongo(data):

   """
   Save product data to MongoDB.
   """

   client = MongoClient(MONGO_URI)

   db = client[DB_NAME]

   collection = db[COLLECTION_NAME]

   collection.insert_one(data)

Once we’ve pulled out the necessary details from a product’s webpage—like its name, price, or brand—we need a place to store that information safely for later use. That’s where the save_to_mongo function comes in. Think of it like a digital filing cabinet. This function helps us store all the product data neatly into something called MongoDB, which is a type of database used to keep information organized and easy to access.


Whenever we gather new product data, this function steps in to save it in MongoDB as a “document”—which is just a structured way of storing information, similar to a filled-out form. It knows exactly where to put the data, connecting to the correct part of the database, known as a collection. Each time we use the function, it adds a new document with the latest details.


This setup isn’t just about keeping things safe—it also makes our future work easier. Want to look up only discounted products? Or filter by brand or price? With the way the data is stored, you can run such searches quickly and easily. And because this function works in a self-contained way—meaning it does its job without needing help from other parts of the code—it stays simple and efficient. Over time, this small but powerful piece becomes a helpful tool in handling large scraping projects, keeping all our data clean, searchable, and ready for deeper analysis.
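To illustrate the kind of look-ups mentioned above, here are a couple of hedged example queries against the same database and collection; the field names follow the scraper's output shown later, and the brand filter is just a made-up example.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["macys"]["products"]

# Count products where a discount value was actually captured.
discounted = collection.count_documents({"discount": {"$ne": None}})
print(f"Products with a discount value: {discounted}")

# Fetch a few items from one brand (the brand name here is only an example).
for doc in collection.find({"brand": "Calvin Klein"}).limit(5):
    print(doc.get("name"), "-", doc.get("mrp"))

client.close()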


How Random Wait Times Help Avoid Bot Detection

# RANDOM WAIT 

def wait_random_delay():

   """

   This function introduces a random delay between requests to make the

   scraping pattern less predictable and reduce the risk of being blocked.


   """



   delay = random.uniform(5,10)

   logging.info(f"Waiting for {delay:.2f} seconds...")

   sleep(delay)

When you're scraping data from a website, it's important not to make your script act too much like a robot. Websites are pretty smart these days—they can spot unnatural behavior, like clicking too fast or loading pages without any pause. That’s where a bit of randomness can help make things look more human.


To do this, we use something called the wait_random_delay() function. It simply pauses the script for a random amount of time—say, somewhere between 5 and 10 seconds—before moving on to the next step. This small pause gives the impression that a real person is browsing, taking a moment to read or look around before clicking again. It’s a bit like mimicking how you or I might casually scroll through a website. This random wait isn’t just for show—it also gives the website’s servers a breather, especially if we’re collecting data for a long time. That means we’re being more polite to the site and less likely to get blocked or flagged as a bot. To keep track of what’s happening behind the scenes, the script logs how long each delay lasts. That way, if something seems slow or gets stuck, we can look back and understand what happened. These kinds of thoughtful touches may seem small, but they really do help. They make your scraping more stable, more respectful, and more likely to succeed without drawing unwanted attention.


How the Product Data Scraper Gathers and Organizes Detailed Macy’s Product Information

# SCRAPE FUNCTION 

async def scrape_product_data(page, url):
    """
    Scrape only: name, brand, mrp, discount from a Macys product page.
    """

    try:
        await page.goto(url, timeout=60000)
        await page.wait_for_timeout(3000)

        # Accept cookie popup
        # Handle the cookie consent popup if it appears.
        try:
            await page.wait_for_selector("#onetrust-accept-btn-handler", timeout=7000, state="visible")
            await page.locator("#onetrust-accept-btn-handler").click()
            logging.info("Clicked Accept Cookies")
            await page.wait_for_timeout(2000)
        except Exception:
            logging.info("No cookie popup detected.")

        # Check Access Denied
        if "Access Denied" in await page.content():
            logging.warning(f"Access Denied: {url}")
            return None

        # Safe getter helpers
        async def safe_get(selectors):
            if isinstance(selectors, str):
                selectors = [selectors]
            for selector in selectors:
                try:
                    text = (await page.locator(selector).inner_text()).strip()
                    if text:
                        return text
                except Exception:
                    continue
            return None

        # Extract only four fields: name, brand, original price (mrp), and discount
        name = await safe_get("span.body")
        brand = await safe_get(".updated-brand-label > a:nth-child(1)")
        mrp = await safe_get(".extra-price")   # original price
        discount = await safe_get(".font-weight-sm")

        # If name or brand is missing, log a warning (the record is still returned)
        if not name:
            logging.warning(f"Name missing for: {url}")
        if not brand:
            logging.warning(f"Brand missing for: {url}")

        data = {
            "url": url,
            "name": name,
            "brand": brand,
            "mrp": mrp,
            "discount": discount,
            "scraped_at": datetime.now().isoformat()
        }

        return data

    except Exception as e:
        logging.error(f"Error scraping {url}: {e}")
        return None

Scraping product details from a website like Macy’s isn’t as easy as just opening a page and grabbing the text. Websites today are more complex—they include pop-ups, interactive elements, and layouts that don’t always follow the same rules. That’s why the scrape_product_data function was created. It carefully handles these challenges in a step-by-step way.


This function uses a tool called Playwright, which helps us control a web browser automatically. It’s written as an "async" function, which means it can wait for certain tasks—like page loading—to finish before moving on. When the function starts, it opens the product’s page and pauses briefly to make sure everything loads completely. During this wait, it also checks for cookie consent pop-ups that often appear the first time you visit a site. If the "Accept" button shows up, the function clicks it; if not, it just continues.


One important thing it looks out for is whether the website has blocked access. Sites sometimes do this if they think a bot is visiting. So the script checks for a message like “Access Denied.” If it finds that message, it skips the page, logs the issue for reference, and moves on without crashing or stopping the whole process.


Next, the function collects data using a small helper called safe_get. This helper is smart—it can try several different selectors for the same piece of information, so if one fails, another might work, and it quietly handles cases where something is missing or not in the expected place. In the trimmed-down excerpt shown above, the scraper pulls out four key details: the product’s name, brand, original price, and discount. It puts these neatly into a Python dictionary, along with the product URL and the date and time the data was collected. The full scraper extends the same pattern with additional selectors to capture the other fields described earlier, such as sizes, reviews, and descriptions.


What makes this scraping setup really strong is its ability to handle errors without crashing. If something unexpected happens—like a network issue or a sudden change in the webpage layout—it simply logs the problem and returns nothing for that one item. The rest of the process keeps running. This makes it possible to collect clean and structured data from thousands of product pages without constant supervision. Whether you’re building a tool to track discounts, analyze product trends, or recommend items to shoppers, this function gives you a solid and reliable foundation.


How the Main Function Efficiently Handles Large-Scale URL Scraping

#  MAIN RUNNER 

async def main():

   """

   Main function to orchestrate the scraping process.



   """



   urls = get_unprocessed_urls()

   if not urls:

       logging.info("No unprocessed URLs found.")

       return



   logging.info(f"Found {len(urls)} URLs to process.")



   for url in urls:

       wait_random_delay()

       user_agent = random.choice(USER_AGENTS)



       async with async_playwright() as p:

           browser = await p.chromium.launch(headless=False)

           context = await browser.new_context(user_agent=user_agent)

           page = await context.new_page()



           logging.info(f"Processing URL: {url}")

           data = await scrape_product_data(page, url)



           if data:

               save_to_jsonl(data)

               save_to_mongo(data)

               mark_url_processed(url)

               logging.info(f"Scraped and saved data for: {url}")

           else:

               logging.warning(f"Skipped: {url}")



           await browser.close()

The main() function is the central engine that drives the entire scraping workflow. Think of it as the conductor of an orchestra, guiding each step from beginning to end in a smooth and organized way. Its job is to go through a long list of product web pages—sometimes even thousands of them—and collect important details from each one, reliably and efficiently.


It all begins by checking a local database (specifically, an SQLite file) to see which product URLs haven’t been scraped yet. If any are still waiting to be processed, the function takes note of how many are left and gets to work. For every product page, it introduces a small, random pause before proceeding. This pause mimics how a human might browse, which helps avoid getting blocked by the website for suspicious activity. Next, a fresh browser window is opened using automation tools, and a random user agent is applied. A user agent is like a browser’s ID badge, and changing it helps disguise the scraper so it doesn’t get recognized or restricted.


The scraper then visits the product page and carefully picks out key information such as the product’s name, brand, price, and size options. This information is saved in two places. First, it goes into a lightweight .jsonl file, which is easy to handle and perfect for quick reference. Second, the data is stored in MongoDB, a powerful database that works well when managing huge volumes of information. Once the scraping for that page is successful, the URL is marked as "1" in the database so it won’t be visited again.


Finally, the browser window is closed, and the function moves on to the next product page. By handling each page one at a time, the process stays clean, manageable, and much less likely to trigger any website security systems—a key benefit when dealing with such a large collection of pages.


How the Script Begins and Handles Interruptions Gracefully

# Script entry point

if __name__ == "__main__":

   try:

       asyncio.run(main())

   except KeyboardInterrupt:

       logging.info("Scraping interrupted by user.")



Every scraping project needs a place where everything begins—a sort of “start button” for the whole process. In Python scripts, this is usually found right at the bottom of the file. You’ll often see a line that looks like if __name__ == "__main__":. This line has an important job: it makes sure that the script only runs when you execute it directly. If someone imports this script as a helper in another program, it won’t start running on its own, which is exactly what we want.


Now, inside this special block, we usually kick off the main part of our code. Here, it's done with asyncio.run(main()), which starts an async function called main(). This is where most of the scraping logic lives—the part that actually visits websites, collects data, and does all the heavy lifting.


Conclusion


Scraping Macy’s sale section using Playwright and Python demonstrates how automation, smart tool selection, and thoughtful data handling can transform a large, complex website into a structured and insightful dataset. Along the way, we used powerful tools like Playwright for browser automation, SQLite and MongoDB for storage, and helpful Python libraries to clean and manage the data.


What’s important here is that this isn’t just about scraping a website—it’s about doing it the right way: carefully, responsibly, and in a way that keeps your data clean, organized, and easy to use later. Whether you’re analyzing pricing trends, comparing brands, or studying how discounts work, having a solid dataset makes it all possible.

So, if you’re looking to dive into web scraping for real-world e-commerce data, Macy’s is a great place to start—and with the right tools and methods, it’s entirely achievable. With this guide and approach, you’re not just collecting data—you’re building the foundation for insights that can power smarter decisions.


FAQ SECTION


1. Is it legal to scrape Macy’s sale section?

Web scraping public data is generally legal, but it must comply with Macy’s Terms of Service, robots.txt, and local data protection laws. Scraped data should be used responsibly, avoiding personal data collection or excessive request rates that could impact site performance.


2. Why use Playwright instead of BeautifulSoup or Requests for scraping Macy’s?

Macy’s sale pages are JavaScript-rendered, meaning product details load dynamically. Playwright can execute JavaScript, handle lazy loading, pagination, and simulate real browser behavior—capabilities that traditional tools like Requests or BeautifulSoup alone cannot provide.


3. What data can be extracted from Macy’s sale section?

Using Python and Playwright, you can scrape product names, sale prices, original prices, discount percentages, availability status, product URLs, images, and category tags from the Macy’s sale section.


4. How can I avoid getting blocked while scraping Macy’s?

To reduce blocking risks, implement headless browser controls, rotate user agents, add request delays, handle cookies properly, and avoid sending too many concurrent requests. Playwright also helps mimic human browsing behavior, improving scrape reliability.


5. Can this scraping method be scaled for regular price monitoring?

Yes. The Playwright-based approach can be scaled using task queues, scheduled runs, proxy integration, and cloud deployment. This makes it suitable for price tracking, discount monitoring, and competitive analysis at scale.


AUTHOR

I’m Anusha, a Data Science Intern at Datahut. I work on automating data collection and transforming unstructured retail data into meaningful insights using tools like Playwright, MongoDB, and pandas.


At Datahut, we help businesses in retail and e-commerce unlock the power of web data using scalable scraping solutions and smart automation. In this blog, I walk you through how we extracted and structured thousands of product listings from Macy’s women’s clothing sale section—enabling deeper analysis of pricing trends, inventory patterns, and product visibility.


If you're looking to scale your data collection efforts or need help turning raw web data into business-ready insights, connect with us through the chat widget on the right. We’d love to collaborate.

Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
